When to use Excel, when to use R? (michaelmilton.net)
84 points by rams on April 17, 2010 | 43 comments



I would stress that with Excel, changing the underlying data set, especially the dimensions of your data, is annoying and hard to automate. This is the case even when the analysis is exactly the same as, or very similar to, the original. You either need to copy formulas over for every data set/sheet or write VBA code to automate it, which still isn't as flexible as R.

Excel is better if you want to pore over the data: study a piece, scroll down, study some more. R is better if you can easily pinpoint the areas you think are interesting and index directly into the array. Excel has indexing capabilities (F5 jumps to an address), but not by row/column names.

Additionally, you can't easily do multi-step analysis that involves segmentation or sub-setting of the data in Excel. This is related to the first point about changing an underlying data set or its dimensions. There are ways around this, but they are VERY annoying.
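To make the point concrete, here is a sketch of the kind of multi-step segment-then-aggregate analysis being described. It uses Python/pandas rather than R (the data and column names are hypothetical), but the workflow is the same: in a script, re-running against a data set with different dimensions needs no formula copying.

```python
import pandas as pd

# Hypothetical toy data set.
sales = pd.DataFrame({
    "region": ["west", "west", "east", "east", "east"],
    "rep":    ["ann", "bob", "ann", "cid", "bob"],
    "amount": [100, 250, 80, 120, 300],
})

# Step 1: subset to one segment. Step 2: aggregate within it.
west = sales[sales["region"] == "west"]
per_rep = west.groupby("rep")["amount"].sum()

# Swapping in a larger data set just means re-running the script.
print(per_rep)
```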

I use R for the majority of my analysis. I used to be a Matlab user but needed to find a cheaper tool when I left my job. I still use Excel for simple data inspection and when I need to send my analysis to other people to peruse.


"I would stress that with Excel, changing the underlying data set, especially dimensions of your data is annoying and not very automate-able. This is the case even when the analysis is the exact same or very similar as the original. You either need to copy formulas over for every data set/sheet or write VBA code to automate this, which still isn't as flexible as in R."

In certain circumstances you are right. Reusing the same spreadsheet for a very, very similar task is often annoying. This can be avoided if you know from the start that you want to make something reusable, but sadly that isn't always the case. There are ways of automating this (VBA as you mentioned, or DB connections when appropriate, sometimes UDFs, or just organizing the data the right way), but it is definitely one of my major annoyances.

"Excel has indexing capabilities (F5 which goes to an address), but not by row/column names."

Actually with the name manager you can do this. Simply name the row(s) and column(s) you want to index by, press F5, and type "my_col,my_row" and you are there. It even does the "and" selection for you, which you can easily exit out of if you want. Not sure if it is in versions prior to Excel 2010, but I don't really find myself using this all that much.

"Additionally, you can't do multi-step analysis that involves segmentation/sub-setting of the data very easily with Excel. This is related to the first point about changing an underlying data set or its dimensions. There are ways around this, but they are VERY annoying."

Your first statement is only correct if you mean "you can't do some multi-step analyses," because some of them are just plain simple. Also, I don't find the occasional annoyances (which typically have to do with using match(), vlookup(), cross_product(), or similar functions) anywhere near as annoying as some of the other things I have to deal with when programming. Dates can be a pain in the ass. Even in Ruby.

I'll be clear in saying I have no idea what R can do. I've only toyed with it due to lack of time, but I do know that Excel can do a bunch of really cool things: Solver, remote UDFs written in any programming language (including Ruby running on a Linux box), forms for pointy-haired bosses, pretty awesome charts if you know what you are doing, cubes, goal seek, data tables, macros, formula auditing, DB connections, etc.

What I don't like is the general vibe that Excel is only for MBAs and R is only for comp sci kids. That just isn't true, and I don't have to know anything about R to make that claim, because I know some comp sci kids get a ton of value out of Excel some of the time. Right tool, right job.


The arguments are good and sound, but could be simpler:

If you can "grasp" your data by looking at it, use Excel. If there is too much data to look at, use R.

Excel lends itself more to shallow exploratory approaches, while with R, you have to think first (or rather: always).


What about a graphical front end to R? I wonder if 90% of R's use cases could be represented as a flowchart diagram, which would then generate R code. One could have the app automatically re-run the script when the diagram changes and render graphs and other output. I was thinking of having a "remote control" for such an app on the iPad, with the ability to use the keyboard/mouse and script certain things by hand to cover the more esoteric 10%.



How about rattle? http://rattle.togaware.com/


I'm not entirely sure what kind of analysis you are thinking about. Statistics in R, for me, entails a lot of human intuition, interpretation, and decisions. You probably don't want a script building your models for a confirmatory factor analysis, for example.


You mean Gnumeric!


Please don't use spreadsheets for statistical analysis (even the simple kind that the spreadsheet claims to support).

http://www.jstatsoft.org/v34/i04/paper


I'm just starting to go beyond Excel, but I chose to learn Python and SciPy over R after reading that R's scripting language didn't match up to Python's. Am I handicapping myself by ignoring R?


You could always use Sage if you want scientific computing in Python. It makes interacting between different libraries like SciPy, NumPy, and a ton of other math libraries pretty straightforward. You can even choose how your numbers are represented if you want. It also has rpy.

www.sagemath.org


I don't think you are handicapped, but it would be good to familiarise yourself with R. R as a language may not match up to Python, but it is quite powerful. You can do a lot in a few lines of code. And I think it's quite easy to pick up the basics.


R is not such a terrible language, though it is quirky. Unfortunately, yes, you are kind of handicapping yourself. R's libraries are, for statistical purposes, far better than Python's.


Anyone know if RPy is any good?

http://rpy.sourceforge.net/


Have you figured out how to set up the whole toolchain of SciPy + LAPACK + GotoBLAS (or something similar)?


The arguments are good, assuming you're the only person working on the analysis.

As a consultant I'm always constrained to use the lowest common denominator - that is, something that my clients can use, modify, and extend themselves after I leave. In 99.9% of cases, that means Excel, regardless of the task.


While your clients need to see the output of the analysis, or the end result, they don't need to see every branch of it that you have explored.

I will do serious exploratory analysis in R and then show my conclusions, and maybe a few alternatives, in Excel; that doesn't mean you always have to work in Excel. Further, I could not be anywhere near as productive in Excel as I am in R.


More often than not, people are interested in tools and the ability to generate answers themselves, not just the answers.


A very valid point.


So here's a hypothetical question: let's say you have a programmer who knows next to nothing about statistics, knows how to navigate around Excel fairly well, and once allowed himself to be scared away by R.

Let's further assume he's suddenly inherited a large volume of load testing data and a mandate to "make something out of it."

What's a starting point? I, I mean, he hears about the great visualization stuff in R and understands the importance of it but has no clue where to start.

Would HN have any advice for him?


His book "Head First Data Analysis" seems a fairly gentle introduction to data analysis. I've read it, and it's enough to learn a couple of tricks. Where to go from there is harder to figure out. I'm refreshing my stats classes with the Pearson, and looking for a good introductory R book.

This stats course from Berkeley isn't bad, either: http://www.stat.berkeley.edu/classes/s133/schedule.html


Thanks :)


Since the person in question is a programmer: RTFM

I'd say R is much closer to how a software developer is used to working than a spreadsheet is. But I understand that many people who have no practice in software development would rather choose a spreadsheet in such a situation.


Thank you, your response managed to be completely useless while being simultaneously condescending and making you look like an ass who is incapable of grasping a simple point: that I have no experience in statistics, but a great deal in programming, and don't know where to start.

Here, let me rephrase it monosyllabically so that you're capable of understanding the question:

Which "FM" would be good for me to "R"?


"An Introduction to R" could be a good place to start along with the other manuals: http://cran.r-project.org/manuals.html (I think you can also access these through the help menu when interacting with R via the GUI.)

Also, to make nice looking plots, you might as well dive straight into Hadley Wickham's ggplot2 once you've gotten through the very basics: http://had.co.nz/ggplot2/


I assume you want to visualize the data but don't want to do any fancy statistics:

1. Tutorial

2. Language definition

3. Import/Export + sqldf or rodbc

4. Either Lattice[1] or ggplot2[2], and [3]

I'm glad I could be of help. :-)

[1] http://lmdvr.r-forge.r-project.org/figures/figures.html

[2] http://had.co.nz/ggplot2/

[3] http://addictedtor.free.fr/graphiques/


Thank you for your unexpectedly calm and helpful response to my bad-mood snarkotron. :)


MBA: Excel

CS: {R, Python, Matlab, Mathematica}


Where do the Statisticians go? With the CS group, I'm sure, but more specifically, my understanding is that a statistician created R, and thus it suits the statistical mind better than Matlab and Mathematica.


Applied Statisticians: {R, SAS, SPSS}

Basic Stats Research: Same as CS people.

There's a ton of overlap between stats faculty and CS faculty these days - machine learning (usually within AI) being the field of overlap in CS.


CS people might be put off by the fact that R is really a pretty lousy language if you've seen any other language (like, for example, Python). It's obvious that it wasn't a programmer who designed the language R uses.

This blog post lays out a list of specific gripes about the language: http://tjic.com/?p=10739 There is some real weirdness to this language.

The R Programming Language for Programmers (http://www.johndcook.com/R_language_for_programmers.html) does a lot to ameliorate this, but the fact that the weirdness is there at all makes a programmer a little uneasy.


The first article is incorrect in saying that "user_info" will not behave as you'd expect. (I guess the author didn't try it out?) There are definitely some deeper issues, but it seems wrong to call out something as not being programmed by a "programmer" based on syntax. (Underscore (drawn as arrow) is shared by Squeak Smalltalk, at least.)

The second article probably would've done better to use m <- matrix(1:6, nrow=2) and m <- matrix(1:6, nrow=2, byrow=TRUE) as examples. Maybe matrices and arrays are implemented as contiguous ranges of memory underneath -- vectors -- but the language/standard library has plenty of functions dealing with them at a higher level, and you can live entirely within the abstraction if you like (although you may pay a price in speed, especially when constructing the initial structure). ("array" does not take a "by.row" argument.)
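For what it's worth, the column-wise vs row-wise fill distinction in those R examples has a direct analog in NumPy's memory-order parameter; a small sketch (translating the R examples, not taken from the articles):

```python
import numpy as np

# Like R's matrix(1:6, nrow=2): fills column by column (Fortran order).
by_col = np.arange(1, 7).reshape(2, 3, order="F")

# Like matrix(1:6, nrow=2, byrow=TRUE): fills row by row (C order).
by_row = np.arange(1, 7).reshape(2, 3, order="C")

print(by_col)  # [[1 3 5], [2 4 6]]
print(by_row)  # [[1 2 3], [4 5 6]]
```

Either way, the higher-level array functions let you ignore the underlying layout, which is the point being made above.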

I guess I agree with the overall sentiment, but I find most languages to be lousy for one reason or another. (Python certainly has some warts, some in common with R -- lack of consistent naming conventions in the standard library for one.) The vector-oriented functions, convenient syntax for regression, the large library of stat functions, the packages, etc., make the pain worth it. And if R is from non-programmers, wait till you see SAS.


Think of it as a DSL that statisticians like to use. While it has its dark sides, its vector orientation is well suited to a language whose main purpose is data manipulation.
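The "vector orientation" in question means whole-array operations instead of explicit loops. A minimal sketch of that style, in NumPy terms rather than R:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# One expression operates on the whole vector at once -- no loop.
centered = x - x.mean()

print(centered)  # [-1.5 -0.5  0.5  1.5]
```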


> Think of it as a DSL

I can't let this stand as a blanket defense of crappy languages. We have Lua. We have Python. We have Scheme. We have multiple other languages that can be used as a good way to access the goodies provided by your wonderful hand-crafted libraries regardless of what they do. There is no reason for you to invent your own language just so people will be able to use your library code from a REPL.

Sadly, back in the dark ages, this was not true. However, in the future, anyone who tries to defend the horrible design of a new language with "Think of it as a DSL" gets to debug a 1000 KLOC application written in ANS COBOL 1968, which is a DSL for fixed-field database munging.


Let's put things in perspective: R is S, and S had its first appearance in the 1970s. Its current incarnation is from the late 1980s. I remember that other statistical packages back then had similar-looking syntax. This is also the reason why Lua/Python stand no chance against R: they simply lack the vast abundance of statistical packages and tutorials. The best thing they can do is use R as an inferior process (IIRC there is a package for Incanter that does that).

The point, though, is that S/R isn't that badly suited to what it was created for. There are a few pitfalls, but those are explained in the language definition. Unfortunately nobody RTFMs nowadays (like people used to do back in what you call the dark ages), because it's easier to start screaming for Mommy and write stupid blog posts that prove nothing except that those people haven't read the language definition.

And BTW, python sucks.


I find R syntax really annoying, and I am trying to work more in Octave, a Matlab-compatible system. I don't like the mostly gratuitous complexity of the R type hierarchy, and prefer to have everything be a matrix. I also detest the use of periods as plain characters in names. FWIW.

http://www.octave.org


Honestly, once you learn them both, it's best to use Illustrator when you need good presentation and R all other times. I can think of very little that Excel can do that can't be done faster in R and prettier with Illustrator.

So I'd condense the argument to: if you're not comfortable with R/Illustrator and don't have time to learn, use Excel.


Can you easily update data feeds to Illustrator without having to redo your drafted work?


It's generally for finishing and publication. For rapidly updating graphics, the lattice and ggplot2 libraries both look good enough.


matplotlib is my best friend; Matlab is my second best friend.

Excel? Are you kidding me?


Yep, I use matplotlib pretty much exclusively for this kind of thing. I have my eye on clojure's Incanter, though.
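For the programmer upthread with a pile of load-test data, a minimal matplotlib sketch (hypothetical data and file name) shows how little is needed to get a first plot:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical load-test measurements.
seconds = range(10)
latency_ms = [20, 22, 21, 35, 80, 75, 40, 25, 22, 21]

fig, ax = plt.subplots()
ax.plot(seconds, latency_ms, marker="o")
ax.set_xlabel("elapsed time (s)")
ax.set_ylabel("latency (ms)")
ax.set_title("Load test latency over time")
fig.savefig("latency.png")
```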


Incanter is really shaping up. I can't say enough good things about it.


Has anyone used both R and Matlab/Octave?

I've used Matlab but not R. For machine learning, I found Matlab great because of its matrix support.

What are the pros and cons of R? What advantages does R have over Matlab?



