How much of R is written in R? (r-bloggers.com)
56 points by g-garron on Aug 27, 2011 | 25 comments



I think both lines of code and number of files are terrible metrics for comparing the size of code bases across three very different languages. I don't have any experience with Fortran, but as a seasoned R and C hacker, I don't think the two languages could be much more different.

Well written R code tends to be incredibly compact, because the functions available in base R are plentiful and the language is both heavily functional and vector-oriented. The amount of manual memory management and explicit looping required by C easily inflates the lines of code.
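
A rough illustration of the compactness argument (just a sketch, not from the article): the vectorized version below is a single expression, while the "C-style" version has to manage allocation and iteration by hand.

    # Vectorized R: one expression, no explicit loop or allocation
    x <- rnorm(1e6)
    y <- x^2 + 2 * x + 1

    # The same computation written C-style in R
    y2 <- numeric(length(x))            # manual pre-allocation
    for (i in seq_along(x)) {
      y2[i] <- x[i]^2 + 2 * x[i] + 1    # element-by-element loop
    }
    all.equal(y, y2)                    # TRUE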

A better metric, perhaps, would be to count the number of functions written in each language. Of course, there are issues of style there, but I think that would lead to a more comparable estimate. I don't know of a tool that does that for C - does anyone know of one, or should I go the parser route?
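
For the R side at least, a rough sketch of counting function objects per package (the package names are just examples; the C side would still need ctags or a real parser):

    # Count how many objects in a package's namespace are functions
    count_functions <- function(pkg) {
      ns <- asNamespace(pkg)
      sum(vapply(ls(ns), function(nm) is.function(get(nm, envir = ns)),
                 logical(1)))
    }
    count_functions("stats")
    count_functions("base")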


R is a pretty terrible language for many reasons, and it always surprises me when it rises to the top on HN.

But in terms of code quality within R itself, I was surprised to find that a number of its .c files are actually machine-translated Fortran, so I'm guessing the author's statistics are not far off.

I discovered this when I decided to confirm my suspicions that my (the?) most frequently used function, t(), which computes the transpose of a matrix, was implemented about as naively as possible. If R developers were really concerned with speed, this would probably be the first place to start optimizing.
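
For what it's worth, it's easy to see the cost on your own machine (timings obviously vary):

    # Time the base transpose on a moderately large matrix
    m <- matrix(rnorm(4e6), nrow = 2000)
    system.time(tm <- t(m))
    # Baseline: a plain copy of the same matrix, for comparison
    system.time(m2 <- m + 0)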


Can you go into more detail? I've been toying with learning R for ad-hoc analysis but if there is a better alternative worth learning I'd love to hear about it.


Depends on what you mean by "better". MATLAB and python+numpy will almost certainly run faster than R in almost all situations; they are also far more pleasant to program in (in my opinion).

However R has the advantage that it will have support for every obscure statistical analysis routine you can ever think of. It also has better support for reading in data from all kinds of sources and handling things like missing and invalid data. So if your goal is to quickly read in a bunch of data sets (that are small enough that performance isn't a critical issue) from arbitrary sources, run a bunch of statistical functions on that data and turn those results into pretty graphs, then R is pretty great.
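
For example, reading a messy CSV with several missing-value conventions is a one-liner (the file name and sentinel values here are hypothetical):

    # Treat several common sentinels as NA while reading
    d <- read.csv("survey.csv", na.strings = c("", "NA", "-999"),
                  stringsAsFactors = FALSE)
    summary(d)                           # per-column NA counts appear here
    complete <- d[complete.cases(d), ]   # keep only rows with no missing fields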


Correct. The language R is just plain weird (insane ideas about scoping/binding that seem to be completely unlike anything you've seen in any reasonably designed language in the last 20 years) and not very efficient to boot, unless you hit one of the bits that's just C under the hood.

However, the vast, vast repository of every statistical analysis under the sun - not just 'core R' but everything that any statistician has hacked up - is unparalleled.

My 'coping with R' strategy is to do all the heavy lifting data manipulation in C/C++/Python, then do one-shot things in R. I just pass CSV files around, but there are tighter integrations of R and Python if you want to look into that.
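
The CSV hand-off really is that simple; a minimal sketch of the R end of that workflow (file and column names hypothetical):

    # The heavy lifting elsewhere produced features.csv; R does the one-shot model
    d <- read.csv("features.csv")
    fit <- lm(y ~ ., data = d)            # assumes a response column named y
    write.csv(coef(summary(fit)), "coefficients.csv")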


Just as an aside, "coping with R" would be an awesome concept for a book or series of blog posts.


Or maybe 'Living with R', sort of akin to the self-help-book 'Living with Chronic Fatigue Syndrome' type genre.


Could you go into more detail about the "insane" scoping/binding ideas? I've always found the environment model very straightforward and am wondering if I'm missing something.


What you describe is typically how I use R. It certainly has its limits performance-wise, but it is hard to beat as a "stat toolbox". Even in cases where it wasn't up to performance needs, I've found it useful for investigating which methods to use prior to implementing a full-blown solution.

For example, I used the free R package "earth" to confirm that something like MARS (Multivariate Adaptive Regression Splines) is a good approach to a particular analysis. For my client that initial test justified paying Salford Systems for their great, but expensive, CART/MARS software.
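
For the curious, the earth call really is that short; a minimal sketch on a built-in data set rather than the client's data:

    library(earth)                          # free MARS implementation
    fit <- earth(Volume ~ Girth + Height, data = trees)
    summary(fit)                            # selected hinge terms, GCV, R-squared
    plot(fit)                               # model selection / residual plots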



Interesting... According to ohloh, Ubuntu has just 4k lines of code: http://www.ohloh.net/p/ubuntu

Must be some pretty dense code!


Seems unrelated to Ubuntu; it seems to be this project: http://code.google.com/p/zecurrencyconverter/


The HN title's typo is very confusing...


It is certainly a good thing (for speed) that the majority of lines of R are written in C. I used to work in a shop that did much of our development in R. We always used to joke "R is really fast if you write it in C".


R is quite slow, but there are two ways to improve this state of affairs.

1. Use the functional-style stuff rather than explicit loops (especially the apply family of functions)

2. For large data sets, avoid the default behaviour of loading everything into memory all at once. The sqlite-backed data frame stuff is probably a good default for larger data sets :)
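
A rough sketch of the second point, assuming the sqldf package (the file name and query are hypothetical):

    library(sqldf)
    # Filter in SQLite and pull only the needed rows/columns into memory,
    # rather than reading the whole file into a data frame first
    big_subset <- read.csv.sql("huge_log.csv",
                               sql = "select user_id, value from file where value > 100")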

It's still slow though.


R has always been a niche language/platform for dataists/statisticians, and because of that it really hasn't benefited from the contributions of programming-language developers and modern implementation techniques.

It'd be cool to see R evolve more quickly as a language (implementation-wise). We're getting hints of it with the new byte code compiler, though.

That would be ideal.
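
In the meantime, the compiler package that now ships with R already lets you byte-compile individual functions by hand, roughly like this:

    library(compiler)
    f  <- function(x) { s <- 0; for (v in x) s <- s + v; s }
    fc <- cmpfun(f)                 # byte-compiled version of the same function
    x  <- rnorm(1e6)
    system.time(f(x))
    system.time(fc(x))              # typically noticeably faster for loop-heavy code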


It's a common misconception that the apply functions are faster than for loops. A for loop is faster than apply(). But lapply(), sapply(), and mapply() are plenty fast.
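
Easy enough to check yourself; a quick timing sketch (results will vary by machine and R version):

    m <- matrix(rnorm(1e6), nrow = 1e4)
    # apply() over rows
    system.time(r1 <- apply(m, 1, sum))
    # explicit for loop over rows
    system.time({ r2 <- numeric(nrow(m))
                  for (i in seq_len(nrow(m))) r2[i] <- sum(m[i, ]) })
    # the dedicated vectorized function beats both
    system.time(r3 <- rowSums(m))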


How R competes with other data mining environments:

http://www.kdnuggets.com/2011/08/poll-languages-for-data-min...


Why didn't he use cloc?


My usual tool is sloccount (http://www.dwheeler.com/sloccount/), but it doesn't identify R as a language (and discounts it entirely, it appears).


How well does R compete with SPSS?


By what metric? The GUI of SPSS is better than R's (R doesn't have a GUI, though interesting competitors like Rcmdr and Deducer are available). In terms of everything else (performance, graphics, statistical tools, the programming language), from what I understand R is the winner without a doubt...


I learned R some time ago in university. I've heard of lots of newly graduated colleagues that now work doing "SPSS consulting" (whatever that means) for big businesses. But I can't really see how you could use R to parse financial data, because I lack the financial background to understand what it means. Maybe that's what SPSS provides, and what singingfish means with "you don't need to know what you're doing".


Actually, I read his comment as meaning that SPSS was more useful for the sort of "cookbook stat" usually taught in business schools.


SPSS comes from the school of "you don't need to know what you're doing in order to analyse data". R requires that you understand what it is you want to do in order to make it work.

I quite like the middle ground of JMP (kind of SAS lite), and it's a damned sight cheaper than SPSS too.



