R 4.0 (ethz.ch)
306 points by _fnhr on April 24, 2020 | 162 comments



This alone is reason to upgrade -> "R now uses a stringsAsFactors = FALSE default, and hence by default no longer converts strings to factors in calls to data.frame() and read.table()."
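
For anyone who hasn't been bitten by this yet, a rough sketch of the old vs. new behaviour:

  df <- data.frame(x = c("a", "b"))
  class(df$x)
  # "factor"    on R < 4.0  (stringsAsFactors defaulted to TRUE)
  # "character" on R >= 4.0

  df_old <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
  class(df_old$x)  # "factor" -- you can still opt back in explicitly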


This is one of the primary reasons I started using read_csv from readr rather than read.csv from base R. Every time you teach someone to read a CSV using base R, this always came up somewhere in the middle of an analysis because those strings were read as factors.


Yes, factors absolutely can be a pain in the ass when they're instituted too early in an analysis. It's better to keep them as strings for as long as possible and only convert to factor when you've cleaned up your data. Otherwise, you have to deal with annoying and confusing factor manipulation.

The drawback is that character takes up so much space, but these days memory is so bountiful it usually doesn't matter.


Nowadays strings are stored in a "string pool" anyway, so if you have a string that can be turned into a factor (i.e. with few unique variants), you probably don't need to.
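
A rough way to see the trade-off (sizes approximate, 64-bit R assumed):

  x_chr <- sample(c("low", "medium", "high"), 1e6, replace = TRUE)
  x_fct <- factor(x_chr)

  object.size(x_chr)  # ~8 MB: a vector of pointers into R's shared string cache,
                      # so the three distinct strings are stored only once
  object.size(x_fct)  # ~4 MB: integer codes plus a levels attribute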


But readr gives you a tibble instead of a plain data.frame and that adds a bunch of other headaches.


How so? I can't think of any drawbacks of tibbles vs data.frames?


Using tibbles outside of tidyverse can be dangerous.

    df_iris <- iris
    tb_iris <- tibble(iris)

    nunique <- function(x, colname) length(unique(x[,colname]))

    nunique(df_iris, "Species")
    > 3

    nunique(tb_iris, "Species")
    > 1
Imagine now using some complex function from a repository (e.g. Bioconductor) that works on data.frames and passing a tibble to it.


Post edit: Preserving the post below, which I think highlights some of the issues in the parent commenter's example, but it turns out they are correct on another point -- narrowly, in their example, the `[` subset operator dispatches differently on tibbles than on data.frames, so you can produce weird behaviour. So to anyone reading, please consider upvoting the parent and reading the rest of the thread.

Original post follows:

Right off the bat, the problem is not "using" tibbles, it's that you've incorrectly constructed one by passing the data through the tibble() constructor rather than using as_tibble(). The tibble constructor -- for pretty good reasons in other circumstances that seem crazy to you here because of your intent -- infers that you want the entire data frame to be a single column inside the tibble, called "iris". It does this because it evaluates the variable name passed to the tibble constructor as both the intended column name and the data to be placed inside the column. This demonstrates nesting, which is one of the great features of tibbles and otherwise used for a bunch of stuff.

If you had done `tb_iris <- as_tibble(iris)`, it would have worked fine. `as_tibble()` is the function to convert an existing data structure to a tibble. R is obviously not "type safe" in any way, but you can engage in defensive programming, and one way you can do that is being hyper-aware of the steps you take during type conversions. If you check the documentation for `tibble()`, it tells you explicitly to "Use as_tibble() to turn an existing object into a tibble." Is there a reason you didn't? Imagine this related example:

  my_string <- "10"
  numeric(my_string)
  as.numeric(my_string)
Would we conclude that "using the numeric type can be dangerous" because the constructor interpreted the argument differently than the conversion helper?

Second, I suspect you must be using extremely old versions of things, because on more recent versions, your nunique function would fail, not produce 1. I correctly get "Error: Can't find column `Species` in `.data`." This error message is maybe a little confusing if you don't check the structure `str(tb_iris)` of tb_iris to see what I mentioned above, but is the correct error to output in light of it. You'd also be able to flag this by just checking `colnames(tb_iris)` or `View(tb_iris)` if you're working in RStudio or using the embedded environment pane or really any other way of looking at the data.

But your broader point is also false. Once a tibble has been formed, it should work EXACTLY the way a data.frame works because R objects can have multiple classes. The only thing that makes a tibble different than a data.frame is that it has an additional class label. All dispatches that work on data.frame objects work on tibbles because of how multiple classing works in R. This has been a goal since the beginning of tibble. The one exception I'm aware of is external functions that incorrectly check `if(class(obj) == "data.frame")` instead of using `is.data.frame()` or `if("data.frame" %in% class(obj))`. The former is and always has been incorrect because of how multiple dispatch is designed to work in R and should generate an error with multi-classed objects because the if statement evaluates to a vector of logicals instead of a logical.
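
To make the class-check point concrete, a quick sketch:

  tb <- tibble::as_tibble(iris)

  class(tb)                    # "tbl_df" "tbl" "data.frame"
  is.data.frame(tb)            # TRUE
  "data.frame" %in% class(tb)  # TRUE
  class(tb) == "data.frame"    # FALSE FALSE TRUE -- a logical vector, not a single TRUE/FALSE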

One way you can tell that tibbles and data frames are identical, save the above caveat, is to run the following code:

  df_iris <- iris
  tb_iris <- as_tibble(iris)
  identical(df_iris, tb_iris)
  class(tb_iris) <- "data.frame"
  identical(df_iris, tb_iris) 
Note that you are not "downconverting" a tibble into a data.frame in this code (but that would work too) -- you are taking the tibble exactly as is and hacking its class label to look like a data frame. It's identical because a tibble was always a data frame.


I think everything you wrote here is false, so I am not sure how to reply. Will try to keep it respectful and short:

First, about the as_tibble - it returns the same thing as tibble:

    tb_iris <- as_tibble(iris)
    length(unique(tb_iris[,"Species"]))
    > 1
Second, about the incorrect version:

    > packageVersion("tibble")
    [1] ‘3.0.1’
Which is also the current version on CRAN.

Third, about the classes:

You say:

> Once a tibble has been formed, it should work EXACTLY the way a data.frame works because R objects can have multiple classes.

This is not the case. You can add any class to any object in R S3 system. So people behind tibble can call their tibble a data.frame but it gives no guarantee that it will behave like one.

More about this problem here (and you can also find replies from tidyverse authors) https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...


Actually your reply was very helpful because it surfaced ways in which you were partially right and I was partially wrong.

I highlighted the nesting issue in constructing versus coercing (which is correct and does have implications for what you're trying to do) but actually in your example the distinction is broken because of a different edge case

Which is to say the following:

  ncol(iris) # 5
  ncol(as_tibble(iris)) # 5
  ncol(tibble(iris)) # 1

  iris$Species # Works
  as_tibble(iris)$Species # Works
  tibble(iris)$Species # Errors because of nesting

  iris[, "Species"] # Works
  tibble(iris)[, "Species"] # Doesn't work
  as_tibble(iris)[, "Species"] # Works
 
However, you're correct that because the subset operator for tibble doesn't drop dimensions, length gets you the number of columns rather than the number of observations. This does speak to the fact that length is a pretty shitty function to begin with, but I concede you're partially correct there.
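
Spelling out the drop difference with a quick sketch:

  length(iris[, "Species"])             # 150 -- drop = TRUE returns the bare factor
  length(as_tibble(iris)[, "Species"])  # 1   -- drop = FALSE returns a one-column tibble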

You are also correct that because class labels are not contractual, there is no guarantee that having the data.frame fallback label means stuff behaves identically (for instance, you could add the data.frame label to any data structure and the data.frame dispatch stuff would not work properly). My point was that in the case of a tibble, a tibble is literally a data frame with an additional class label. If you remove that class label, it's exactly identical.

But your example and linked discussion do highlight a way in which I'm wrong; the subset function is overridden for something with a tibble class label. That's true and could produce edge cases I hadn't considered.

Apologies for any hostility in my original reply.


I'm sorry to report that this analysis is completely wrong, and demonstrates a lack of understanding of the R object model. The class that is provided by tibble does not implement all of data.frame, and the OP is correct.


(S3 -- see footnote) Classes don't "implement" anything in R the way they would in other languages. They are labels that tell dispatch functions how to deal with an object. A tibble is internally a data frame. The last example in my post makes this exactly clear.

The other OO systems in R do act closer to traditional classes, but all the tidyverse stuff is S3.

(But the OP was correct in another sense related to the example narrowly!)


So you're ignoring that the [-function by design works differently for tibbles than for data frames. This isn't really a problem with tibble but with sloppiness in programming allowed by dynamic languages.

I personally think it's a good thing that the drop-argument defaults to FALSE for tibbles, since data frame's default drop = TRUE is a source of frequent bugs. The change of the default for this parameter is the source of your observation.
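
A quick sketch of the classic drop = TRUE surprise on plain data frames:

  df <- data.frame(a = 1:3, b = 4:6)

  df[, c("a", "b")]        # still a data.frame
  df[, "a"]                # silently drops to a plain vector
  df[, "a", drop = FALSE]  # stays a one-column data.frame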


I am not ignoring it, I am _highlighting_ it. The question of the comment above was "why would one prefer data.frame over tibble". I merely answered that question.


Yes, but the problem isn't tibble, since what you're highlighting is a design choice and an argument in favor of tibble. The problem only arises when you're not aware of this design choice, which is facilitated by sloppiness and dynamically typed languages.

One might ask whether it was a good idea that tibble enlists data.frame as an inherited class. Since a tibble obviously doesn't behave like a data frame, one could also argue that this is a mistake on part of the tibble developers but this is a different discussion.


All I am saying is that there are perfectly good reasons for not using tibbles if you do any kind of work outside of tidyverse. And you seem to agree?

As for whether or not tibbles should be data.frames - I posted a link to this exact discussion on R-dev mailing list within this thread, as an answer to a different poster. Here it is: https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...


Ok, now I understand where you're coming from.


Just put an as.data.frame() around it. That's what I do with readxl::read_excel :-)


One of the big reasons why I quit R 10 years ago and never looked back - Python wasn't secretly converting anything, nor failing silently when it's not the expected type.


R's come a long way in the last decade. The tibble and data.table packages both address this issue. data.table (https://github.com/Rdatatable/data.table) is the more strongly-typed of the two, by default it fails loudly when it encounters data that doesn't conform to the column type. It's also quite fast--binds to C code that parallelizes with OpenMP. It has very terse and expressive syntax, I find it so much more intuitive and easy to work with than pandas.

If you're happy with Python, by all means keep using it. I use both languages. Just suggesting that if you gave up on R that long ago, you might be pleasantly surprised by how much better it's gotten since then.


vctrs [0] is the latest effort by the Rstudio developers to help people write type-stable code. The R standard library has a lot of issues with silently casting types, but the wonderful thing about it being so scheme-like is that many of these things can be evolved through libraries.

[0] https://github.com/r-lib/vctrs


By python, are you also including pandas? Because it is definitely doing a lot of that!


Exactly! TensorFlow is the only Python data analysis package I've used that doesn't automatically convert things in the background. I was helping a friend with STATA the other day, which doesn't automatically convert, and I realized I've gotten so used to that behavior in R and Python.


I stopped using R a number of years ago because it is not useful for very large datasets (in the tens to hundreds of millions) and now use kdb almost exclusively.


R has quite a few specialized libs to deal with large datasets (out of memory). Nothing keeps you from hosting the data in a DBMS and using SQL (or dplyr) to pull the data in an appropriate format.


I run a data science department at a corporation and this is exactly how we handle our massive amounts of data. It's rare that we're using a billion+ data points in one model so we use SQL to get the data we need in the format we need and move forward from there in R.


The new C++-derived syntax for string literals seems to me to be the top new feature. It will make it possible to support markdown, LaTeX, R code and Windows path literals without munging them first.


If I understand correctly, does this mean that on Windows, when I've copied a file ___location, I would no longer have to replace backslashes in the file path with double backslashes or forward slashes? Or am I off-piste?


Correct. One can write r"(c:\Windows\System32)" . Check out additional examples at the end of the help file: https://github.com/wch/r-source/blob/trunk/src/library/base/...


Awesome


I'm so used to using data.table to import data I forgot this was still a thing


One of the selling points of tidyverse’s readr was “no stringsAsFactors = FALSE necessary.” That’s how annoying it is.


You could add to your .Rprofile

  options(stringsAsFactors=FALSE)


This is likely to break things when you share or publish your code.


Upgrading will also break things. And code which depends on defaults will be broken either in new releases or in older releases.


Yes, I mean, we could list things all day that make R a horrible environment for reproducible analysis or production deployment, but it is what it is.


You’re right. My defensive reply was unwarranted.


Upgrading makes the changes explicit, tinkering with your environment variables doesn't.


Upgrading is not different than changing environment variables as far as breaking existing code is concerned.

You can always run R --no-init-file to be sure that you have the default settings. Now you have to know what default settings the code that you want to run expects.


And make your code potentially non-reproducible?


If you care about that don't rely on defaults. This upgrade makes old code non-reproducible, should everyone abstain from upgrading?


Well, that is why there is 4.0. Hadley Wickham has had a HUGE influence on R, and now we have a lot of new things we can use that make analyses reproducible in base R.


This makes me want to sing and dance. Ding-dong, the stringsAsFactors witch is dead!


I thought exactly the same thing! Been a long time coming…


My thoughts exactly when I saw that bullet point.


As someone who writes R daily, I’m really excited for 4.0. That said, R still leaves a lot to be desired. Changing the default for stringsAsFactors is great, and I think it reflects a small shift in R from being an exclusively stats-based language to something more general purpose. The rationale for stringsAsFactors is that in most statistical models you need your categorical variables encoded as factors.

That said, R still is a stats language by design. In Python or JS for example, you can concatenate strings with ‘a’ + ‘b’ but the + operator in R is explicitly only for numeric types. R also has a horrible architecture for memory management, leading to code that uses profound amounts of RAM. I face this issue constantly as I work with very large datasets.

I have a love/hate relationship with R and despise using it in production. I’m also not a fan of the divergence that Tidyverse has caused. Particularly the expectation of Non-standard evaluation and the tendency for new R learners to become dependent on these packages. Especially as it relates to reproducibility and deploying code, these unnecessary dependencies suck. Tidyverse is maturing and breaking changes are still too common for comfort. There is no reason in my opinion to load stringr when a grep() will suffice. Or, to subset with select when [[ works perfectly fine. Or to filter when subsetting on a logical with which()... the list goes on. Tidyverse is essentially reinventing the wheel in many places. The biggest problem is that it doesn’t translate well to base-R in my experience with new programmers, leading to this divergence.

That said, piping with %>% and modifying directly with %<>%(via magrittr) is a pleasure that other languages I’ve worked with don’t manage as well.

And at the end of the day, I’m not going to rewrite implementations of all the latest statistical methods already written in R, and this is its strong suit. I’m increasingly using sophisticated spatial and spatiotemporal methods, and these methods are solely implemented in R.

I understand that R gets a lot of flack from software developers and I understand why. But, I also think it’s too often overlooked for its strong suits.


Maybe stringsAsFactors was a mistake in the original design, but there is so much code out there reliant on this behavior now, and since it was the default you don't really know where the new behavior will bite you besides looking for calls to data.frame() that don't set the parameter.

Plus, it's not such a bad feature when you know it's coming.

As far as the tidyverse goes, I get it now; however, it seems to discourage the creation of a nice, well-organized set of functions to limit the amount you need to keep in your head at the same time, and a lot of R users are very smart people capable of understanding very disorganized code. Instead of functions you get copy/pasted incantations; in base R it's at least broken down into steps, which is a start.


I teach a data science graduate course to non-tech majors, and it's mostly on R. I welcome these changes.

I teach base R first for a few weeks, then I teach the TidyVerse as I introduce data science concepts, like text mining. I convey the TidyVerse as like an overlay on R, improving shortcomings and adding great functionality, syntactic sugar, etc.

This context switch is jarring to some students. It's great to see some of the TidyVerse's strengths -- like this bit of tibble behaviour -- be moved back into base R.

Now pardon me: I need to go. I start teaching dplyr in 51 minutes!


I must admit to not really understanding the TidyVerse attraction, or why I should be doing "tidy data evaluation", rather than using base R.

If I want to use an advanced data manipulation library, I'd typically reach for data.table. If I want to use a verb based approach, why not SQL rather than dplyr.

I have tried dplyr and code I've written a few years ago still imports that package (hopefully it still works), but I just didn't find it was particularly useful or helpful compared to the alternatives.


Tidyverse is more than dplyr. It also includes libraries like ggplot2, for which there really is no peer. Also stringr, for string manipulation, and tidyr, which has a wide variety of very useful utility functions for working with data. dplyr is simply the backbone that connects them all and allows you to move between them without switching thought processes.

dplyr also offers functionality that many don't realize beyond its basic data manipulation. tibbles allow for arbitrary data types for columns, meaning you can have data frames containing your typical strings or floats, but you can also have data frames with columns consisting of nested data frames, or fitted models, or other atypical objects. Once you start working in this way, it can really streamline a lot of complex analytical processes, make your code much cleaner and easier to work with, and allows you to integrate them directly into other tidyverse packages, like ggplot2.

> If I want to use an advanced data manipulation library, I'd typically reach for data.table

Data.table is powerful, but from a usability/readability perspective, most people find it inferior to dplyr. And there are now packages for using the dplyr API and having it run data.table on the backend, so I personally see little to no use for using data.table by itself anymore.
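
For example, with dtplyr (a rough sketch, assuming its lazy_dt() workflow):

  library(dplyr)
  library(dtplyr)

  mtcars %>%
    lazy_dt() %>%                         # dplyr verbs get translated into data.table calls
    filter(cyl == 4) %>%
    summarise(mean_mpg = mean(mpg)) %>%
    as_tibble()                           # forces evaluation and returns a tibble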

> If I want to use a verb based approach, why not SQL rather than dplyr.

This seems like an overly simplistic reduction. There are bigger differences between dplyr and SQL than both being a verb based approach. And even were there not, dplyr is directly integrated within R. It is still much easier to do your data manipulation directly in R and handing it back and forth between dplyr and whatever other libraries you are using, than it is to do the same in SQL (even utilizing SQL queries within R).

The beauty of the tidyverse is that it is unifying the most important aspects of the data science process under a single approach and API philosophy. It integrates in a way that is pretty unprecedented not just in R, but in any programming language (where data analysis tasks are concerned).


I don't want to say that the Tidyverse is inferior, because it's not and it works for a lot of people. Hadley will also know much more about R and always be a better programmer than me.

However, there are other tools which predate the Tidyverse, and which I think do a stellar job at their core competency.

ggplot2 is great, but it was also around long before the Tidyverse concept. But at the same time base R plot() can also be pretty powerful[1] and look great.[2] As an alternative to ggplot2 I would also propose that Vega Lite[3] could be a contender, with an excellent cross language ecosystem.

There are also libraries available for applying SQL on dataframes from directly within R if that's what you want to do. sqldf[4] has been around for a long time, and now there is also the new duckdf[5], which is a bit quicker. Or one can use the DBI[6] library, which requires a bit more coding. Learning SQL is also a great skill which has a lot of value outside R.
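
For instance, a quick sqldf sketch:

  library(sqldf)
  sqldf("SELECT Species, count(*) AS n FROM iris GROUP BY Species")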

Tidyverse may be useful, helpful and convenient for a lot of people, but I think we shouldn't lose sight of the wide R ecosystem which has provided a lot of alternative packages for a long time, perhaps without the marketing and profile of RStudio and the Tidyverse.

[1] http://karolis.koncevicius.lt/posts/r_base_plotting_without_...

[2] https://github.com/KKPMW/basetheme

[3] https://vegawidget.github.io/vlbuildr/

[4] https://cran.r-project.org/web/packages/sqldf/index.html

[5] https://github.com/phillc73/duckdf

[6] https://cran.r-project.org/web/packages/DBI/index.html


> includes libraries like ggplot2, for which there really is no peer

lattice is largely ignored but it is quite similar to ggplot2 in terms of features, and (as a matter of opinion) the plots it produces are aesthetically more pleasing, also great for making subplots using conditioning variables


SQL is incredibly verbose compared to dplyr. Modern non-SQL query languages coming out tend to be more similar to dplyr, based on method chaining or piping data table objects through function calls, like UNIX pipes. It's much more composable than SQL and it just makes intuitive sense as a sequence of data transformation steps in a pipeline.

My favorite feature of dplyr that makes it stand out compared to SQL is that a window function is just a group_by() without a summarize(). There's no separate syntax for "PARTITION BY".
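
A small sketch of that point (mtcars is just a stand-in dataset):

  library(dplyr)

  mtcars %>%
    group_by(cyl) %>%
    mutate(mpg_rank = rank(desc(mpg))) %>%  # a window function: ranks computed within each cyl group
    ungroup()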

For data analysis, whenever possible, I don't even write SQL anymore, I use the dbplyr package, which is a dplyr-to-SQL compiler.


I studied statistics, so R was the first programming I ever learned. I didn't know what it was at the time, but the tidyverse's chaining of operations on a data frame (via magrittr's %>%) was a really cool introduction to functional programming


The pipe operator in R also sort of does my head in.

I learned R in a way that everything went right to left. The variable on the left was manipulated by whatever (functions or other code) on the right.

The pipe operator reverses that flow, with everything now moving left to right, which I find difficult to follow and debug.


>The pipe operator in R also sort of does my head in.

As someone who taught myself base R from scratch in 2013, I (used to!) agree with this. When the pipe operator was first introduced, I’d roll my eyes whenever I saw a script that used it and move along.

But I forced my brain to adapt, and now it’s probably my favorite feature of the R language. Data science is full of sequences of transformations, and in my opinion it’s more readable and bug-resistant to phrase these long chains as:

  f(x) %>% g() %>% h()
rather than:

  h(g(f(x))) 
or certainly:

  foo <- f(x) 
  foo <- g(foo)
  foo <- h(foo)
I can comprehend and modify others’ (to include past versions of me) R code much more quickly with this paradigm. You can quickly debug a chain by commenting out functions sequentially (i.e. first test: “f(x) # %>% ...”). It also becomes much faster to plug new transformations into the chain, when needed.

One thing that helps to keep track of the input x as it moves through the chain is using the “.” placeholder (especially when you need to specify function arguments), like so:

  f(x) %>% 
    g(., n=100, param="baz") %>%
    mean(.$column_name)
Here, the . stands in for “whatever is coming out of the pipe” from the left.


The best use of the pipe operator, in my opinion, is when every pipe is followed by a line break, like this:

  data %>%
    filter(thing > 4) %>%
    mutate(new = fun(old)) %>%
    group_by(var) %>%
    summarize(new_mean = mean(new))
Then you are reading the code top-to-bottom, just like any other R script but without all of the temporary variables.


This only really works if nothing ever fails.

It's great for data analysis and terrible for programming.

The reason it sucks for programming is because you can't debug it or inspect intermediate variables, which is really really annoying.

It's great for one off transformations and plotting, but a really, really really bad idea for programming.

Then you add NSE, which makes it hard to functionalise procedural pipes (especially for people who learned tidyverse) and it's a recipe for unmaintainable and profoundly annoying legacy code.

That being said, I love it for interactive analysis.


Thanks for sharing.

I do exactly the same, and I always find it hard to justify base-R in front of my students, after they learn about the Tidyverse ;)


I find tidyverse a bit slow, and I don't like having to memorize all the verbs. I typically reach for data.table these days as a nice middle ground between base R and tidyverse. Your thoughts?


There’s like 3 verbs.

Select, for columns; Filter, for rows; Mutate, for adding columns or changing values.

On top of that there are a bunch of sugared versions of these, like rename, instead of select or transmute instead of mutate.

When you want to do something advanced you either combine those or go out to see if somebody else has already made that function for you. I don’t think that having functions for joins or removing NA’s is something special to tidyverse. Everybody needs to put such operations somewhere.

I find tidyverse to have a simple, expandable vocabulary that is easy to learn, read and grow.

But it is slow.

I never found any kind of logic behind data.table's API, and I find it hard to read. I haven’t found an introduction to it that makes anything click.

But it is fast and it has these absolutely amazing rolling joins



Thank you for the link. I still end up confused.

How does selecting the 3rd column,

    dt[, 3]
jive with

>`DT[i, j, by, ...]` which means: “Take DT, subset rows using i, then calculate j ...

I don’t understand how column selection and calculating an expression is the same thing.

So now I can’t figure out if

    DT[, sum(V1)]
Will return the column indexed at the sum of V1 or if it will return the sum of the column called V1. How can I tell, from the syntax?


j is always for columns.

So for your first example it grabs the third column. The calculation you’re performing is a column selection.

For your second example, you’re returning all the rows, selecting column V1, and summing it.

I really prefer to name the operations

    dt[rows, columns, groups]
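
Putting those together on a toy table (a rough sketch):

  library(data.table)
  dt <- data.table(grp = c("a", "a", "b"), v = c(1, 2, 3))

  dt[, 2]                 # j as a column position: a one-column data.table
  dt[, sum(v)]            # j as an expression over column v: the scalar 6
  dt[, sum(v), by = grp]  # the same expression, evaluated per group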


don't forget the crucial group_by() / summarise()


Thanks too. I had not heard of the "TidyVerse".

The TidyVerse is an opinionated group of R packages that all function similarly. ggplot2 is in there, and I already use that one.

I do like R for analysis (I choose to use it over Excel or Stata for my data analysis classes), though I wish the help pages were better.

It's a hard language to grok; even the data type names are unclear (data frame vs. matrix vs. vector...). I'll take a look at the tidyverse packages.

https://www.tidyverse.org/learn/

any additional learning resources you'd recommend?


As the other commenter said, probably the best resource out there is "R for Data Science" by Hadley Wickham, the architect of the tidyverse. It will get you up and running with the most important parts of the tidyverse (dplyr for data manipulation, ggplot2 for data visualization, tidyr for general data cleaning utilities, etc). It's available online for free here: https://r4ds.had.co.nz/


The “R for Data Science” book by Hadley Wickham (creator of tidyverse, and I believe he is chief data scientist at R Studio) is hands down one of the best introductions to data exploration and analysis.


To add onto others’ answers, Advanced R from the same author dives more into the inner workings of R (such as lists vs. vectors, what classes are, etc.). It helped me understand why my code works the way it does.


Not a whole lot is new. Highlights (from my perspective):

* matrices are now a subclass of arrays.

* better reference counting to save on memory.

* stringsAsFactors = FALSE is now default breaking a lot of packages (was this really worth changing?)

* a new way of defining raw strings: r"(...)"

* better default colors for plots. The new default color scheme is less saturated. Apart from looking better, in the old scheme something plotted in bright yellow on a white background was hard to read. Yellow is now more of a mustard.


> * stringsAsFactors = FALSE is now default breaking a lot of packages (was this really worth changing?)

In my experience, both when I learned R myself back in the day, and teaching others to use it, this is something that trips up newbies over and over again. So from that perspective changing it is good.

OTOH, a lot of existing code will need updating. Oh well..


> stringsAsFactors = FALSE is now default breaking a lot of packages

I know this is popular on the surface, but it's going to break so many "reproducible" analyses from manuscripts and scientific studies.


Reproducible analyses should be tied to specific versions of software, or even the OS and hardware. So no, this is not going to break correctly designed analyses. It's only going to break those which were not reproducible in the first place.


Hence the quotation marks; the state of software engineering in many scientific fields isn't very good, and you'll be lucky if people list which package versions they used.


Fair enough. I might have misunderstood your remark originally, though the point stands, of course.


There is a global option controlling this. So if anyone's analysis is broken, they can still set options(stringsAsFactors=TRUE).

Furthermore the functions that are inherently about factors (like expand.grid and as.data.frame.table) still use stringsAsFactors=TRUE.


> So if anyone's analysis is broken, they can still set options(stringsAsFactors=TRUE).

But what are the chances that someone stumbling upon a supposedly reproducible analysis happens to know that in R >= 4.0.0 they need to do this? Especially if the original analysis didn't even specify that it was originally run on R < 4.0.0


There's a good chance they'll assume they need an R version from when the analysis was published, or to enable some sort of "old version compatibility mode" (does R have that?)


But it's a major version change and that's what major version changes are for.


What’s the benefit of the matrix change?

I’ve never really worked with matrix types in R. What can I do in R4, that I couldn’t in R3? What mistakes will I avoid doing now that matrices are a subclass of arrays?


> stringsAsFactors = FALSE is now default breaking a lot of packages (was this really worth changing?)

Definitely worth it. Not sure how many packages it will break and whether it would count as breaking at all.


Finally! I always forget to add it, and it inevitably causes frustration.


Congratulations to the people behind the release. Thank you guys very much.

Also, please don't pay attention to the haters. There are people who hate R, and those who wholeheartedly despise Python; those who loathe Java and plenty of those who can't stand Lisp. Though, R for some reason tends to trigger the most exemplary manifestations of this phenomenon.


The breaking changes are all quite severe, I hope there is real demand for the features they are bringing in.

Even if they don't agree with the direction Wickham's Tidyverse is going, it showcased how flexible the R language is to being rewritten from the inside. Hadley effectively pulled off a more significant language upgrade with fewer breakages - the R core team could learn something from him here. Even their sensible na.rm default for new functions is introducing more weird inconsistencies to R.


Tidyverse introduces plenty of breakages, and as noted elsewhere does so relentlessly between its own versions. Just check how the docs have evolved on ggplot2 for one example.

Tidyverse seems to be very much aimed at solving the initial learning curve. However, ANY system which allows non-experts to fake expertise denies them the ability to achieve real expertise, and also takes away skills from existing experts.

I'm grumpy about some of the reasons given for Tidyverse superiority: The stringr package is supposedly somehow better because all of its functions start with "str_" so you know you are working with strings. Huh? Because it replaces/wraps a lot of functions which begin with grep so you know you are working with regular expressions. Should we also create a "numbr" package which renames every function which works with "numbers"? Or maybe a subset of numbers, maybe we need a package called "integr".

data.table blows tbs out of the water, once you learn how to use it. Yes, it is harder to learn. Yes, it is very much worth it.

Yes, Hadley pulled off a language upgrade, and with the power of RStudio behind him has a strong hand indeed.

Just wonder if it is better or worse?


Stringr is better because the functions are all vectorized in simple and consistent ways.
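
A small example of what I mean (str_starts assumes a reasonably recent stringr):

  x <- c("apple", "banana")

  # base R: argument order differs between functions
  grepl("^a", x)      # pattern first, strings second
  startsWith(x, "a")  # strings first, prefix second

  # stringr: the string vector is always the first argument
  stringr::str_detect(x, "^a")
  stringr::str_starts(x, "a")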


Agreed! String manipulation functions in base R work differently in ways that make data cleaning annoying. Stringr frees me from having to remember that stuff and lets me focus on what I'm trying to do instead.


At this point Wickham is almost developing a parallel language that happens to be backwards compatible.


Has the R core team publicly stated that they disagree with direction Wickham's Tidyverse is going? Genuinely asking as I love the Tidyverse, but would be interested to hear arguments against it.


I am not sure why the original commenter said that tidyverse is more backwards-compatible compared with R core. It used to introduce breaking changes every 6 months or so.

Also they like it this way and promote it: https://twitter.com/hadleywickham/status/1175388442802479104


They still do.

I like the tidyverse, but Hadley's struggles with lazy evaluation and arguments have cost me lots and lots of time updating internal code at various workplaces.

Don't get me wrong, the tidyverse is great, but if I was writing R code that I expected to run without supervision for a long time, I'd avoid it as much as possible.


There are more than enough ways to keep it going. This has been addressed by several tools.

Personally I have some tidy code that is 8 years old and it still works.


Yes, and I have plenty of it which does not.


If you are looking for a faster, more concise alternative I highly recommend data.table.


perl is also very concise


Moving up to a major version (1.0) implies there could have been breaking changes if you follow semver. And including something like `unnest_legacy()` is helpful for people making the transition.

Just like `stringsAsFactors=FALSE` happened on a transition from 3.x to 4.x, because it is breaking, dplyr 1.0 had breaking changes.


Hadley is a member of the R foundation for what it's worth: https://www.r-project.org/foundation/members.html


R core and R foundation are different things: https://www.r-project.org/contributors.html


R is the quirkiest hard-to-remember programming language I have ever tried learning.


Kudos to all contributors. Learning R and using it was a lot of fun and really satisfying experience. Great option for someone like me, who has to do some basic analysis of large data sets without rich statistical background.


Did you learn R for your own fun or just to upgrade your skill set and get a job?


In my case it was Q vs R. For what I was trying to achieve and with my level of statistics background R was an obvious choice. The language/runtime AND RStudio are entirely free. Give it a try!


kdb+ is the only tool I've come across that solidly managed to outperform data.table for data processing. I have huge respect for good Q programmers


Likewise. A good kdb dev is a very rare and intelligent beast, hats off. I'm just too stupid to be able to comprehend much of Kx's ideas and so R was just much more appealing to me.


In my experience even bad Q programmers can outperform. The language and underlying engine is pretty efficient...


...better set of options: 1. for fun 2. get a job or the job done 3. all of the above


I’m disappointed 64-bit integers are not mentioned.

That is an unusual omission in a current language, and the add on packages are not always a substitute.

Is anyone here aware if that is planned to be a core addition?


My daughter is in grad school, learning programming for the first time in her life. She's learning R and Python at the same time. She's finding R much more intuitive.


Speaking as a programmer, R is deeply unintuitive, but...that's for an experienced programmer's intuition. I believe it is, if not intuitive, then at least less unintuitive for someone with a background other than programming (like, say, statistics).

The typical programmer's boggled response to R, somewhat like their response to SQL or CSS, is that most of the languages they look at are descended from C, and anything that is not looks weird. Rather like someone who knows only Indo-European languages encountering a non-IE language for the first time.

Of course, any real-life language, whether R or CSS or SQL or C or anything else, has plenty of actual defects to complain about. Like, for example, "=" meaning "change the thing on my left to be the same as what's on my right". But if it's the same defects that you're used to, you don't see them as much.


I agree that if you have a background in a common programming language, R is not intuitive. I know it from personal experience. However, for someone with no programming background, R may be quite reasonable. Just judging by my daughter's experience.


Agreed.


Is she being taught a particular framework in Python, e.g. pandas for doing statistics? When I was teaching programming to non-CS grad students, I chose Python for being more all-purpose and explicit in its syntax/conventions (the latter being not too important for most novices). But I envied the relative ease of setup for R – basically, download R Studio – and how quickly anyone can turn a dataset into nice-looking viz thanks to ggplot2.


Being taught is not really what's taking place. It's more like: here's the task, for example the LSVT dataset from UCI - go and build a classifier. Obviously, pandas is a workhorse, and pytorch is also on the horizon. But in reality she's learning by Google and skype calls with Dad. I have to say I'm stoked that I'm able to help my daughter in grad school with homework. So many people tap out in elementary school.


Hmm. A pocket knife is more intuitive than a CNC machine, especially to someone who's never seen either. Hard to recommend it for most purposes, though, particularly when scale and quality matter.


R is the worst programming language I've touched. I usually hear in response, "it may be quirky, but it has a lot of well-tested functions useful for data analysis."

This confidence seems misplaced. Experienced developers using better languages with better tooling create bugs all the time. How likely is it that R packages, written by specialists (or grad students) whose main focus is often not programming but some other discipline, in a language full of "quirks," are going to be really reliable?

One quirk that got me the last time I touched R was that functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

Fun fact, a lot of the R source code is machine translated from Fortran to C, but it has been cleaned up a lot. https://github.com/wch/r-source/search?p=1&q=f2c&unscoped_q=...


R's killer app is interactive analysis. Import some data, graph it, look at the graph, do some clean-up based on what you noticed, graph it again, run some summary statistics, fit a model, graph the residuals from the model, etc. It's amazing for this - arguably the best available open source software for this type of task. Comparable tools for this would be MATLAB, Stata, Mathematica, etc. (all with strengths in slightly different domains).

It's comparatively very weak for "software development" - writing libraries, long-running programs, etc. Even data cleaning/data analysis scripts are a headache beyond a certain length or number of contributors. People tend to have the same complaints about MATLAB, Stata, and Mathematica.

These are related - a lot of the features that make interactive use concise and easy to pick up end up being kind of a mess to handle consistently in a larger program.

Python is at a different point on this spectrum. It's moderately good for interactive analysis, but there are still a lot of tasks that are concise and easy in R that require a bunch of verbose object manipulation in Python. This is a choice - "explicit is better than implicit", etc. Python is also moderately good for "software development" - it has enough consistency and code structuring features that you can write libraries, systems, and infrastructure while collaborating with a small number of other people pretty easily. You still hit a point where Python's dynamic features make things like refactoring harder than you'd like, but it's usable.


Wholeheartedly agree.

I used to do mostly data analysis in my day-to-day work and R was my go-to and absolute favorite language for years in terms of usability for data analysis. Doing actual software development in R is quirky at best, to be honest.

Nowadays I write code for research that requires 'actual' software development, so I've been using python almost exclusively (with pytorch under the hood, which I love.) No doubt, python is a better language for software engineering.

Nevertheless, for analysis I just cannot warm up to numpy/pandas/matplotlib _at all_. When it's time to analyze results of my experiments or produce publication level graphics, I write my python results to disk and use the tidyverse as a last mile solution.


This is exactly why I still use R. Throw data in a file, suck it into R, start playing with slicing and visualization with a few commands. Once I get something interesting, a few more tweaks and I have a nice, descriptive graph.


RE: functions/variables -- you don't quite have this right. Let's walk through when this can occur and when this can't. First, this cannot occur if you're talking about your local environment:

  z <- function() { print("hello world") }
  z <- 3
  z # Outputs 3
  z() # Errors because there is no function z()
The actual quirk you're talking about has to do with attaching other namespaces. So, the way R works is that by default it loads several packages. For instance, the reason you can call `mean()` is that the "base" package is attached when you open a blank session in R. It is absolutely possible to have multiple symbols take the same label across namespaces. Here's an example:

  mean <- 4 # Creates variable called mean in the local namespace
  mean(c(5, 10, 20)) # Uses the mean function in the attached base namespace
  mean # Uses the mean variable in the local namespace
But this is not unique to variables versus functions, it also works with functions:

  mean <- function(x) { sum(x) * 100 }
  mean(c(1, 2, 3)) # Outputs 600
  base::mean(c(1, 2, 3)) # Outputs 2
  rm(mean) # Drop local namespace function
  mean(c(1, 2, 3)) # Outputs 2
You can use sessionInfo() to see the order of attached namespaces that will be searched. In RStudio, you can also press the little dropdown arrow next to "Global Environment" in the environment pane to see the order of attached namespaces -- it'll search them in that order. Alternatively, you can be defensive and always prefix all functions in any namespace with the namespace name (equivalent to always using module.function style function calls in python).

Finally, you can use the built-in function "find" to see the order in which R will try to resolve a symbol, e.g.

  sum <- function() { print("i like sum coding in R!!!") }
  find("sum") # .GlobalEnv first, base second

I'm not sure this is a quirk. What are your options as a namespaced language? 1) Allow users to import multiple functions/variables with the same name across different namespaces and resolve the conflict via some kind of hierarchical order; 2) Don't allow users to import multiple functions/variables with the same name, so you can never use overloading to monkey patch; 3) Always require namespace prefixes at all times; 4) Make function dispatch a blocking operation that asks REPL users which to use? I dunno, it doesn't seem to me like R's approach is any less sane.

I guess one thing that is different about R's approach is that "built-ins" have no special priority, they're all part of some namespaces that are attached by default but otherwise exactly like third-party libraries (base, stats, graphics, etc.)

In R, the only reserved words that cannot be overloaded are while, repeat, next, if, function, for, and break. (Note: else does not appear here because of a genuinely baffling quirk about how else is implemented in R)


As much as I love R, there are indeed quirks that I think R allows to happen but shouldn't.

For example:

    mean(2, 4) # outputs 2
    mean(c(2, 4)) # outputs 3
I really don't see why R allows you to enter mean(2,4) without giving an error.


Thanks for the explanation. Yes, your second example is something like what I found in R code I was asked to modify.

What I find distasteful is that when calling mean(), the resolution of this name depends NOT on whether the local variable mean has been defined, but whether it has been assigned a function. This is illustrated by your 2nd and 3rd examples.

Of course if you are used to it, it may not catch you by surprise.


Can an alias be an option to resolve conflict: “import mypackage.mean as mymean”?


No aliases (afaik), but convention is that you explicitly call mypackage::mean in cases where names might even hint at being ambiguous.


On a tangent here, but is there a reason to use <- instead of = for assignment?

Actually, to answer my own question (I know, Google), I found this very informative Stack Overflow answer [1]. My TLDR: for ordinary assignment there's no real difference, other than "<-" being more likely to cause carpal tunnel syndrome.

[1] https://stackoverflow.com/questions/1741820/what-are-the-dif...
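
One wrinkle worth knowing, though: inside a function call, `=` names an argument while `<-` actually performs an assignment. A quick sketch:

  mean(y = 1:4)   # error: 'y' is matched as an argument name, nothing is assigned
  mean(y <- 1:4)  # assigns 1:4 to y in the calling environment, then returns 2.5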


> How likely is it that R packages, written by specialists (or grad students) whose main focus is often not programming but some other discipline, in a language full of "quirks," are going to be really reliable?

Do you have the data to back that up, or is it just a question that you don't have an answer to?

I think the statement brushes off the high-quality R packages already in the ecosystem that aren't found anywhere else. The authors of The Elements of Statistical Learning created the glmnet package, which took Python years to even get a port. There are many other packages out there that Python does not have.

I am not going to argue it's the prettiest language.

But there are tons of packages, and many are found nowhere else in other language ecosystems.

If you're going to say "just use REST/RPC" then it just defeats the purpose of your initial argument.

And just look at Springer or, gosh, the other publisher has escaped my mind right now, but these publishers have books on statistical subjects with R packages accompanying them.

It may be ugly but the packages are maintained by experts in the field of statistics, and they may not be programmers. But they do dogfood their packages and use datasets to check the results. Likewise, just read up on the ranger R package paper (https://arxiv.org/pdf/1508.04409.pdf). They test their output against the other random forest package.

And it's silly to point this argument at just R when the same happens with Python. The bootstrap function in scikit-learn for the longest time didn't even really do bootstrap. The linear regression function automatically does shrinkage with no option to turn it off.

No language is perfect. But I believe R has a place, especially in statistics. Many, many wonderful expert statisticians are maintaining and creating R packages (e.g. Dr. Frank E. Harrell Jr.). Many R packages have accompanying papers published here: https://www.jstatsoft.org/index


> functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

The two namespace approach makes it harder to assign 4 to foo and then call it, so I don't understand this comment.

You can assign 4 to foo and try to call it in a language that has one namespace.

  ;; Scheme, one namespace:
  (let ((foo 4))
    (foo)) ;; oops

  ;; Common Lisp, two spaces:

  (let ((foo 4))
    (foo) ;; still calls the foo function,
          ;; not related to or shadowed by the above variable.
    (funcall foo)) ;; oops
The usual valid complaint about two namespaces is that funcall is required all over the place in code that works with higher-order functions, which uglifies the code, and that an operator like (function foo) or its abbreviation #'foo is required to lift references to functions as values instead of just writing foo, which likewise uglifies code.


R to Python convert here (numpy, pandas). Agreed about the general claim of clunkiness, but at least for statistical computing R still wins because of the richness of the long tail of packages (representative example: https://cran.r-project.org/web/packages/poweRlaw/index.html). In my experience, Python equivalents are much less developed and documented, if extant. Many data scientists I know would disagree with this claim, but that's because they tend to stick with things supported by scipy, pymc3, statsmodels, and a few other common libraries.

One solution I have found for small and medium data is to use rpy in Jupyter to let me keep most of my workflow in python, then shuttle stuff to R for exotic tests or to use key packages (ggplot, brms, lme4).


Separate namespaces for function names and data variables is a feature inherited from Lisp which is the language R evolved from. Not really a quirk. Common Lisp and Emacs Lisp also have separate namespaces, while Scheme has a single namespace.

The thing about a lot of R code not being written by programming specialists is a valid point, but then again not many programming specialists are also specialists in statistics so... the alternative is what?


R might be a bad choice to build software but for statistics and data analysis I haven't encountered a better tool. It's also trivial to simply write functions in C/C++/Java/Fortran and call it from R.


Exactly! I use R every day as the head of the data science department at a corporation. Most of my work is medium-sized data analysis projects and nothing can touch R for that level of work.


> functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

The most famous language with this property is Lisp.


Unix shells also compete in this category.

We can have a $foo variable, and a foo command/function.

GNU Make is another one, sort of.

Makefile:

  warning = abc
  $(info abc = $(warning))
  $(warning what?)
Output:

  abc = abc
  Makefile:3: what?
  make: *** No targets.  Stop.
$(warning) is a variable, whereas $(warning args ...) is an operator call.

If a macro is stored in a variable V, it cannot be called as $(V args), but using $(call V args). That's analogous to funcall in Common Lisp.

Looks like the Lisp-2 approach is well represented in the famous language scene. :)


You can't really think of R as a programming language in the sense of how we normally think of programming languages. I see it as more of a computational scripting language. It is great at what it does.


At the same time there are companies in which there's R code running in production to serve ML results in real time. No idea how common it is, but I've worked for such a company. My point is that R isn't just for scripting and interactive use even if that's its best use case.


A plumber API in front of your ML serving function, running in a Docker container. There, simple, fast enough. We have multiple R containers in production, it is working fine and the data scientists are happy that they can keep working in RStudio.


> One quirk that got me the last time I touched R was that functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

Either I'm greatly misunderstanding or this is just plain wrong - assigning something to the same name as an existing function will just overwrite the function.


You're correct provided the overwrite is occurring in the same namespace (e.g. the local environment). He is correct for the wrong reason if you try to alias something from another environment. For example, the following R code may initially seem weird:

   my_vec <- c(1, 2, 3)   # c() is found in the attached base namespace
   c <- 4                 # creates a variable named c in the local environment
   print(c)               # prints 4: the local variable is found first
   my_vec2 <- c(5, 6, 7)  # still works: in call position R skips non-function bindings and finds base::c
But as I describe in my top level comment, it has nothing to do with variables versus functions, just how namespaces work.


If you've gone from SAS to R, R seems very much improved.


> it has a lot of well-tested functions

I tried to do a simple word cloud based on a Twitter search in R. The whole thing was plagued by mysterious failures due, AFAICT, to weird, rare, non-standard UTF characters in the source data that were crashing some of the R libraries I was using to clean the data.

Having done that before in Python, it was shocking to see how fragile the R ecosystem is.


As another example: I tried to get some R code distributed with MPI using a package that claims to do this. If I remember correctly, the package would generate and execute a shell script to spin up subprocesses. Then it would shuttle code (including huge matrices serialized to ASCII) over a socket, to be eval()'d on the other side. The secret codeword "kreab" would terminate the connection, so this could appear nowhere in the code sent over the socket.

This is clownshoes software quality. And it's used for important scientific research.


First off, R may be annoying to learn (for instance, just concatenating two strings takes much more typing than Python, say) but once you get to data analysis it’s a dream. (And certainly more Pythonic than Python’s alternatives, which result in code cluttered with lots of periods and references to packages and sub-packages everywhere). Also, the packages work really well with each other—-with Python it seemed like every time I’d upgrade packages with pip, I’d end up with a new conflict I’d have to sort out. And, frankly, having to rely on Anaconda to keep everything from breaking is not exactly ideal. So props to CRAN!

Gotta say, though, one of the most annoying things about R is the name. Entering “R” as a search term on job boards tends to lead to a lot of not-very-helpful results.

Also, it’d help if the version release names had more order to them. So you’d read some ridiculous phrase like “parachute trombone” and know that it must’ve been released after “mouse parade”.


Is it still the case that when you have an error in your script (and aren't using RStudio) that you don't get the line number of the error, you just get a traceback?


What impact does this have on people _learning_ R?

It's a new major version - does it require significant updates for training materials? Are there a lot of outdated idioms now?


I love R. But working with 50 million rows in RAM is a pain. Any way to do it differently? I use data.table BTW, so stringsAsFactors is effectively always FALSE.


For what purpose? There are tons of ways to do it differently, see here: https://cran.r-project.org/web/views/HighPerformanceComputin...


Buy more ram?

If you're regularly running into memory performance issues despite using data.table, and you don't feel like fiddling with mmap, consider looking into an APL variant.


Will look into mmap. Thx


Please start release announcements by saying briefly what it is! "R is a programming language for statistical analysis" or whatever. Don't assume that people are familiar with every possible piece of software already.

Also pull important stuff to the top of the announcement. Were there security issues that need you to upgrade? What are the major new features that would encourage you to upgrade?

For example here's a release announcement I wrote recently: https://www.redhat.com/archives/libguestfs/2020-February/msg...

Free software doesn't usually have an advertising budget, so you have to educate people on what your software does at every opportunity you get.


This was posted to the r-announce mailing list.

I think it's reasonable to assume that this message was intended for subscribers of that mailing list, and that the people who chose to subscribe already know what R is.


And yet the message appears here, on a general tech news site, and no doubt many other places. Assuming only R-announce readers will see it is plainly wrong.


> For example here's a release announcement I wrote recently: https://www.redhat.com/archives/libguestfs/2020-February/msg....

Request to please start release announcements by saying what a "Network Block Device (NBD) server" really is. Don't assume that people are familiar with every possible piece of software already.


This is a fair point. I think earlier versions did try to explain what NBD is but they were a bit wordy, and just expanding NBD was deemed sufficient. Next time I'll see if I can get that better.

Of course there's a level beyond which you don't really need to go - I wouldn't suggest explaining what Linux is or what software is.


Frankly my reply was just rhetoric, and I don't like the suggestion of explaining the base of things in release notes. If someone is interested enough they would care to look it up. Those who are not interested would skip the notes anyways.

Release notes are usually for people who are already using it.


To be fair, the announcement was posted to the R-announce mailing list, and it seems probable that subscribers would know what R is.


But here we are reading it on a general tech news site.


That's not the fault of the R developers, who wrote a message appropriate for the audience to which they sent it.


And they never imagined that the announcement of a major release of a popular piece of software would go beyond the mailing list?


Should every message be written with all possible audiences in mind?


It's lucky I never said that because obviously that would be stupid. However the release of a major version of a popular piece of software should be expected to go beyond the mailing list - and guess what, it being on the front page of HN all morning proves that exact point.


Looks good


I hope there was an R 2d2 release at some point...


There was, just not from the R team. My project for calling from R into D was originally named rtod2 (playing off D version 2).



