R 4.0 (ethz.ch)
306 points by _fnhr on April 24, 2020 | 162 comments



This alone is reason to upgrade -> "R now uses a stringsAsFactors = FALSE default, and hence by default no longer converts strings to factors in calls to data.frame() and read.table()."
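
For anyone who hasn't been bitten by this yet, a rough sketch of the old vs. new behaviour:

  df <- data.frame(x = c("a", "b"))
  class(df$x)
  # "factor"    on R < 4.0  (stringsAsFactors defaulted to TRUE)
  # "character" on R >= 4.0

  df_old <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
  class(df_old$x)  # "factor" -- you can still opt back in explicitly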


This is one of the primary reasons I started using read_csv from readr rather than read.csv from base R. Every time you teach someone to read a CSV using base R, this always came up somewhere in the middle of an analysis because those strings were read as factors.


Yes, factors absolutely can be a pain in the ass when they're instituted too early in an analysis. It's better to keep them as strings for as long as possible and only convert to factor when you've cleaned up your data. Otherwise, you have to deal with annoying and confusing factor manipulation.

The drawback is that character takes up so much space, but these days memory is so bountiful it usually doesn't matter.


Nowadays strings are stored in a "string pool" anyway, so if you have a string that can be turned into a factor (i.e. with few unique variants), you probably don't need to.
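
A rough way to see the trade-off (sizes approximate, 64-bit R assumed):

  x_chr <- sample(c("low", "medium", "high"), 1e6, replace = TRUE)
  x_fct <- factor(x_chr)

  object.size(x_chr)  # ~8 MB: a vector of pointers into R's shared string cache,
                      # so the three distinct strings are stored only once
  object.size(x_fct)  # ~4 MB: integer codes plus a levels attribute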


But readr gives you a tibble instead of a plain data.frame and that adds a bunch of other headaches.


How so? I can't think of any drawbacks of tibbles vs data.frames?


Using tibbles outside of tidyverse can be dangerous.

    df_iris <- iris
    tb_iris <- tibble(iris)

    nunique <- function(x, colname) length(unique(x[,colname]))

    nunique(df_iris, "Species")
    > 3

    nunique(tb_iris, "Species")
    > 1
Imagine now using some complex function from a repository (e.g. Bioconductor) that works on data.frames and passing a tibble to it.


Post edit: Preserving the post below, which I think highlights some of the issues in the parent commenter's example, but it turns out they are correct on another point -- narrowly, in their example, the `[` subset operator dispatches differently on tibbles than on data.frames, so you can produce weird behaviour. So to anyone reading, please consider upvoting the parent and reading the rest of the thread.

Original post follows:

Right off the bat, the problem is not "using" tibbles, it's that you've incorrectly constructed one by passing the data through the tibble() constructor rather than using as_tibble(). The tibble constructor -- for pretty good reasons in other circumstances that seem crazy to you here because of your intent -- infers that you want the entire data frame to be a single column inside the tibble, called "iris". It does this because it evaluates the variable name passed to the tibble constructor as both the intended column name and the data to be placed inside the column. This demonstrates nesting, which is one of the great features of tibbles and otherwise used for a bunch of stuff.

If you had done `tb_iris <- as_tibble(iris)`, it would have worked fine. `as_tibble()` is the function to convert an existing data structure to a tibble. R is obviously not "type safe" in any way, but you can engage in defensive programming, and one way you can do that is being hyper-aware of the steps you take during type conversions. If you check the documentation for `tibble()`, it tells you explicitly to "Use as_tibble() to turn an existing object into a tibble." Is there a reason you didn't? Imagine this related example:

  my_string <- "10"
  numeric(my_string)
  as.numeric(my_string)
Would we conclude that "using the numeric type can be dangerous" because the constructor interpreted the argument differently than the conversion helper?

Second, I suspect you must be using extremely old versions of things, because on more recent versions, your nunique function would fail, not produce 1. I correctly get "Error: Can't find column `Species` in `.data`." This error message is maybe a little confusing if you don't check the structure `str(tb_iris)` of tb_iris to see what I mentioned above, but is the correct error to output in light of it. You'd also be able to flag this by just checking `colnames(tb_iris)` or `View(tb_iris)` if you're working in RStudio or using the embedded environment pane or really any other way of looking at the data.

But your broader point is also false. Once a tibble has been formed, it should work EXACTLY the way a data.frame works because R objects can have multiple classes. The only thing that makes a tibble different than a data.frame is that it has an additional class label. All dispatches that work on data.frame objects work on tibbles because of how multiple classing works in R. This has been a goal since the beginning of tibble. The one exception I'm aware of is external functions that incorrectly check `if(class(obj) == "data.frame")` instead of using `is.data.frame()` or `if("data.frame" %in% class(obj))`. The former is and always has been incorrect because of how multiple dispatch is designed to work in R and should generate an error with multi-classed objects because the if statement evaluates to a vector of logicals instead of a logical.
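
To make the class-check point concrete, a quick sketch:

  tb <- tibble::as_tibble(iris)

  class(tb)                    # "tbl_df" "tbl" "data.frame"
  is.data.frame(tb)            # TRUE
  "data.frame" %in% class(tb)  # TRUE
  class(tb) == "data.frame"    # FALSE FALSE TRUE -- a logical vector, not a single TRUE/FALSE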

One way you can tell that tibbles and data frames are identical, save the above caveat, is to run the following code:

  df_iris <- iris
  tb_iris <- as_tibble(iris)
  identical(df_iris, tb_iris)
  class(tb_iris) <- "data.frame"
  identical(df_iris, tb_iris) 
Note that you are not "downconverting" a tibble into a data.frame in this code (but that would work too) -- you are taking the tibble exactly as is and hacking its class label to look like a data frame. It's identical because a tibble was always a data frame.


I think everything you wrote here is false, so I am not sure how to reply. Will try to keep it respectful and short:

First, about the as_tibble - it returns the same thing as tibble:

    tb_iris <- as_tibble(iris)
    length(unique(tb_iris[,"Species"]))
    > 1
Second, about the incorrect version:

    > packageVersion("tibble")
    [1] ‘3.0.1’
Which is also the current version on CRAN.

Third, about the classes:

You say:

> Once a tibble has been formed, it should work EXACTLY the way a data.frame works because R objects can have multiple classes.

This is not the case. You can add any class to any object in R S3 system. So people behind tibble can call their tibble a data.frame but it gives no guarantee that it will behave like one.

More about this problem here (and you can also find replies from tidyverse authors) https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...


Actually your reply was very helpful because it surfaced ways in which you were partially right and I was partially wrong.

I highlighted the nesting issue in constructing versus coercing (which is correct and does have implications for what you're trying to do) but actually in your example the distinction is broken because of a different edge case

Which is to say the following:

  ncol(iris) # 5
  ncol(as_tibble(iris)) # 5
  ncol(tibble(iris)) # 1

  iris$Species # Works
  as_tibble(iris)$Species # Works
  tibble(iris)$Species # Errors because of nesting

  iris[, "Species"] # Works
  tibble(iris)[, "Species"] # Doesn't work
  as_tibble(iris)[, "Species"] # Works
 
However, you're correct that because the subset operator for tibble doesn't drop dimensions, length gets you the number of columns rather than the number of observations. This does speak to the fact that length is a pretty shitty function to begin with, but I concede you're partially correct there.
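
Spelling out the drop difference with a quick sketch:

  length(iris[, "Species"])             # 150 -- drop = TRUE returns the bare factor
  length(as_tibble(iris)[, "Species"])  # 1   -- drop = FALSE returns a one-column tibble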

You are also correct that because class labels are not contractual, there is no guarantee that having the data.frame fallback label means stuff behaves identically (for instance, you could add the data.frame label to any data structure and the data.frame dispatch stuff would not work properly). My point was that in the case of a tibble, a tibble is literally a data frame with an additional class label. If you remove that class label, it's exactly identical.

But your example and linked discussion do highlight a way in which I'm wrong; the subset function is overridden for something with a tibble class label. That's true and could produce edge cases I hadn't considered.

Apologies for any hostility in my original reply.


I'm sorry to report that this analysis is completely wrong, and demonstrates a lack of understanding of the R object model. The class that is provided by tibble does not implement all of data.frame, and the OP is correct.


(S3 -- see footnote) Classes don't "implement" anything in R the way they would in other languages. They are labels that tell dispatch functions how to deal with an object. A tibble is internally a data frame. The last example in my post makes this exactly clear.

The other OO systems in R do act closer to traditional classes, but all the tidyverse stuff is S3.

(But the OP was correct in another sense related to the example narrowly!)


So you're ignoring that the [-function by design works differently for tibbles than for data frames. This isn't really a problem with tibble but with sloppiness in programming allowed by dynamic languages.

I personally think it's a good thing that the drop-argument defaults to FALSE for tibbles, since data frame's default drop = TRUE is a source of frequent bugs. The change of the default for this parameter is the source of your observation.
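
A quick sketch of the classic drop = TRUE surprise on plain data frames:

  df <- data.frame(a = 1:3, b = 4:6)

  df[, c("a", "b")]        # still a data.frame
  df[, "a"]                # silently drops to a plain vector
  df[, "a", drop = FALSE]  # stays a one-column data.frame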


I am not ignoring it, I am _highlighting_ it. The question of the comment above was "why would one prefer data.frame over tibble". I merely answered that question.


Yes, but the problem isn't tibble, since what you're highlighting is a design choice and an argument in favor of tibble. The problem only arises when you're not aware of this design choice, which is facilitated by sloppiness and dynamically typed languages.

One might ask whether it was a good idea that tibble enlists data.frame as an inherited class. Since a tibble obviously doesn't behave like a data frame, one could also argue that this is a mistake on part of the tibble developers but this is a different discussion.


All I am saying is that there are perfectly good reasons for not using tibbles if you do any kind of work outside of tidyverse. And you seem to agree?

As for whether or not tibbles should be data.frames - I posted a link to this exact discussion on R-dev mailing list within this thread, as an answer to a different poster. Here it is: https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...


Ok, now I understand where you're coming from.


Just put an as.data.frame() around it. That's what I do with readxl::read_excel :-)


One of the big reasons why I quit R 10 years ago and never looked back - Python wasn't secretly converting anything, nor failing silently when it's not the expected type.


R's come a long way in the last decade. The tibble and data.table packages both address this issue. data.table (https://github.com/Rdatatable/data.table) is the more strongly-typed of the two, by default it fails loudly when it encounters data that doesn't conform to the column type. It's also quite fast--binds to C code that parallelizes with OpenMP. It has very terse and expressive syntax, I find it so much more intuitive and easy to work with than pandas.

If you're happy with Python, by all means keep using it. I use both languages. Just suggesting that if you gave up on R that long ago, you might be pleasantly surprised by how much better it's gotten since then.


vctrs [0] is the latest effort by the Rstudio developers to help people write type-stable code. The R standard library has a lot of issues with silently casting types, but the wonderful thing about it being so scheme-like is that many of these things can be evolved through libraries.

[0] https://github.com/r-lib/vctrs


By python, are you also including pandas? Because it is definitely doing a lot of that!


Exactly! TensorFlow is the only Python data analysis package I've used that doesn't automatically convert things in the background. I was helping a friend with STATA the other day, which doesn't automatically convert, and I realized I've gotten so used to that behavior in R and Python.


I stopped using R a number of years ago because it is not useful for very large datasets (in the tens to hundreds of millions) and now use kdb almost exclusively.


R has quite a few specialized libs to deal with large datasets (out of memory). Nothing keeps you from hosting the data in a DBMS and using SQL (or dplyr) to pull the data in an appropriate format.


I run a data science department at a corporation and this is exactly how we handle our massive amounts of data. It's rare that we're using a billion+ data points in one model so we use SQL to get the data we need in the format we need and move forward from there in R.


The new C++-derived syntax for string literals seems to me to be the top new feature. It will make it possible to support markdown, LaTeX, R code and Windows path literals without munging them first.


If I understand correctly, does this mean that on Windows, when I've copied a file ___location, I would no longer have to replace backslashes in the file path with double backslashes or forward slashes? Or am I off-piste?


Correct. One can write r"(c:\Windows\System32)" . Check out additional examples at the end of the help file: https://github.com/wch/r-source/blob/trunk/src/library/base/...


Awesome


I'm so used to using data.table to import data I forgot this was still a thing


One of the selling points of tidyverse’s readr was “no stringsAsFactors = FALSE necessary.” That’s how annoying it is.


You could add to your .Rprofile

  options(stringsAsFactors=FALSE)


This is likely to break things when you share or publish your code.


Upgrading will also break things. And code which depends on defaults will be broken either in new releases or in older releases.


Yes, I mean, we could list things all day that make R a horrible environment for reproducible analysis or production deployment, but it is what it is.


You’re right. My defensive reply was unwarranted.


Upgrading makes the changes explicit, tinkering with your environment variables doesn't.


Upgrading is not different than changing environment variables as far as breaking existing code is concerned.

You can always run R --no-init-file to be sure that you have the default settings. Now you have to know what default settings the code that you want to run expects.


And make your code potentially non-reproducible?


If you care about that don't rely on defaults. This upgrade makes old code non-reproducible, should everyone abstain from upgrading?


Well, that is why there is 4.0. Hadley Wickham has had a HUGE influence on R, and now we have a lot of new things we can use that make analyses reproducible in base R.


This makes me want to sing and dance. Ding-dong, the stringsAsFactors witch is dead!


I thought exactly the same thing! Been a long time coming…


My thoughts exactly when I saw that bullet point.


As someone who writes R daily, I’m really excited for 4.0. That said, R still leaves a lot to be desired. Changing the default for stringsAsFactors is great, and I think it reflects a small shift in R from being an exclusively stats-based language to something more general purpose. The rationale for stringsAsFactors is that in most statistical models you need your categorical variables encoded as factors.

That said, R still is a stats language by design. In Python or JS for example, you can concatenate strings with ‘a’ + ‘b’ but the + operator in R is explicitly only for numeric types. R also has a horrible architecture for memory management, leading to code that uses profound amounts of RAM. I face this issue constantly as I work with very large datasets.

I have a love/hate relationship with R and despise using it in production. I’m also not a fan of the divergence that Tidyverse has caused. Particularly the expectation of Non-standard evaluation and the tendency for new R learners to become dependent on these packages. Especially as it relates to reproducibility and deploying code, these unnecessary dependencies suck. Tidyverse is maturing and breaking changes are still too common for comfort. There is no reason in my opinion to load stringr when a grep() will suffice. Or, to subset with select when [[ works perfectly fine. Or to filter when subsetting on a logical with which()... the list goes on. Tidyverse is essentially reinventing the wheel in many places. The biggest problem is that it doesn’t translate well to base-R in my experience with new programmers, leading to this divergence.

That said, piping with %>% and modifying directly with %<>%(via magrittr) is a pleasure that other languages I’ve worked with don’t manage as well.

And at the end of the day, I’m not going to rewrite implementations of all the latest statistical methods already written in R, and this is its strong suit. I’m increasingly using sophisticated spatial and spatiotemporal methods, and these methods are solely implemented in R.

I understand that R gets a lot of flack from software developers and I understand why. But, I also think it’s too often overlooked for its strong suits.


Maybe stringsAsFactors was a mistake in the original design, but there is so much code out there reliant on this behavior now, and since it was the default you don't really know where the new behavior will bite you besides looking for calls to data.frame() that don't set the parameter.

Plus, it's not such a bad feature when you know it's coming.

As far as the tidyverse goes, I get it now; however, it seems to discourage the creation of a nice, well-organized set of functions to limit the amount you need to keep in your head at the same time, and a lot of R users are very smart people capable of understanding very disorganized code. Instead of functions you get copy/pasted incantations; in base R it's at least broken down into steps, which is a start.


I teach a data science graduate course to non-tech majors, and it's mostly on R. I welcome these changes.

I teach base R first for a few weeks, then I teach the TidyVerse as I introduce data science concepts, like text mining. I convey the TidyVerse as like an overlay on R, improving shortcomings and adding great functionality, syntactic sugar, etc.

This context switch is jarring to some students. It's great to see some of the TidyVerse's strengths -- like this bit of tibble behaviour -- be moved back into base R.

Now pardon me: I need to go. I start teaching dplyr in 51 minutes!


I must admit to not really understanding the TidyVerse attraction, or why I should be doing "tidy data evaluation", rather than using base R.

If I want to use an advanced data manipulation library, I'd typically reach for data.table. If I want to use a verb based approach, why not SQL rather than dplyr.

I have tried dplyr and code I've written a few years ago still imports that package (hopefully it still works), but I just didn't find it was particularly useful or helpful compared to the alternatives.


Tidyverse is more than dplyr. It also includes libraries like ggplot2, for which there really is no peer. Also stringr, for string manipulation, and tidyr, which has a wide variety of very useful utility functions for working with data. dplyr is simply the backbone that connects them all and allows you to move between them without switching thought processes.

dplyr also offers functionality that many don't realize beyond its basic data manipulation. tibbles allow for arbitrary data types for columns, meaning you can have data frames containing your typical strings or floats, but you can also have data frames with columns consisting of nested data frames, or fitted models, or other atypical objects. Once you start working in this way, it can really streamline a lot of complex analytical processes, make your code much cleaner and easier to work with, and allows you to integrate them directly into other tidyverse packages, like ggplot2.

> If I want to use an advanced data manipulation library, I'd typically reach for data.table

Data.table is powerful, but from a usability/readability perspective, most people find it inferior to dplyr. And there are now packages for using the dplyr API and having it run data.table on the backend, so I personally see little to no use for using data.table by itself anymore.
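
For example, with dtplyr (a rough sketch, assuming its lazy_dt() workflow):

  library(dplyr)
  library(dtplyr)

  mtcars %>%
    lazy_dt() %>%                         # dplyr verbs get translated into data.table calls
    filter(cyl == 4) %>%
    summarise(mean_mpg = mean(mpg)) %>%
    as_tibble()                           # forces evaluation and returns a tibble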

> If I want to use a verb based approach, why not SQL rather than dplyr.

This seems like an overly simplistic reduction. There are bigger differences between dplyr and SQL than both being a verb based approach. And even were there not, dplyr is directly integrated within R. It is still much easier to do your data manipulation directly in R and handing it back and forth between dplyr and whatever other libraries you are using, than it is to do the same in SQL (even utilizing SQL queries within R).

The beauty of the tidyverse is that it is unifying the most important aspects of the data science process under a single approach and API philosophy. It integrates in a way that is pretty unprecedented not just in R, but in any programming language (where data analysis tasks are concerned).


I don't want to say that the Tidyverse is inferior, because it's not and it works for a lot of people. Hadley will also know much more about R and always be a better programmer than me.

However, there are other tools which predate the Tidyverse, and which I think do a stellar job at their core competency.

ggplot2 is great, but it was also around long before the Tidyverse concept. But at the same time base R plot() can also be pretty powerful[1] and look great.[2] As an alternative to ggplot2 I would also propose that Vega Lite[3] could be a contender, with an excellent cross language ecosystem.

There are also libraries available for applying SQL on dataframes from directly within R if that's what you want to do. sqldf[4] has been around for a long time, and now there is also the new duckdf[5], which is a bit quicker. Or one can use the DBI[6] library, which requires a bit more coding. Learning SQL is also a great skill which has a lot of value outside R.
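
For instance, a quick sqldf sketch:

  library(sqldf)
  sqldf("SELECT Species, count(*) AS n FROM iris GROUP BY Species")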

Tidyverse may be useful, helpful and convenient for a lot of people, but I think we shouldn't lose sight of the wide R ecosystem which has provided a lot of alternative packages for a long time, perhaps without the marketing and profile of RStudio and the Tidyverse.

[1] http://karolis.koncevicius.lt/posts/r_base_plotting_without_...

[2] https://github.com/KKPMW/basetheme

[3] https://vegawidget.github.io/vlbuildr/

[4] https://cran.r-project.org/web/packages/sqldf/index.html

[5] https://github.com/phillc73/duckdf

[6] https://cran.r-project.org/web/packages/DBI/index.html


> includes libraries like ggplot2, for which there really is no peer

lattice is largely ignored but it is quite similar to ggplot2 in terms of features, and (as a matter of opinion) the plots it produces are aesthetically more pleasing, also great for making subplots using conditioning variables


SQL is incredibly verbose compared to dplyr. Modern non-SQL query languages coming out tend to be more similar to dplyr, based on method chaining or piping data table objects through function calls, like UNIX pipes. It's much more composable than SQL and it just makes intuitive sense as a sequence of data transformation steps in a pipeline.

My favorite feature of dplyr that makes it stand out compared to SQL is that a window function is just a group_by() without a summarize(). There's no separate syntax for "PARTITION BY".
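
A small sketch of that point (mtcars is just a stand-in dataset):

  library(dplyr)

  mtcars %>%
    group_by(cyl) %>%
    mutate(mpg_rank = rank(desc(mpg))) %>%  # a window function: ranks computed within each cyl group
    ungroup()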

For data analysis, whenever possible, I don't even write SQL anymore, I use the dbplyr package, which is a dplyr-to-SQL compiler.


I studied statistics, so R was the first programming I ever learned. I didn't know what it was at the time, but the tidyverse's chaining of operations on a data frame (via magrittr's %>%) was a really cool introduction to functional programming


The pipe operator in R also sort of does my head in.

I learned R in a way that everything went right to left. The variable on the left was manipulated by whatever (functions or other code) on the right.

The pipe operator reverses that flow, with everything now moving left to right, which I find difficult to follow and debug.


>The pipe operator in R also sort of does my head in.

As someone who taught myself base R from scratch in 2013, I (used to!) agree with this. When the pipe operator was first introduced, I’d roll my eyes whenever I saw a script that used it and move along.

But I forced my brain to adapt, and now it’s probably my favorite feature of the R language. Data science is full of sequences of transformations, and in my opinion it’s more readable and bug-resistant to phrase these long chains as:

  f(x) %>% g() %>% h()
rather than:

  h(g(f(x))) 
or certainly:

  foo <- f(x) 
  foo <- g(foo)
  foo <- h(foo)
I can comprehend and modify others’ (to include past versions of me) R code much more quickly with this paradigm. You can quickly debug a chain by commenting out functions sequentially (i.e. first test: “f(x) # %>% ...”). It also becomes much faster to plug new transformations into the chain, when needed.

One thing that helps to keep track of the input x as it moves through the chain is using the “.” placeholder (especially when you need to specify function arguments), like so:

  f(x) %>% 
    g(., n=100, param="baz") %>%
    mean(.$column_name)
Here, the . stands in for “whatever is coming out of the pipe” from the left.


The best use of the pipe operator, in my opinion, is when every pipe is followed by a line break, like this:

  data %>%
    filter(thing > 4) %>%
    mutate(new = fun(old)) %>%
    group_by(var) %>%
    summarize(new_mean = mean(new))
Then you are reading the code top-to-bottom, just like any other R script but without all of the temporary variables.


This only really works if nothing ever fails.

It's great for data analysis and terrible for programming.

The reason it sucks for programming is because you can't debug it or inspect intermediate variables, which is really really annoying.

It's great for one off transformations and plotting, but a really, really really bad idea for programming.

Then you add NSE, which makes it hard to functionalise procedural pipes (especially for people who learned tidyverse) and it's a recipe for unmaintainable and profoundly annoying legacy code.

That being said, I love it for interactive analysis.


Thanks for sharing.

I do exactly the same, and I always find it hard to justify base-R in front of my students, after they learn about the Tidyverse ;)


I find tidyverse a bit slow, and I don't like having to memorize all the verbs. I typically reach for data.table these days as a nice middle ground between base R and tidyverse. Your thoughts?


There’s like 3 verbs.

Select, for columns; Filter, for rows; Mutate, for adding columns or changing values.

On top of that there are a bunch of sugared versions of these, like rename, instead of select or transmute instead of mutate.

When you want to do something advanced you either combine those or go out to see if somebody else has already made that function for you. I don’t think that having functions for joins or removing NA’s is something special to tidyverse. Everybody needs to put such operations somewhere.

I find tidyverse to have a simple, expandable vocabulary that is easy to learn, read and grow.

But it is slow.

I never found any kind of logic behind data.table's API, and I find it hard to read. I haven’t found an introduction to it that makes anything click.

But it is fast and it has these absolutely amazing rolling joins



Thank you for the link. I still end up confused.

How does selecting the 3rd column,

    dt[, 3]
jive with

>`DT[i, j, by, ...]` which means: “Take DT, subset rows using i, then calculate j ...

I don’t understand how column selection and calculating an expression is the same thing.

So now I can’t figure out if

    DT[, sum(V1)]
Will return the column indexed at the sum of V1 or if it will return the sum of the column called V1. How can I tell, from the syntax?


j is always for columns.

So for your first example it grabs the third column. The calculation you’re performing is a column selection.

For your second example, you’re returning all the rows, selecting column V1, and summing it.

I really prefer to name the operations

    dt[rows, columns, groups]
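
Putting those together on a toy table (a rough sketch):

  library(data.table)
  dt <- data.table(grp = c("a", "a", "b"), v = c(1, 2, 3))

  dt[, 2]                 # j as a column position: a one-column data.table
  dt[, sum(v)]            # j as an expression over column v: the scalar 6
  dt[, sum(v), by = grp]  # the same expression, evaluated per group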


don't forget the crucial group_by() / summarise()


Thanks too. I had not heard of the "TidyVerse".

The TidyVerse is an opinionated group of R packages that all function similarly. ggplot2 is in there, and I already use that one.

I do like R for analysis (I choose to use it over Excel or Stata for my data analysis classes), though I wish the help pages were better.

It's a hard language to grok; even the data type names are unclear (data frame vs. matrix vs. vector...). I'll take a look at the tidyverse packages.

https://www.tidyverse.org/learn/

any additional learning resources you'd recommend?


As the other commenter said, probably the best resource out there is "R for Data Science" by Hadley Wickham, the architect of the tidyverse. It will get you up and running with the most important parts of the tidyverse (dplyr for data manipulation, ggplot2 for data visualization, tidyr for general data cleaning utilities, etc). It's available online for free here: https://r4ds.had.co.nz/


The “R for Data Science” book by Hadley Wickham (creator of tidyverse, and I believe he is chief data scientist at R Studio) is hands down one of the best introductions to data exploration and analysis.


To add onto others’ answers, Advanced R from the same author dives more into the inner workings of R (such as lists vs. vectors, what classes are, etc.). It helped me understand why my code works the way it does.


Not a whole lot is new. Highlights (from my perspective):

* matrices are now a subclass of arrays.

* better reference counting to save on memory.

* stringsAsFactors = FALSE is now default breaking a lot of packages (was this really worth changing?)

* a new way of defining raw strings: r"(...)"

* better default colors for plots. The new default color scheme is less saturated. Apart from looking better, in the old scheme something plotted in bright yellow on a white background was hard to read. Yellow is now more of a mustard.


> * stringsAsFactors = FALSE is now default breaking a lot of packages (was this really worth changing?)

In my experience, both when I learned R myself back in the day, and teaching others to use it, this is something that trips up newbies over and over again. So from that perspective changing it is good.

OTOH, a lot of existing code will need updating. Oh well..


> stringsAsFactors = FALSE is now default breaking a lot of packages

I know this is popular on the surface, but it's going to break so many "reproducible" analyses from manuscripts and scientific studies.


Reproducible analyses should be tied to specific versions of software, or even the OS and hardware. So no, this is not going to break correctly designed analyses. It's only going to break those which were not reproducible in the first place.


Hence the quotation marks; the state of software engineering in many scientific fields isn't very good, and you'll be lucky if people list which package versions they used.


Fair enough. I might have misunderstood your remark originally, though the point stands, of course.


There is a global option controlling this. So if anyone's analysis is broken, they can still set options(stringsAsFactors=TRUE).

Furthermore the functions that are inherently about factors (like expand.grid and as.data.frame.table) still use stringsAsFactors=TRUE.


> So if anyone's analysis is broken, they can still set options(stringsAsFactors=TRUE).

But what are the chances that someone stumbling upon a supposedly reproducible analysis happens to know that in R >= 4.0.0 they need to do this? Especially if the original analysis didn't even specify that it was originally run on R < 4.0.0


There's a good chance they'll assume they need an R version from when the analysis was published, or to enable some sort of "old version compatibility mode" (does R have that?)


But it's a major version change and that's what major version changes are for.


What’s the benefit of the matrix change?

I’ve never really worked with matrix types in R. What can I do in R4, that I couldn’t in R3? What mistakes will I avoid doing now that matrices are a subclass of arrays?


> stringsAsFactors = FALSE is now default breaking a lot of packages (was this really worth changing?)

Definitely worth it. Not sure how many packages it will break and whether it would count as breaking at all.


Finally! I always forget to add it, and it inevitably causes frustration.


Congratulations to the people behind the release. Thank you guys very much.

Also, please don't pay attention to the haters. There are people who hate R, and those who wholeheartedly despise Python; those who loathe Java and plenty of those who can't stand Lisp. Though, R for some reason tends to trigger the most exemplary manifestations of this phenomenon.


The breaking changes are all quite severe, I hope there is real demand for the features they are bringing in.

Even if they don't agree with the direction Wickham's Tidyverse is going, it showcased how flexible the R language is to being rewritten from the inside. Hadley effectively pulled off a more significant language upgrade with fewer breakages - the R core team could learn something from him here. Even their sensible na.rm default for new functions is introducing more weird inconsistencies to R.


Tidyverse introduces plenty of breakages, and as noted elsewhere does so relentlessly between its own versions. Just check how the docs have evolved on ggplot2 for one example.

Tidyverse seems to be very much aimed at solving the initial learning curve. However, ANY system which allows non-experts to fake expertise denies them the ability to achieve real expertise, and also takes away skills from existing experts.

I'm grumpy about some of the reasons given for Tidyverse superiority: The stringr package is supposedly somehow better because all of its functions start with "str_" so you know you are working with strings. Huh? Because it replaces/wraps a lot of functions which begin with grep so you know you are working with regular expressions. Should we also create a "numbr" package which renames every function which works with "numbers"? Or maybe a subset of numbers, maybe we need a package called "integr".

data.table blows tbs out of the water, once you learn how to use it. Yes, it is harder to learn. Yes, it is very much worth it.

Yes, Hadley pulled off a language upgrade, and with the power of RStudio behind him has a strong hand indeed.

Just wonder if it is better or worse?


Stringr is better because the functions are all vectorized in simple and consistent ways.
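
A small example of what I mean (str_starts assumes a reasonably recent stringr):

  x <- c("apple", "banana")

  # base R: argument order differs between functions
  grepl("^a", x)      # pattern first, strings second
  startsWith(x, "a")  # strings first, prefix second

  # stringr: the string vector is always the first argument
  stringr::str_detect(x, "^a")
  stringr::str_starts(x, "a")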


Agreed! String manipulation functions in base R work differently in ways that make data cleaning annoying. Stringr frees me from having to remember that stuff and lets me focus on what I'm trying to do instead.


At this point Wickham is almost developing a parallel language that happens to be backwards compatible.


Has the R core team publicly stated that they disagree with direction Wickham's Tidyverse is going? Genuinely asking as I love the Tidyverse, but would be interested to hear arguments against it.


I am not sure why the original commenter said that tidyverse is more backwards-compatible compared with R core. It used to introduce breaking changes every 6 months or so.

Also they like it this way and promote it: https://twitter.com/hadleywickham/status/1175388442802479104


They still do.

I like the tidyverse, but Hadley's struggles with lazy evaluation and arguments have cost me lots and lots of time updating internal code at various workplaces.

Don't get me wrong, the tidyverse is great, but if I was writing R code that I expected to run without supervision for a long time, I'd avoid it as much as possible.


There are more than enough ways to keep it going. This has been addressed by several tools.

Personally I have some tidy code that is 8 years old and it still works.


Yes, and I have plenty of it which does not.


If you are looking for a faster, more concise alternative I highly recommend data.table.


perl is also very concise


Moving up to a major version (1.0) implies there could have been breaking changes if you follow semver. And including something like `unnest_legacy()` is helpful for people making the transition.

Just like `stringsAsFactors=FALSE` happened on a transition from 3.x to 4.x, because it is breaking, dplyr 1.0 had breaking changes.


Hadley is a member of the R foundation for what it's worth: https://www.r-project.org/foundation/members.html


R core and R foundation are different things: https://www.r-project.org/contributors.html


R is the quirkiest hard-to-remember programming language I have ever tried learning.


Kudos to all contributors. Learning R and using it was a lot of fun and really satisfying experience. Great option for someone like me, who has to do some basic analysis of large data sets without rich statistical background.


Did you learn R for your own fun or just to upgrade your skill set and get a job?


In my case it was Q vs R. For what I was trying to achieve and with my level of statistics background R was an obvious choice. The language/runtime AND RStudio are entirely free. Give it a try!


kdb+ is the only tool I've come across that solidly managed to outperform data.table for data processing. I have huge respect for good Q programmers


Likewise. A good kdb dev is a very rare and intelligent beast, hats off. I'm just too stupid to be able to comprehend much of Kx's ideas and so R was just much more appealing to me.


In my experience even bad Q programmers can outperform. The language and underlying engine is pretty efficient...


...better set of options: 1. for fun 2. get a job or the job done 3. all of the above


I’m disappointed 64-bit integers are not mentioned.

That is an unusual omission in a current language, and the add on packages are not always a substitute.

Is anyone here aware if that is planned to be a core addition?


My daughter is in grad school, learning programming for the first time in her life. She's learning R and Python at the same time. She's finding R much more intuitive.


Speaking as a programmer, R is deeply unintuitive, but...that's for an experienced programmer's intuition. I believe it is, if not intuitive, then at least less unintuitive for someone with a background other than programming (like, say, statistics).

The typical programmer's boggled response to R, somewhat like their response to SQL or CSS, is that most of the languages they look at are descended from C, and anything that is not looks weird. Rather like someone who knows only Indo-European languages encountering a non-IE language for the first time.

Of course, any real-life language, whether R or CSS or SQL or C or anything else, has plenty of actual defects to complain about. Like, for example, "=" meaning "change the thing on my left to be the same as what's on my right". But if it's the same defects that you're used to, you don't see them as much.


I agree that if you have a background in a common programming language, R is not intuitive. I know it from personal experience. However, for someone with no programming background, R may be quite reasonable. Just judging by my daughter's experience.


Agreed.


Is she being taught a particular framework in Python, e.g. pandas for doing statistics? When I was teaching programming to non-CS grad students, I chose Python for being more all-purpose and explicit in its syntax/conventions (the latter being not too important for most novices). But I envied the relative ease of setup for R – basically, download R Studio – and how quickly anyone can turn a dataset into nice-looking viz thanks to ggplot2.


Being taught is not really what's taking place. It's more like: here's the task, for example the LSVT dataset from UCI - go and build a classifier. Obviously, pandas is a workhorse, and pytorch is also on the horizon. But in reality she's learning by Google and skype calls with Dad. I have to say I'm stoked that I'm able to help my daughter in grad school with homework. So many people tap out in elementary school.


Hmm. A pocket knife is more intuitive than a CNC machine, especially to someone who's never seen either. Hard to recommend it for most purposes, though, particularly when scale and quality matter.


R is the worst programming language I've touched. I usually hear in response, "it may be quirky, but it has a lot of well-tested functions useful for data analysis."

This confidence seems misplaced. Experienced developers using better languages with better tooling create bugs all the time. How likely is it that R packages, written by specialists (or grad students) whose main focus is often not programming but some other discipline, in a language full of "quirks," are going to be really reliable?

One quirk that got me the last time I touched R was that functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

Fun fact, a lot of the R source code is machine translated from Fortran to C, but it has been cleaned up a lot. https://github.com/wch/r-source/search?p=1&q=f2c&unscoped_q=...


R's killer app is interactive analysis. Import some data, graph it, look at the graph, do some clean-up based on what you noticed, graph it again, run some summary statistics, fit a model, graph the residuals from the model, etc. It's amazing for this - arguably the best available open source software for this type of task. Comparable tools for this would be MATLAB, Stata, Mathematica, etc. (all with strengths in slightly different domains).

It's comparatively very weak for "software development" - writing libraries, long-running programs, etc. Even data cleaning/data analysis scripts are a headache beyond a certain length or number of contributors. People tend to have the same complaints about MATLAB, Stata, and Mathematica.

These are related - a lot of the features that make interactive use concise and easy to pick up end up being kind of a mess to handle consistently in a larger program.

Python is at a different point on this spectrum. It's moderately good for interactive analysis, but there are still a lot of tasks that are concise and easy in R that require a bunch of verbose object manipulation in Python. This is a choice - "explicit is better than implicit", etc. Python is also moderately good for "software development" - it has enough consistency and code structuring features that you can write libraries, systems, and infrastructure while collaborating with a small number of other people pretty easily. You still hit a point where Python's dynamic features make things like refactoring harder than you'd like, but it's usable.


Wholeheartedly agree.

I used to do mostly data analysis in my day-to-day work and R was my go-to and absolute favorite language for years in terms of usability for data analysis. Doing actual software development in R is quirky at best, to be honest.

Nowadays I write code for research that requires 'actual' software development, so I've been using python almost exclusively (with pytorch under the hood, which I love.) No doubt, python is a better language for software engineering.

Nevertheless, for analysis I just cannot warm up to numpy/pandas/matplotlib _at all_. When it's time to analyze results of my experiments or produce publication level graphics, I write my python results to disk and use the tidyverse as a last mile solution.


This is exactly why I still use R. Throw data in a file, suck it into R, start playing with slicing and visualization with a few commands. Once I get something interesting, a few more tweaks and I have a nice, descriptive graph.


RE: functions/variables -- you don't quite have this right. Let's walk through when this can occur and when this can't. First, this cannot occur if you're talking about your local environment:

  z <- function() { print("hello world") }
  z <- 3
  z # Outputs 3
  z() # Errors because there is no function z()
The actual quirk you're talking about has to do with attaching other namespaces. So, the way R works is that by default it loads several packages. For instance, the reason you can call `mean()` is that the "base" package is attached when you open a blank session in R. It is absolutely possible to have multiple symbols take the same label across namespaces. Here's an example:

  mean <- 4 # Creates variable called mean in the local namespace
  mean(c(5, 10, 20)) # Uses the mean function in the attached base namespace
  mean # Uses the mean variable in the local namespace
But this is not unique to variables versus functions, it also works with functions:

  mean <- function(x) { sum(x) * 100 }
  mean(c(1, 2, 3)) # Outputs 600
  base::mean(c(1, 2, 3)) # Outputs 2
  rm(mean) # Drop local namespace function
  mean(c(1, 2, 3)) # Outputs 2
You can use sessionInfo() to see the order of attached namespaces that will be searched. In RStudio, you can also press the little dropdown arrow next to "Global Environment" in the environment pane to see the order of attached namespaces -- it'll search them in that order. Alternatively, you can be defensive and always prefix all functions in any namespace with the namespace name (equivalent to always using module.function style function calls in python).

Finally, you can use the built-in function "find" to see the order in which R will try to resolve a symbol, e.g.

  sum <- function() { print("i like sum coding in R!!!") }
  find("sum") # .GlobalEnv first, base second

I'm not sure this is a quirk. What are your options as a namespaced language? 1) Allow users to import multiple functions/variables with the same name across different namespaces and resolve the conflict via some kind of hierarchical order; 2) Don't allow users to import multiple functions/variables with the same name, so you can never use overloading to monkey patch; 3) Always require namespace prefixes at all times; 4) Make function dispatch a blocking operation that asks REPL users which to use? I dunno, it doesn't seem to me like R's approach is any less sane.

I guess one thing that is different about R's approach is that "built-ins" have no special priority, they're all part of some namespaces that are attached by default but otherwise exactly like third-party libraries (base, stats, graphics, etc.)

In R, the only reserved words that cannot be overloaded are while, repeat, next, if, function, for, and break. (Note: else does not appear here because of a genuinely baffling quirk about how else is implemented in R)


As much as I love R, there are indeed quirks that I think R allows to happen but shouldn't.

For example:

    mean(2, 4) # outputs 2
    mean(c(2, 4)) # outputs 3
I really don't see why R allows you to enter mean(2,4) without giving an error.


Thanks for the explanation. Yes, your second example is something like what I found in R code I was asked to modify.

What I find distasteful is that when calling mean(), the resolution of this name depends NOT on whether the local variable mean has been defined, but whether it has been assigned a function. This is illustrated by your 2nd and 3rd examples.

Of course if you are used to it, it may not catch you by surprise.


Can an alias be an option to resolve conflict: “import mypackage.mean as mymean”?


No aliases (afaik), but convention is that you explicitly call mypackage::mean in cases where names might even hint at being ambiguous.


On a tangent here, but is there a reason to use <- instead of = for assignment?

Actually, to answer my own question (I know, Google), I found this very informative Stack Overflow answer [1]. My TLDR: for ordinary assignment there's no real difference, other than "<-" being more likely to cause carpal tunnel syndrome.

[1] https://stackoverflow.com/questions/1741820/what-are-the-dif...
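
One wrinkle worth knowing, though: inside a function call, `=` names an argument while `<-` actually performs an assignment. A quick sketch:

  mean(y = 1:4)   # error: 'y' is matched as an argument name, nothing is assigned
  mean(y <- 1:4)  # assigns 1:4 to y in the calling environment, then returns 2.5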


> How likely is it that R packages, written by specialists (or grad students) whose main focus is often not programming but some other discipline, in a language full of "quirks," are going to be really reliable?

Do you have the data to back that up, or is it just a question that you don't have an answer to?

I think the statement brushes off the high-quality R packages already in the ecosystem that aren't found anywhere else. The authors of The Elements of Statistical Learning created the glmnet package, which took Python years to even get a port. There are many other packages out there that Python does not have.

I am not going to argue it's the prettiest language.

But there are tons of packages, and many are found nowhere else in other language ecosystems.

If you're going to say "just use REST/RPC" then it just defeats the purpose of your initial argument.

And just look at Springer or, gosh, the other publisher has escaped my mind right now, but these publishers have books on statistical subjects with R packages accompanying them.

It may be ugly but the packages are maintained by experts in the field of statistics, and they may not be programmers. But they do dogfood their packages and use datasets to check the results. Likewise, just read up on the ranger R package paper (https://arxiv.org/pdf/1508.04409.pdf). They test their output against the other random forest package.

And it's silly to point this argument at just R when the same happens with Python. The bootstrap function in scikit-learn for the longest time didn't even really do bootstrap. The linear regression function automatically does shrinkage with no option to turn it off.

No language is perfect. But I believe R has a place, especially in statistics. Many, many wonderful expert statisticians are maintaining and creating R packages (e.g. Dr. Frank E. Harrell Jr.). Many R packages have accompanying papers published here: https://www.jstatsoft.org/index


> functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

The two namespace approach makes it harder to assign 4 to foo and then call it, so I don't understand this comment.

You can assign 4 to foo and try to call it in a language that has one namespace.

  ;; Scheme, one namespace:
  (let ((foo 4))
    (foo)) ;; oops

  ;; Common Lisp, two spaces:

  (let ((foo 4))
    (foo) ;; still calls the foo function,
          ;; not related to or shadowed by the above variable.
    (funcall foo)) ;; oops
The usual valid complaint about two namespaces is that funcall is required all over the place in code that works with higher-order functions, which uglifies the code, and that an operator like (function foo) or its abbreviation #'foo is required to lift references to functions as values instead of just writing foo, which likewise uglifies code.


R to Python convert here (numpy, pandas). Agreed about the general claim of clunkiness, but at least for statistical computing R still wins because of the richness of the long tail of packages (representative example: https://cran.r-project.org/web/packages/poweRlaw/index.html). In my experience, Python equivalents are much less developed and documented, if extant. Many data scientists I know would disagree with this claim, but that's because they tend to stick with things supported by scipy, pymc3, statsmodels, and a few other common libraries.

One solution I have found for small and medium data is to use rpy in Jupyter to let me keep most of my workflow in python, then shuttle stuff to R for exotic tests or to use key packages (ggplot, brms, lme4).


Separate namespaces for function names and data variables is a feature inherited from Lisp which is the language R evolved from. Not really a quirk. Common Lisp and Emacs Lisp also have separate namespaces, while Scheme has a single namespace.

The thing about a lot of R code not being written by programming specialists is a valid point, but then again not many programming specialists are also specialists in statistics so... the alternative is what?


R might be a bad choice to build software but for statistics and data analysis I haven't encountered a better tool. It's also trivial to simply write functions in C/C++/Java/Fortran and call it from R.


Exactly! I use R every day as the head of the data science department at a corporation. Most of my work is medium-sized data analysis projects and nothing can touch R for that level of work.


> functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

The most famous language with this property is Lisp.


Unix shells also compete in this category.

We can have a $foo variable, and a foo command/function.

GNU Make is another one, sort of.

Makefile:

  warning = abc
  $(info abc = $(warning))
  $(warning what?)
Output:

  abc = abc
  Makefile:3: what?
  make: *** No targets.  Stop.
$(warning) is a variable, whereas $(warning args ...) is an operator call.

If a macro is stored in a variable V, it cannot be called as $(V args), but using $(call V args). That's analogous to funcall in Common Lisp.

Looks like the Lisp-2 approach is well represented in the famous language scene. :)


You can't really think of R as a programming language in the sense of how we normally think of programming languages. I see it as more of a computational scripting language. It is great at what it does.


At the same time there are companies in which there's R code running in production to serve ML results in real time. No idea how common it is, but I've worked for such a company. My point is that R isn't just for scripting and interactive use even if that's its best use case.


A plumber API in front of your ML serving function, running in a Docker container. There, simple, fast enough. We have multiple R containers in production, it is working fine and the data scientists are happy that they can keep working in RStudio.


> One quirk that got me the last time I touched R was that functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

Either I'm greatly misunderstanding or this is just plain wrong - assigning something to the same name as an existing function will just overwrite the function.


You're correct provided the overwrite is occurring in the same namespace (e.g. the local environment). He is correct for the wrong reason if you try to alias something from another environment. For example, the following R code may initially seem weird:

   my_vec <- c(1, 2, 3)   # c() is found in the attached base namespace
   c <- 4                 # creates a variable named c in the local environment
   print(c)               # prints 4: the local variable is found first
   my_vec2 <- c(5, 6, 7)  # still works: in call position R skips non-function bindings and finds base::c
But as I describe in my top level comment, it has nothing to do with variables versus functions, just how namespaces work.


If you've gone from SAS to R, R seems very much improved.


> it has a lot of well-tested functions

I tried to do a simple word cloud based on a Twitter search in R. The whole thing was plagued by mysterious failures due, AFAICT, to weird, rare, non-standard UTF characters in the source data that were crashing some of the R libraries I was using to clean the data.

Having done that before in Python, it was shocking to see how fragile the R ecosystem is.


As another example: I tried to get some R code distributed with MPI using a package that claims to do this. If I remember correctly, the package would generate and execute a shell script to spin up subprocesses. Then it would shuttle code (including huge matrices serialized to ASCII) over a socket, to be eval()'d on the other side. The secret codeword "kreab" would terminate the connection, so this could appear nowhere in the code sent over the socket.

This is clownshoes software quality. And it's used for important scientific research.


First off, R may be annoying to learn (for instance, just concatenating two strings takes much more typing than Python, say) but once you get to data analysis it’s a dream. (And certainly more Pythonic than Python’s alternatives, which result in code cluttered with lots of periods and references to packages and sub-packages everywhere). Also, the packages work really well with each other—-with Python it seemed like every time I’d upgrade packages with pip, I’d end up with a new conflict I’d have to sort out. And, frankly, having to rely on Anaconda to keep everything from breaking is not exactly ideal. So props to CRAN!

Gotta say, though, one of the most annoying things about R is the name. Entering “R” as a search term on job boards tends to lead to a lot of not-very-helpful results.

Also, it’d help if the version release names had more order to them. So you’d read some ridiculous phrase like “parachute trombone” and know that it must’ve been released after “mouse parade”.


Is it still the case that when you have an error in your script (and aren't using RStudio) that you don't get the line number of the error, you just get a traceback?


What impact does this have on people _learning_ R?

It's a new major version - does it require significant updates for training materials? Are there a lot of outdated idioms now?


I love R. But working with 50 million rows in RAM is a pain. Any way to do it differently? I use data.table BTW, so stringsAsFactors is effectively always FALSE.


For what purpose? There are tons of ways to do it differently, see here: https://cran.r-project.org/web/views/HighPerformanceComputin...


Buy more ram?

If you're regularly running into memory performance issues despite using data.table, and you don't feel like fiddling with mmap, consider looking into an APL variant.


Will look into mmap. Thx


Please start release announcements by saying briefly what it is! "R is a programming language for statistical analysis" or whatever. Don't assume that people are familiar with every possible piece of software already.

Also pull important stuff to the top of the announcement. Were there security issues that need you to upgrade? What are the major new features that would encourage you to upgrade?

For example here's a release announcement I wrote recently: https://www.redhat.com/archives/libguestfs/2020-February/msg...

Free software doesn't usually have an advertising budget, so you have to educate people on what your software does at every opportunity you get.


This was posted to the r-announce mailing list.

I think it's reasonable to assume that this message was intended for subscribers of that mailing list, and that the people who chose to subscribe already know what R is.


And yet the message appears here, on a general tech news site, and no doubt many other places. Assuming only R-announce readers will see it is plainly wrong.


> For example here's a release announcement I wrote recently: https://www.redhat.com/archives/libguestfs/2020-February/msg....

Request to please start release announcements by saying what a "Network Block Device (NBD) server" really is. Don't assume that people are familiar with every possible piece of software already.


This is a fair point. I think earlier versions did try to explain what NBD is but they were a bit wordy, and just expanding NBD was deemed sufficient. Next time I'll see if I can get that better.

Of course there's a level beyond which you don't really need to go - I wouldn't suggest explaining what Linux is or what software is.


Frankly my reply was just rhetoric, and I don't like the suggestion of explaining the base of things in release notes. If someone is interested enough they would care to look it up. Those who are not interested would skip the notes anyways.

Release notes are usually for people who are already using it.


To be fair, the announcement was posted to the R-announce mailing list, and it seems probable that subscribers would know what R is.


But here we are reading it on a general tech news site.


That's not the fault of the R developers, who wrote a message appropriate for the audience to which they sent it.


And they never imagined that the announcement of a major release of a popular piece of software would go beyond the mailing list?


Should every message be written with all possible audiences in mind?


It's lucky I never said that because obviously that would be stupid. However the release of a major version of a popular piece of software should be expected to go beyond the mailing list - and guess what, it being on the front page of HN all morning proves that exact point.


Looks good


I hope there was an R 2d2 release at some point...


There was, just not from the R team. My project for calling from R into D was originally named rtod2 (playing off D version 2).



