I would be quite interested to see how Dask, Spark, and Dato's (formerly GraphLab's) SFrames fare relative to each other. My hunch is that none dominates the others outright in performance and that each has its own sweet spot. Are there any well-done, executed-in-good-faith benchmarks available?
Does anybody know why these improvements could not be integrated into DataFrames directly?
Why add another very similar layer of abstraction?
They mention:
"Unification of DataFrames with Datasets – due to compatibility guarantees, DataFrames and Datasets currently cannot share a common parent class. With Spark 2.0, we will be able to unify these abstractions with minor changes to the API, making it easy to build libraries that work with both."
Anybody know what those "compatibility guarantees" are?
The blog is referring to source- and binary-API compatibility guarantees that Spark provides.
IIRC, in order for DataFrames and Datasets to share a common parent class, the return types of certain shared methods would need to change in a binary-incompatible way, which would break DataFrame code compiled against older Spark versions.
Due to Spark's API compatibility and versioning policies, this kind of compatibility break can only occur in a major release, so it can happen in Spark 2.0.0 but not in Spark 1.6.
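My guess (not confirmed by the blog) is that the unification would take the shape of turning DataFrame into a type alias, type DataFrame = Dataset[Row], so both abstractions end up sharing Dataset as the single class. A minimal sketch of what that would look like to user code, assuming Spark 2.0's SparkSession entry point:

    import org.apache.spark.sql.{Dataset, Row, SparkSession}

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder().master("local[*]").appName("unify-sketch").getOrCreate()
    import spark.implicits._

    val df: Dataset[Row] = Seq(("a", 1)).toDF("name", "n")  // untyped DataFrame
    val ds: Dataset[Person] = Seq(Person("a", 1L)).toDS()   // typed Dataset

    // One function can now accept either, since both are Datasets:
    def numRows(d: Dataset[_]): Long = d.count()
    println(numRows(df) + numRows(ds))

Under the 1.x binary-compatibility guarantees this change couldn't ship, because DataFrame methods that currently return DataFrame would have to return Dataset[Row] instead.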
Really looking forward to this dropping! Writing complex data out of Spark jobs is super easy, but getting it back into strongly typed Scala has been a big pain point.
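For what it's worth, the Dataset API should make that round trip fairly painless. A minimal sketch, assuming Spark 2.0's SparkSession; the Event/Payload case classes and the /tmp/events path are hypothetical:

    import org.apache.spark.sql.SparkSession

    // Hypothetical schema; any case class matching the stored data works.
    case class Payload(page: String, referrer: String)
    case class Event(id: Long, kind: String, payload: Payload)

    val spark = SparkSession.builder().master("local[*]").appName("typed-roundtrip").getOrCreate()
    import spark.implicits._  // brings in the encoders needed by .as[Event]

    // Writing complex (nested) data out is the easy part...
    Seq(Event(1L, "click", Payload("/", "home"))).toDS()
      .write.mode("overwrite").parquet("/tmp/events")

    // ...and .as[Event] brings it straight back into typed Scala:
    val events = spark.read.parquet("/tmp/events").as[Event]
    events.filter(_.kind == "click").show()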