Hacker News new | past | comments | ask | show | jobs | submit login
Introducing Spark Datasets (databricks.com)
78 points by jonbaer on Jan 11, 2016 | hide | past | favorite | 5 comments



I would be quite interested to see how Dask, Spark and Dato's(or Graphlab's) SFrames fare relative to each other. My hunch is none dominates the other totally in performance and that each would have their sweet spot. Are there any well done, executed_in_good_faith benchmarks available.


Does somebody know why these improvements could not be integrated into DataFrames directly?

Why add another very similar layer of abstraction?

They mention:

"Unification of DataFrames with Datasets – due to compatibility guarantees, DataFrames and Datasets currently cannot share a common parent class. With Spark 2.0, we will be able to unify these abstractions with minor changes to the API, making it easy to build libraries that work with both."

Anybody know what those "compatibility guarantees" are?


Dataframes aren't strongly typed (while RDDs are).

It's difficult to add strong typing to an existing API.

I certainly prefer using DataFrames generally, but sometimes you need to switch to RDDs because some operations are easier.

Hopefully Datasets bridge the gap well. I haven't used them yet, but the examples look nice.


The blog is referring to source- and binary-API compatibility guarantees that Spark provides.

IIRC, in order for DataFrames and Datasets to share a common parent class certain common methods' return types would need to change in a binary-incompatible way that would break DataFrame code which was compiled against older Spark versions.

Due to Spark's API compatibility and versioning policies, this type of compatibility break can only occur in major releases, so it can only be done in Spark 2.0.0 and not Spark 1.6.


Really looking forward to this dropping! Writing complex data out of spark jobs is super easy, but getting it back into strongly-typed-Scala has been a big pain point.

The experimental release is a bit broken though, encountered a pretty big bug: https://issues.apache.org/jira/browse/SPARK-12714




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: