I would be quite interested to see how Dask, Spark, and Dato's (formerly GraphLab's) SFrames fare relative to each other. My hunch is that none dominates the others outright in performance and that each has its own sweet spot. Are there any well-done, executed-in-good-faith benchmarks available?
Does anybody know why these improvements could not be integrated into DataFrames directly?
Why add another very similar layer of abstraction?
They mention:
"Unification of DataFrames with Datasets – due to compatibility guarantees, DataFrames and Datasets currently cannot share a common parent class. With Spark 2.0, we will be able to unify these abstractions with minor changes to the API, making it easy to build libraries that work with both."
Anybody know what those "compatibility guarantees" are?
The blog is referring to source- and binary-API compatibility guarantees that Spark provides.
IIRC, in order for DataFrames and Datasets to share a common parent class, the return types of certain shared methods would need to change in a binary-incompatible way, which would break DataFrame code compiled against older Spark versions.
Due to Spark's API compatibility and versioning policies, this kind of compatibility break can only occur in a major release, so it can happen in Spark 2.0.0 but not in Spark 1.6.
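My guess (not confirmed by the blog) is that the unification would take the shape of turning DataFrame into a type alias, type DataFrame = Dataset[Row], so both abstractions end up sharing Dataset as the single class. A minimal sketch of what that would look like to user code, assuming Spark 2.0's SparkSession entry point:

    import org.apache.spark.sql.{Dataset, Row, SparkSession}

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder().master("local[*]").appName("unify-sketch").getOrCreate()
    import spark.implicits._

    val df: Dataset[Row] = Seq(("a", 1)).toDF("name", "n")  // untyped DataFrame
    val ds: Dataset[Person] = Seq(Person("a", 1L)).toDS()   // typed Dataset

    // One function can now accept either, since both are Datasets:
    def numRows(d: Dataset[_]): Long = d.count()
    println(numRows(df) + numRows(ds))

Under the 1.x binary-compatibility guarantees this change couldn't ship, because DataFrame methods that currently return DataFrame would have to return Dataset[Row] instead.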
Really looking forward to this dropping! Writing complex data out of Spark jobs is super easy, but getting it back into strongly typed Scala has been a big pain point.
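For what it's worth, the Dataset API should make that round trip fairly painless. A minimal sketch, assuming Spark 2.0's SparkSession; the Event/Payload case classes and the /tmp/events path are hypothetical:

    import org.apache.spark.sql.SparkSession

    // Hypothetical schema; any case class matching the stored data works.
    case class Payload(page: String, referrer: String)
    case class Event(id: Long, kind: String, payload: Payload)

    val spark = SparkSession.builder().master("local[*]").appName("typed-roundtrip").getOrCreate()
    import spark.implicits._  // brings in the encoders needed by .as[Event]

    // Writing complex (nested) data out is the easy part...
    Seq(Event(1L, "click", Payload("/", "home"))).toDS()
      .write.mode("overwrite").parquet("/tmp/events")

    // ...and .as[Event] brings it straight back into typed Scala:
    val events = spark.read.parquet("/tmp/events").as[Event]
    events.filter(_.kind == "click").show()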