[dupe] Google Dumps MapReduce in Favor of New Hyper-Scale Analytics System (datacenterknowledge.com)
41 points by posharma on June 28, 2014 | 34 comments



>“We don’t really use MapReduce anymore”

This is not even true. They have recently published research that involved using MapReduce in their own systems. Example: http://research.google.com/pubs/pub41376.html


As someone who has one foot in marketing and one foot in development (I am a developer evangelist), here is what's happening.

1. Google's infrastructure is evolving atop MapReduce (FlumeJava) and beyond it (MillWheel).

2. Google's PR decided to call it "not using MapReduce anymore" because in marketing, "beyond <current fad>" sounds really cool.

3. The rest of the world's PR/press/marketing fall for Google's clever PR.

Either way, it is great to see Google making its core technology accessible as part of its PaaS =)


Hyperscale, man! Cyberleetscale would have been better, IMO.

Thanks for the clarifications :)


> 3. The rest of the world's PR/press/marketing fall for Google's clever PR.

If "hyper-scale Analytics" counts as "clever PR," then eating paste counts as "clever food choice." It's not worth my time to decode the BS.


It would be even better if they open-sourced it.


I know what I'm supposed to do when my taxi driver starts recommending stocks to me...

But what do I do when people start describing themselves as "developer evangelists"?

That has to be some kind of sell signal, right?


The title has existed for years now at various companies (most with an open-source program have one), so whatever you were supposed to do, you're too late now.


Looks like someone thinks you're asking too many questions. :/


There's a grain of truth to what he's saying. We don't write many new actual C++/Java MapReduce jobs. It's mostly Flume, although there is some momentum behind Go MapReduce.


It's fundamentally the same thing as MapReduce, isn't it? Can someone explain the differences to me, please? There isn't much of use in the article.


You'll probably want to read the FlumeJava paper. http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/F...

Citation: http://dl.acm.org/citation.cfm?id=1806638

The key word is pipeline. If your analysis runs in several stages, you take the output of one stage and connect it to the next. Raw MapReduce isn't going to help you very much with that chaining.

What the paper describes is a nice way to do the chaining: the system takes care of writing the raw MapReduces for you, and it also does a lot of work on the interconnections between your stages.
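To make the chaining concrete, here's a toy Java sketch. To be clear, this is not FlumeJava's actual API -- the stage() helper and everything else here is invented for illustration. The point is just that each stage is a dataset-to-dataset function and you compose them up front, which is what gives a planner room to fuse stages into a minimal set of real MapReduces.

    // Toy pipeline sketch -- NOT FlumeJava's real API; stage() is invented.
    import java.util.Arrays;
    import java.util.List;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class PipelineSketch {
        // A stage lifts a per-record function into a dataset-to-dataset function.
        static <A, B> Function<List<A>, List<B>> stage(Function<A, B> f) {
            return records -> records.stream().map(f).collect(Collectors.toList());
        }

        public static void main(String[] args) {
            // Compose stages up front; with raw MapReduce you'd run job 1,
            // write its output to disk, and point job 2 at those files by hand.
            Function<List<String>, List<String>> normalize = stage(String::toLowerCase);
            Function<List<String>, List<Integer>> pipeline =
                    normalize.andThen(stage(String::length));

            System.out.println(pipeline.apply(Arrays.asList("Map", "Reduce", "Flume")));
            // => [3, 6, 5]
        }
    }

In the real system the composition is deferred and an optimizer decides how many actual MapReduces to emit; this only shows the shape of the interface.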


MapReduce wasn't designed for iterative algorithms or streaming data, whereas Google Dataflow and Spark (http://spark.apache.org/) make iterative algorithms easy. It's a much simpler programming paradigm, and it allows you to do iterative graph-processing and machine-learning algos (http://spark.apache.org/mllib/) that are impractical on MapReduce.

For example, Spark provides the primitives needed to build GraphX (http://amplab.github.io/graphx/, http://spark.apache.org/graphx/), which is essentially GraphLab on Spark.
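To give a feel for why iteration is cheap in this model, here's a minimal Spark sketch in the 1.x-era Java API (the data and step size are made up; it's a toy gradient descent that converges to the data mean). The cached RDD is reused on every pass instead of being re-read from and re-written to disk the way chained MapReduce jobs would require.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class IterativeSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // cache() keeps the dataset in memory across iterations -- the
            // crucial difference from rerunning a MapReduce job per pass.
            JavaRDD<Double> data = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0)).cache();
            long n = data.count();

            double estimate = 0.0;
            for (int i = 0; i < 20; i++) {
                final double current = estimate; // lambdas capture effectively-final locals
                double gradient = data.map(x -> x - current).reduce((a, b) -> a + b) / n;
                estimate = current + 0.5 * gradient;
            }
            System.out.println("estimate (should approach 2.5): " + estimate);
            sc.stop();
        }
    }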


This has "cloud" prefixed to name of every component. So, obviously, is better. Also, they're selling it. So, ya know, marketing trumps engineering.


>Google Replaces MapReduce With New Hyper-Scale Cloud Analytics System

Buzzword... overload!


Hey sometimes you just have to do whatever it takes to synergize global channels on virtual platforms. You know, really aggregate extensible markets with repurposed leading-edge metrics.


But how will these leading-edge metrics enable us to deliver paradigm shifting solutions to our customers while simultaneously reducing costs and increasing operational efficiency?


While making the world a better place!


That's the best part - it's automated with a context-sensitive customizability infrastructure. That means you can expand any targeted benchmark in a demand-driven mesh.


Is this basically like Apache Spark in its programming model?


Yes, it's like Spark (http://spark.apache.org/) and Spark Streaming (http://spark.apache.org/streaming/) combined (a small streaming sketch follows the links below).

Here are the relevant papers...

* FlumeJava (iterative, data-parallel pipelines like Spark): http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/F...

* MillWheel (fault-tolerant stream processing like Spark Streaming): http://research.google.com/pubs/pub41378.html

Pointers to the I/O blog posts...

* "Reimagining developer productivity and data analytics in the cloud" http://googlecloudplatform.blogspot.com/2014/06/reimagining-...

* "Sneak peek: Google Cloud Dataflow, a Cloud-native data processing service" http://googlecloudplatform.blogspot.com/2014/06/sneak-peek-g...

The Dataflow-specific talks at Google I/O 2014...

* Big data, the Cloud way: Accelerated and simplified https://www.youtube.com/watch?v=Y0Z58YQSXv0

* The dawn of "Fast Data" https://www.youtube.com/watch?v=TnLiEWglqHk

* Predicting the future with the Google Cloud Platform https://www.youtube.com/watch?v=YyvvxFeADh8

* Keynote (starts at Urs Hölzle's segment on Google Cloud) https://www.youtube.com/watch?v=wtLJPvx7-ys#t=6932
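As promised above, here's what the Spark Streaming half looks like, as a hedged sketch in the 1.x-era Java API (the host and port are placeholders; this is not Dataflow code). It's a word count over 1-second micro-batches, using the same map/reduce vocabulary as batch Spark:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("streaming-sketch");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));

            // Placeholder source: text pushed to localhost:9999 (e.g. via netcat).
            JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

            JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
            JavaPairDStream<String, Integer> counts = words
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKey((a, b) -> a + b);

            counts.print(); // emit each micro-batch's counts
            jssc.start();
            jssc.awaitTermination();
        }
    }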


Cool. Does this mean Google is moving toward languages that allow for easier use and serialization of closures than C++ and Java? (For example, Spark uses Scala natively.)
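For anyone unfamiliar with the pain point: Java 8 lambdas aren't serializable by default, so shipping one to a worker takes an extra step. A small sketch of the standard workaround, plain Java, nothing Google- or Spark-specific:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.function.Function;

    public class ClosureSketch {
        public static void main(String[] args) throws IOException {
            int shift = 3; // captured (effectively final) state

            // A plain lambda is not Serializable and can't be shipped anywhere.
            Function<Integer, Integer> plain = x -> x + shift;

            // An intersection-type cast makes the compiler emit a serializable lambda.
            Function<Integer, Integer> shippable =
                    (Function<Integer, Integer> & Serializable) (x -> x + shift);

            try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
                out.writeObject(shippable);  // fine
                // out.writeObject(plain);   // would throw NotSerializableException
            }
            System.out.println(shippable.apply(4)); // 7
        }
    }

Scala closures compile to serializable classes out of the box, which is part of why Spark feels more natural there.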


Dataflow is language agnostic. The Java API is being released first, and more languages will follow.


The main takeaway from this article is that the author, Yevgeniy Sverdlik, has demonstrably never worked with distributed computing systems.

The rest is buzzwords propping up sweeping ridiculous conclusions.


"said it got too cumbersome once the size of the data reached a few petabytes."

I don't think there are a lot of companies whose data gets that huge. Does anyone have any idea how large a typical data-warehousing database is?


What do you mean, warehousing? Like, item tracking inside an actual warehouse? Hard to imagine spending more than a kB per unique item -- more per SKU, but less per individual object -- so even if you have 1M items being tracked, the total size would only be a gigabyte. Even if you had a billion unique things in your store, the resulting database would still fit on a single flash drive.
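The same back-of-envelope in code form (the 1 kB-per-item figure is the guess above, not data):

    // Toy arithmetic for the estimate above; 1 kB/item is an assumption.
    public class WarehouseMath {
        public static void main(String[] args) {
            long bytesPerItem = 1_000L; // ~1 kB per tracked item (guess)
            long million = 1_000_000L;
            long billion = 1_000_000_000L;
            System.out.println(million * bytesPerItem / 1e9 + " GB for 1M items");  // 1.0 GB
            System.out.println(billion * bytesPerItem / 1e12 + " TB for 1B items"); // 1.0 TB
        }
    }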


Given the context, it seems like "warehouse" in parent's comment was more specifically "data warehouse": http://en.m.wikipedia.org/wiki/Data_warehouse

While I couldn't quickly find anything that speaks to any kind of average or normal size of a data warehouse, this article mentions Facebook's being around 300PB: https://code.facebook.com/posts/229861827208629/scaling-the-...


It's about what they need, not what other companies need.


For those that missed it, similar discussion on this a couple of days ago: https://news.ycombinator.com/item?id=7947782


Yes, it's the same story, so we'll call this thread a dupe.


Is it just me, or is using 'cloud' in their product names just jumping the shark?

Seriously.


This sounds more like Google App Engine's shot at Amazon Kinesis than anything else.


Sounds a bit like Dryad?


Yay, marketing, yay.


More snake oil? :-)



