[dupe] Google Dumps MapReduce in Favor of New Hyper-Scale Analytics System (datacenterknowledge.com)
41 points by posharma on June 28, 2014 | 34 comments



>“We don’t really use MapReduce anymore”

This is not even true. They have recently published research that involved using MapReduce in their own systems. Example: http://research.google.com/pubs/pub41376.html


As someone who has one foot in marketing and one foot in development (I am a developer evangelist), here is what's happening.

1. Google's infrastructure is evolving atop MapReduce (FlumeJava) and beyond it (MillWheel).

2. Google's PR decided to call it "not using MapReduce anymore" because in marketing, "beyond <current fad>" sounds really cool.

3. The rest of the world's PR/press/marketing fall for Google's clever PR.

Either way, it is great to see Google making its core technology accessible as part of its PaaS =)


Hyperscale, man! Cyberleetscale would have been better, IMO.

Thanks for the clarifications :)


> 3. The rest of the world's PR/press/marketing fall for Google's clever PR.

If "hyper-scale Analytics" counts as "clever PR," then eating paste counts as "clever food choice." It's not worth my time to decode the BS.


It would be even better if they open-sourced it.


I know what I'm supposed to do when my taxi driver starts recommending stocks to me...

But what do I do when people start describing themselves as "developer evangelists"?

That has to be some kind of sell signal, right?


The title has existed for years now at various companies (most with an open-source program have one), so whatever you were supposed to do, you're too late now.


Looks like someone thinks you're asking too many questions. :/


There's a grain of truth to what he's saying. We don't write many new actual C++/Java MapReduce jobs. It's mostly Flume, although there is some momentum behind Go MapReduce.


It's fundamentally the same thing as MapReduce, isn't it? Can someone explain the differences to me, please? There isn't much of use in the article.


You'll probably want to read the FlumeJava paper. http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/F...

Citation: http://dl.acm.org/citation.cfm?id=1806638

The key word is pipeline. If your analysis runs in several stages, you take the output of one stage and connect it to the next. Raw MapReduce isn't going to help you very much with that chaining.

What the paper describes is a nice way to do the chaining: the system takes care of writing the raw MapReduces for you, and it also does a lot of work on the interconnections between your stages.
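To make the chaining concrete, here's a toy Java sketch. To be clear, this is not FlumeJava's actual API -- the stage() helper and everything else here is invented for illustration. The point is just that each stage is a dataset-to-dataset function and you compose them up front, which is what gives a planner room to fuse stages into a minimal set of real MapReduces.

    // Toy pipeline sketch -- NOT FlumeJava's real API; stage() is invented.
    import java.util.Arrays;
    import java.util.List;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class PipelineSketch {
        // A stage lifts a per-record function into a dataset-to-dataset function.
        static <A, B> Function<List<A>, List<B>> stage(Function<A, B> f) {
            return records -> records.stream().map(f).collect(Collectors.toList());
        }

        public static void main(String[] args) {
            // Compose stages up front; with raw MapReduce you'd run job 1,
            // write its output to disk, and point job 2 at those files by hand.
            Function<List<String>, List<String>> normalize = stage(String::toLowerCase);
            Function<List<String>, List<Integer>> pipeline =
                    normalize.andThen(stage(String::length));

            System.out.println(pipeline.apply(Arrays.asList("Map", "Reduce", "Flume")));
            // => [3, 6, 5]
        }
    }

In the real system the composition is deferred and an optimizer decides how many actual MapReduces to emit; this only shows the shape of the interface.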


MapReduce wasn't designed for iterative algorithms or streaming data, whereas Google Dataflow and Spark (http://spark.apache.org/) make iterative algorithms easy. It's a much simpler programming paradigm, and it allows you to do iterative graph-processing and machine-learning algos (http://spark.apache.org/mllib/) that are impractical on MapReduce.

For example, Spark provides the primitives needed to build GraphX (http://amplab.github.io/graphx/, http://spark.apache.org/graphx/), which is essentially GraphLab on Spark.
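To give a feel for why iteration is cheap in this model, here's a minimal Spark sketch in the 1.x-era Java API (the data and step size are made up; it's a toy gradient descent that converges to the data mean). The cached RDD is reused on every pass instead of being re-read from and re-written to disk the way chained MapReduce jobs would require.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class IterativeSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // cache() keeps the dataset in memory across iterations -- the
            // crucial difference from rerunning a MapReduce job per pass.
            JavaRDD<Double> data = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0)).cache();
            long n = data.count();

            double estimate = 0.0;
            for (int i = 0; i < 20; i++) {
                final double current = estimate; // lambdas capture effectively-final locals
                double gradient = data.map(x -> x - current).reduce((a, b) -> a + b) / n;
                estimate = current + 0.5 * gradient;
            }
            System.out.println("estimate (should approach 2.5): " + estimate);
            sc.stop();
        }
    }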


This has "cloud" prefixed to name of every component. So, obviously, is better. Also, they're selling it. So, ya know, marketing trumps engineering.


>Google Replaces MapReduce With New Hyper-Scale Cloud Analytics System

Buzzword... overload!


Hey sometimes you just have to do whatever it takes to synergize global channels on virtual platforms. You know, really aggregate extensible markets with repurposed leading-edge metrics.


But how will these leading-edge metrics enable us to deliver paradigm shifting solutions to our customers while simultaneously reducing costs and increasing operational efficiency?


While making the world a better place!


That's the best part - it's automated with a context-sensitive customizability infrastructure. That means you can expand any targeted benchmark in a demand-driven mesh.


Is this basically like Apache Spark in its programming model?


Yes, it's like Spark (http://spark.apache.org/) and Spark Streaming (http://spark.apache.org/streaming/) combined (a small streaming sketch follows the links below).

Here are the relevant papers...

* FlumeJava (iterative, data-parallel pipelines like Spark): http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/F...

* MillWheel (fault-tolerant stream processing like Spark Streaming): http://research.google.com/pubs/pub41378.html

Pointers to the I/O blog posts...

* "Reimagining developer productivity and data analytics in the cloud" http://googlecloudplatform.blogspot.com/2014/06/reimagining-...

* "Sneak peek: Google Cloud Dataflow, a Cloud-native data processing service" http://googlecloudplatform.blogspot.com/2014/06/sneak-peek-g...

The Dataflow-specific talks at Google I/O 2014...

* Big data, the Cloud way: Accelerated and simplified https://www.youtube.com/watch?v=Y0Z58YQSXv0

* The dawn of "Fast Data" https://www.youtube.com/watch?v=TnLiEWglqHk

* Predicting the future with the Google Cloud Platform https://www.youtube.com/watch?v=YyvvxFeADh8

* Keynote (starts at Urs Hölzle's segment on Google Cloud) https://www.youtube.com/watch?v=wtLJPvx7-ys#t=6932
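As promised above, here's what the Spark Streaming half looks like, as a hedged sketch in the 1.x-era Java API (the host and port are placeholders; this is not Dataflow code). It's a word count over 1-second micro-batches, using the same map/reduce vocabulary as batch Spark:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("streaming-sketch");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));

            // Placeholder source: text pushed to localhost:9999 (e.g. via netcat).
            JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

            JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
            JavaPairDStream<String, Integer> counts = words
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKey((a, b) -> a + b);

            counts.print(); // emit each micro-batch's counts
            jssc.start();
            jssc.awaitTermination();
        }
    }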


Cool. Does this mean Google is moving toward languages that allow for easier use and serialization of closures than C++ and Java? (For example, Spark uses Scala natively.)
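For anyone unfamiliar with the pain point: Java 8 lambdas aren't serializable by default, so shipping one to a worker takes an extra step. A small sketch of the standard workaround, plain Java, nothing Google- or Spark-specific:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.function.Function;

    public class ClosureSketch {
        public static void main(String[] args) throws IOException {
            int shift = 3; // captured (effectively final) state

            // A plain lambda is not Serializable and can't be shipped anywhere.
            Function<Integer, Integer> plain = x -> x + shift;

            // An intersection-type cast makes the compiler emit a serializable lambda.
            Function<Integer, Integer> shippable =
                    (Function<Integer, Integer> & Serializable) (x -> x + shift);

            try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
                out.writeObject(shippable);  // fine
                // out.writeObject(plain);   // would throw NotSerializableException
            }
            System.out.println(shippable.apply(4)); // 7
        }
    }

Scala closures compile to serializable classes out of the box, which is part of why Spark feels more natural there.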


Dataflow is language agnostic. The Java API is being released first, and more languages will follow.


The main takeaway from this article is that the author, Yevgeniy Sverdlik, has demonstrably never worked with distributed computing systems.

The rest is buzzwords propping up sweeping ridiculous conclusions.


"said it got too cumbersome once the size of the data reached a few petabytes."

I don't think there are a lot of companies whose data gets that huge. Does anyone have any idea how large a typical data-warehousing database is?


What do you mean, warehousing? Like, item tracking inside an actual warehouse? Hard to imagine spending more than a kB per unique item -- more per SKU, but less per individual object -- so even if you have 1M items being tracked, the total size would only be a gigabyte. Even if you had a billion unique things in your store, the resulting database would still fit on a single flash drive.
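The same back-of-envelope in code form (the 1 kB-per-item figure is the guess above, not data):

    // Toy arithmetic for the estimate above; 1 kB/item is an assumption.
    public class WarehouseMath {
        public static void main(String[] args) {
            long bytesPerItem = 1_000L; // ~1 kB per tracked item (guess)
            long million = 1_000_000L;
            long billion = 1_000_000_000L;
            System.out.println(million * bytesPerItem / 1e9 + " GB for 1M items");  // 1.0 GB
            System.out.println(billion * bytesPerItem / 1e12 + " TB for 1B items"); // 1.0 TB
        }
    }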


Given the context, it seems like "warehouse" in parent's comment was more specifically "data warehouse": http://en.m.wikipedia.org/wiki/Data_warehouse

While I couldn't quickly find anything that speaks to any kind of average or normal size of a data warehouse, this article mentions Facebook's being around 300PB: https://code.facebook.com/posts/229861827208629/scaling-the-...


It's about what they need, not what other companies need.


For those that missed it, similar discussion on this a couple of days ago: https://news.ycombinator.com/item?id=7947782


Yes, it's the same story, so we'll call this thread a dupe.


Is it just me, or is using 'cloud' in their product names just jumping the shark?

Seriously.


This sounds more like Google App Engine's shot at Amazon Kinesis than anything else.


Sounds a bit like Dryad?


Yay, marketing, yay.


More snake oil? :-)



