
The explanation I gave to my mom:

We want to count all the books in the library. You count up shelf #1, I count up shelf #2. That's map. The more people we get, the faster it goes.

Now we get together and add our individual counts. That's reduce.
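
A rough sketch of the same idea in Python (shelf contents are made up, and the "workers" are just a list comprehension rather than real machines):

    # Map: each worker counts the books on its own shelf, independently.
    shelves = [
        ["Dune", "Hyperion", "Foundation"],  # shelf #1
        ["SICP", "TAOCP"],                   # shelf #2
    ]
    counts = [len(shelf) for shelf in shelves]  # -> [3, 2]

    # Reduce: combine the per-shelf counts into one total.
    total = sum(counts)  # -> 5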

[edit: created a blog post containing this explanation. Maybe it will be useful after this HN conversation vanishes. http://crazybear.posterous.com/mapreduce-explained-in-41-wor... ]




That's the first iteration of the explanation.

Next iteration: But what if you want to count up non-fiction and fiction separately? The "reduce" step is just slightly different, because people now have to add up two totals instead of one. You could also break it down by subject, or author, or pretty much anything else.
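
In Python, the per-genre version might look something like this (a sketch with made-up shelf data; the map step tallies each shelf, and the reduce step merges the tallies key by key):

    from collections import Counter

    shelves = [
        [("fiction", "Dune"), ("nonfiction", "SICP")],         # shelf #1
        [("fiction", "Hyperion"), ("fiction", "Foundation")],  # shelf #2
    ]

    # Map: each worker tallies its own shelf by genre.
    partials = [Counter(genre for genre, title in shelf) for shelf in shelves]

    # Reduce: merge the per-shelf tallies, adding counts genre by genre.
    totals = sum(partials, Counter())
    # -> Counter({'fiction': 3, 'nonfiction': 1})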

Next iteration: Of course, you don't have to just add. You can do anything that has mathematical properties similar to addition (my mom teaches high-school math). You could collect the set of all the offensive words found, since addition and set union have similar mathematical properties. You couldn't take the average number of pages, because some people will find more books than others, and you wouldn't get a fair average. But you might be able to keep track of a couple of attributes that can be summed up, and use those to calculate the average after the job is done (for example, total number of pages and total number of books).
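
The averaging trick, sketched in Python (page counts invented): each worker emits a (total_pages, book_count) pair, both components of which are plain sums, so the reduce stays associative; the division happens only once, after the job is done.

    # Map: each worker turns its shelf into a summable (pages, books) pair.
    shelves = [[300, 150, 450], [200, 100]]  # page counts per shelf
    pairs = [(sum(pages), len(pages)) for pages in shelves]  # [(900, 3), (300, 2)]

    # Reduce: add the pairs component-wise.
    total_pages = sum(p for p, b in pairs)  # 1200
    total_books = sum(b for p, b in pairs)  # 5

    # Averaging each shelf and then averaging the averages would weight
    # shelves unfairly; dividing the two grand totals does not.
    average = total_pages / total_books  # 240.0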


Now if only everyone's mom understood the concept of a linear set operation... :)

A good example that would also help someone appreciate the power of map/reduce might be something like "how many times does each letter of the alphabet appear in the library?" For the map step I take shelf 1 and you take shelf 2, but now for each book we'll have to write down something like "a:300,b:50,c:90,..." and so on. Now there are more things to "reduce" than there are books (since we have a count for each letter of the alphabet for every book). Well, there's nothing stopping us from splitting up the reduce step as well: we could split it up by shelf again (you add the totals for each letter on your shelf, I'll do mine, and then we'll do one final reduce at the end), or you can add up letters a-m and I'll do n-z.
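
Something like this, as a Python sketch (book "texts" are invented single words): the map step produces one letter tally per shelf, and the reduce step is itself split by key, so each reducer owns a disjoint slice of the alphabet.

    from collections import Counter

    shelves = [["dune", "hyperion"], ["sicp", "taocp"]]  # book texts, made up

    # Map: one letter-frequency Counter per shelf.
    partials = [Counter("".join(shelf)) for shelf in shelves]

    # Reduce, split by key: one reducer sums letters a-m, the other n-z.
    a_to_m, n_to_z = Counter(), Counter()
    for partial in partials:
        for letter, n in partial.items():
            (a_to_m if letter <= "m" else n_to_z)[letter] += n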


I remember a Stanford lecture video that said "well, if we called it Map, Sort, Group-reduce then no-one would use it :P". That's a minor naming issue, but it captures exactly the subtlety you're describing: the specifics of how a MapReduce is executed aren't entirely arbitrary.


Why wouldn't you be able to get the average? Return the book count plus the total pages for each shelf and then take the ratio of the "reduced" sums.


Re-read the comment. It discusses exactly what you are proposing.


Oh, I see. I was struck by the direct phrasing of "You couldn't take the average number of pages", which made me parse the rest of the comment incorrectly.


I have used a similar explanation using fields of wheat, combine harvesters and grain trucks.

The combine harvesters are performing a Map - they do the heavy work and actually modify the contents of the fields. The more combines you have the quicker the job gets done.

The reduce step is the trucks that collect the grain and combine it into one pile (back at the silo).

I like this analogy because it shows that some natural jobs are linearly divisible, needing no synchronisation between workers, while others are more complicated because results have to be combined.

Each combine has no need to synchronise with the others - it can be told which fields to harvest and be left to do it.

During the grain collection you can have many trucks, but it soon becomes clear that there are loads of ways to organise them depending on the size of the trucks and fields and the distance to the depot. Trucks could combine loads in the fields before returning to the grain silo. There is always a danger that one truck might be left waiting for another.

This analogy also lets us discuss whether we need to wait for all the grain to be harvested before sending out the trucks to gather it in, and whether we need the same number of trucks as combine harvesters.


MapReduce in ten seconds or less. Impressive.


I love clear, succinct explanations like this. Overly complex ("rigorous") definitions are one reason we often miss the big picture. Kudos.


The example in the blog was good enough and made me understand it quite well, but you nailed it in two lines. Very nice man, thanks.


I'm stunned.

I finally get MapReduce.

I'd looked at definitions but it had never really clicked.

Thank you, and upvoted.


You also could have gone to the card catalog and measured the number of cards in feet with a tape measure. That's indexing your database.


The example I came up with when trying to explain the concept was pretty similar, but I framed it as the problem of finding the total value of a huge pile of coins.


"You count up shelf #1, I count up shelf #2."

Isn't that a reduce too, though?


Sure, any iterative process is implementable in terms of reduce, so the mapping itself is a reduction, and each counting instance within the map is a reduction as well (unless you map over books rather than shelves, though even that can be described as a reduction over a single-item set).
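
For instance, a Python sketch of the "map is itself a reduction" point (just an illustration, not anything from a real framework):

    from functools import reduce

    def map_via_reduce(f, xs):
        # Fold over xs, appending f(x) to an accumulator list at each step.
        return reduce(lambda acc, x: acc + [f(x)], xs, [])

    map_via_reduce(len, [["Dune", "Hyperion"], ["SICP"]])  # -> [2, 1]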


I think so, too, but it doesn't matter. The function you apply to each element of your list when doing a map can be anything, so it may even be something like counting up.


In my analogy, I was assuming a shelf is a single record.


This is even better.


Impressive.


I still don't get it. It would have been clearer as something like: "we have a lot of processing to handle; we share it among computers to get the job done."


Nice. Much better.


I wish I could upvote you more...



