
The explanation I gave to my mom:

We want to count all the books in the library. You count up shelf #1, I count up shelf #2. That's map. The more people we get, the faster it goes.

Now we get together and add our individual counts. That's reduce.
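
A rough sketch of the same idea in Python (shelf contents are made up, and the "workers" are just a list comprehension rather than real machines):

    # Map: each worker counts the books on its own shelf, independently.
    shelves = [
        ["Dune", "Hyperion", "Foundation"],  # shelf #1
        ["SICP", "TAOCP"],                   # shelf #2
    ]
    counts = [len(shelf) for shelf in shelves]  # -> [3, 2]

    # Reduce: combine the per-shelf counts into one total.
    total = sum(counts)  # -> 5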

[edit: created a blog post containing this explanation. Maybe it will be useful after this HN conversation vanishes. http://crazybear.posterous.com/mapreduce-explained-in-41-wor... ]




That's the first iteration of the explanation.

Next iteration: But what if you want to count up non-fiction and fiction separately? The "reduce" step is just slightly different, because people now have to add up two totals instead of one. You could also break it down by subject, or author, or pretty much anything else.
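
In Python, the per-genre version might look something like this (a sketch with made-up shelf data; the map step tallies each shelf, and the reduce step merges the tallies key by key):

    from collections import Counter

    shelves = [
        [("fiction", "Dune"), ("nonfiction", "SICP")],         # shelf #1
        [("fiction", "Hyperion"), ("fiction", "Foundation")],  # shelf #2
    ]

    # Map: each worker tallies its own shelf by genre.
    partials = [Counter(genre for genre, title in shelf) for shelf in shelves]

    # Reduce: merge the per-shelf tallies, adding counts genre by genre.
    totals = sum(partials, Counter())
    # -> Counter({'fiction': 3, 'nonfiction': 1})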

Next iteration: Of course, you don't have to just add. You can do anything that has mathematical properties similar to addition (my mom teaches high-school math). You could collect the set of all the offensive words found, since addition and set union have similar mathematical properties. You couldn't take the average number of pages, because some people will find more books than others, and you wouldn't get a fair average. But you might be able to keep track of a couple of attributes that can be summed up, and use those to calculate the average after the job is done (for example, total number of pages and total number of books).
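
The averaging trick, sketched in Python (page counts invented): each worker emits a (total_pages, book_count) pair, both components of which are plain sums, so the reduce stays associative; the division happens only once, after the job is done.

    # Map: each worker turns its shelf into a summable (pages, books) pair.
    shelves = [[300, 150, 450], [200, 100]]  # page counts per shelf
    pairs = [(sum(pages), len(pages)) for pages in shelves]  # [(900, 3), (300, 2)]

    # Reduce: add the pairs component-wise.
    total_pages = sum(p for p, b in pairs)  # 1200
    total_books = sum(b for p, b in pairs)  # 5

    # Averaging each shelf and then averaging the averages would weight
    # shelves unfairly; dividing the two grand totals does not.
    average = total_pages / total_books  # 240.0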


Now if only everyone's mom understood the concept of a linear set operation... :)

A good example that would also help someone appreciate the power of map/reduce might be something like "how many times does each letter of the alphabet appear in the library?" For the map step I take shelf 1 and you take shelf 2, but now for each book we'll have to write down something like "a:300,b:50,c:90,..." and so on. Now there are more things to "reduce" than there are books (since we have a count for each letter of the alphabet for every book). Well, there's nothing stopping us from splitting up the reduce step as well: we could split it up by shelf again (you add the totals for each letter on your shelf, I'll do mine, and then we'll do one final reduce at the end), or you can add up letters a-m and I'll do n-z.
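
Something like this, as a Python sketch (book "texts" are invented single words): the map step produces one letter tally per shelf, and the reduce step is itself split by key, so each reducer owns a disjoint slice of the alphabet.

    from collections import Counter

    shelves = [["dune", "hyperion"], ["sicp", "taocp"]]  # book texts, made up

    # Map: one letter-frequency Counter per shelf.
    partials = [Counter("".join(shelf)) for shelf in shelves]

    # Reduce, split by key: one reducer sums letters a-m, the other n-z.
    a_to_m, n_to_z = Counter(), Counter()
    for partial in partials:
        for letter, n in partial.items():
            (a_to_m if letter <= "m" else n_to_z)[letter] += n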


I remember a Stanford lecture video that said "well, if we called it Map, Sort, Group-reduce then no-one would use it :P". That's a minor naming issue, but it captures exactly the subtlety you're describing: the specifics of how a MapReduce is executed aren't entirely arbitrary.


Why wouldn't you be able to get the average? Return the book count plus the total pages for each shelf and then take the ratio of the "reduced" sums.


Re-read the comment. It discusses exactly what you are proposing.


Oh, I see. I was struck by the direct phrasing of "You couldn't take the average number of pages", which made me parse the rest of the comment incorrectly.


I have used a similar explanation using fields of wheat, combine harvesters and grain trucks.

The combine harvesters are performing a Map - they do the heavy work and actually modify the contents of the fields. The more combines you have the quicker the job gets done.

The reduce step is the trucks that collect the grain and combine it into one pile (back at the silo).

I like this analogy because it shows that some natural jobs are linearly divisible, needing no synchronisation between workers, while others are more complicated because results have to be combined.

Each combine has no need to synchronise with the others - it can be told which fields to harvest and be left to do it.

During the grain collection you can have many trucks, but it soon becomes clear that there are loads of ways to organise them depending on the size of the trucks and fields and the distance to the depot. Trucks could combine loads in the fields before returning to the grain silo. There is always a danger that one truck might be left waiting for another.

This analogy also lets us discuss whether we need to wait for all the grain to be harvested before sending out the trucks to gather it in, and whether we need the same number of trucks as combine harvesters.


MapReduce in ten seconds or less. Impressive.


I love clear, succinct explanations like this. Overly complex ("rigorous") definitions are one reason we often miss the big picture. Kudos.


The example in the blog was good enough and made me understand it quite well, but you nailed it in two lines. Very nice man, thanks.


I'm stunned.

I finally get MapReduce.

I'd looked at definitions but it had never really clicked.

Thank you, and upvoted.


You also could have gone to the card catalog and measured the number of cards in feet with a tape measure. That's indexing your database.


The example I came up with when trying to explain the concept was pretty similar, but I framed it as the problem of finding the total value of a huge pile of coins.


"You count up shelf #1, I count up shelf #2."

Isn't that a reduce too, though?


Sure, any iterative process is implementable in terms of reduce, so the mapping itself is a reduction, and each counting instance within the map is a reduction as well (unless you map over books rather than shelves, though even that can be described as a reduction over a single-item set).
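
For instance, a Python sketch of the "map is itself a reduction" point (just an illustration, not anything from a real framework):

    from functools import reduce

    def map_via_reduce(f, xs):
        # Fold over xs, appending f(x) to an accumulator list at each step.
        return reduce(lambda acc, x: acc + [f(x)], xs, [])

    map_via_reduce(len, [["Dune", "Hyperion"], ["SICP"]])  # -> [2, 1]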


I think so, too, but it doesn't matter. The function you apply to each element of your list when doing a map can be anything, so it may even be something like counting up.


In my analogy, I was assuming a shelf is a single record.


This is even better.


Impressive.


I still don't get it. It would have been clearer as something like: "we have a lot of processing to handle; we share it among computers to get the job done."


Nice. Much better.


I wish I could upvote you more...



