Next iteration: But what if you want to count fiction and non-fiction separately? There's just a slightly different "reduce" step, because people have to add up both counts. You could also break it down by subject, or author, or pretty much anything else.
Next iteration: Of course, you don't have to just add. You can do anything that has mathematical properties similar to addition (my mom teaches high-school math). You could collect the set of all offensive words found, since addition and set union have similar mathematical properties. You couldn't take the average number of pages directly, because some people will count more books than others, and you wouldn't get a fair average. But you could keep track of a couple of attributes that can be summed up, and use them to calculate the average after the job is done (for example, total number of pages and total number of books).
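To make that concrete, here's a rough Python sketch (the shelf data is invented): each mapper emits a summable (total_pages, book_count) pair instead of an average, and the average falls out at the end.

    from functools import reduce

    # Invented data: each shelf is a list of page counts, one per book.
    shelves = [[300, 150, 420], [95, 210], [510, 330, 270, 180]]

    # Map: each worker summarizes its shelf as a summable pair.
    def summarize(shelf):
        return (sum(shelf), len(shelf))  # (total pages, number of books)

    partials = [summarize(shelf) for shelf in shelves]

    # Reduce: the pairs add component-wise, just like plain numbers.
    total_pages, total_books = reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)

    # The average is derived only after the job is done.
    print(total_pages / total_books)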
Now if only everyone's mom understood the concept of a linear set operation... :)
A good example that would also help someone appreciate the power of map/reduce might be something like "how many times does each letter of the alphabet appear in the library?" For the map step I take shelf 1 and you take shelf 2, but now for each book we'll have to write down something like "a:300, b:50, c:90, ..." and so on. But now there are more things to "reduce" than there are books, since we have a count for every letter of the alphabet for every book. Well, there's nothing stopping us from splitting up the reduce step as well: we could split it by shelf again (you add the totals for each letter on your shelf, I'll do mine, and then we do one final reduce at the end), or you can add up letters a-m and I'll do n-z.
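A rough Python sketch of the second way of splitting the reduce, with made-up book texts:

    from collections import Counter

    # Made-up "shelves" of book texts.
    shelf_1 = ["aardvarks abound", "banana bread"]
    shelf_2 = ["cats and canaries", "zebras zigzag"]

    # Map: per-book letter counts, one Counter per book.
    mapped = [Counter(c for c in book if c.isalpha())
              for book in shelf_1 + shelf_2]

    # Reduce, split by key range: you take a-m, I take n-z.
    def reduce_range(counts, letters):
        total = Counter()
        for c in counts:
            total.update({k: v for k, v in c.items() if k in letters})
        return total

    yours = reduce_range(mapped, set("abcdefghijklm"))
    mine = reduce_range(mapped, set("nopqrstuvwxyz"))

    # One final merge at the end.
    print(yours + mine)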
I remember a Stanford lecture video that said "well, if we called it Map, Sort, Group-reduce then no one would use it :P". That's a minor naming issue, but it captures exactly the subtlety you're describing: a MapReduce is not entirely arbitrary in the specifics of how it executes.
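For what it's worth, the "Sort, Group" part is easy to show; this is just a toy Python illustration of that hidden middle step, not how any real framework implements it:

    from itertools import groupby
    from operator import itemgetter

    # Made-up mapper output: (key, value) pairs from several workers.
    mapped = [("fiction", 1), ("sci-fi", 1), ("fiction", 1), ("sci-fi", 1)]

    # Sort, then group by key: the middle step the short name hides.
    mapped.sort(key=itemgetter(0))
    grouped = {k: [v for _, v in g]
               for k, g in groupby(mapped, key=itemgetter(0))}

    # Group-reduce: each key's values are reduced independently.
    print({k: sum(vs) for k, vs in grouped.items()})  # {'fiction': 2, 'sci-fi': 2}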
Oh, I see. I was struck by the direct phrasing of "You couldn't take the average number of pages", which made me parse the rest of the comment incorrectly.
I have used a similar explanation using fields of wheat, combine harvesters and grain trucks.
The combine harvesters perform the Map: they do the heavy work and actually modify the contents of the fields. The more combines you have, the quicker the job gets done.
The reduce step is the trucks that collect the grain and combine it into one pile (back at the silo).
I like this analogy as it shows that some natural jobs are linearly divisible as they need no synchronisation between workers while others are more complicated due to needing to combine results.
Each combine has no need to synchronise with the others - it can be told which fields to harvest and be left to do it.
During the grain collection you can have many trucks, but it soon becomes clear that there are loads of ways to organise them depending on the size of the trucks, the size of the fields and the distance to the depot. Trucks could combine loads in the fields before returning to the grain silo. There is always a danger that one truck might be left waiting for another.
This analogy also lets us discuss whether we need to wait for all the grain to be harvested before we send out the trucks to start gathering it in, and whether we need the same number of trucks as combine harvesters.
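Trucks combining loads in the fields is roughly what Hadoop calls a combiner: a local reduce run before anything travels to the silo. A toy Python sketch with invented yields:

    # Invented yields: each field is a list of loads (tonnes of grain).
    fields = {"north": [2.5, 3.1, 2.8], "south": [4.0, 3.6]}

    # Combiner: each truck merges the loads from its own field first...
    local_piles = {name: sum(loads) for name, loads in fields.items()}

    # ...so the final reduce at the silo sees one value per field,
    # not one value per load.
    print(sum(local_piles.values()))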
The example I came up with when trying to explain the concept was pretty similar, but I framed it as the problem of finding the total value of a huge pile of coins.
Sure: any iterative process can be implemented in terms of reduce, so the mapping itself is a reduction, and each counting instance within the map is a reduction as well (unless you map over books rather than shelves, though even that can be described as a reduction over a single-item set).
I think so, too, but it does not matter. The function you use on the elements in your list when doing a map can be anything, so it may even be something like counting up.
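In Python terms, a toy illustration of map expressed as a reduction, where the accumulator is the output list:

    from functools import reduce

    # map implemented via reduce: each step appends one transformed element.
    def map_via_reduce(fn, items):
        return reduce(lambda acc, x: acc + [fn(x)], items, [])

    # e.g. counting up the books on each shelf.
    print(map_via_reduce(len, [["b1", "b2"], ["b3"]]))  # [2, 1]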
I still don't get it. It would have been clearer if it was something like:
"we have a lot of processing to handle. we share it among computers to get the job done."
Like all good lessons, this one starts with a little lie and tells you the truth later on...
"Supriya : But how I will create different types of Chutneys?
Shekhar : Now you will see the missing phase of MapReduce — Shuffle phase. MapReduce will group all the outputs written by every Map based on the key. This will be automatically done for you. You can assume key as just a name of ingredient like Onion. So all the onion keys will be grouped together and will be transferred to a grinder which will just process onions. So, you will get onion Chutney. Similarly all the tomatoes will be transferred to the grinder marked for tomato and will produce tomato Chutney."
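Stripped of the chutney, the shuffle phase is just grouping map output by key before the reducers run. A toy Python sketch (the ingredient keys are from the quoted article):

    from collections import defaultdict

    # Made-up mapper output: (ingredient, amount) pairs from the choppers.
    mapped = [("onion", 1), ("tomato", 1), ("onion", 1), ("garlic", 1)]

    # Shuffle: the framework groups values by key automatically.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: each "grinder" processes exactly one key's values.
    print({key: sum(values) for key, values in groups.items()})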
You see, women love cooking. The article is kind of hilarious in a chauvinistic sense, but it _does_ do a decent job of describing MapReduce. The only problem I see with simple overviews like this is that they leave out the fact that, oftentimes, the reduce step is far more complicated than adding numbers together. This explanation pretty much applies to any basic framework for parallelism.
On a side note, I was surprised when MapReduce made such waves, as it's nothing _new_, per se. I still can't believe Google was granted a patent for it. I wrote "MapReduce"-esque algorithms in early high school when I made a distributed system to compute Pi. All it really does is provide a framework for creating, dispatching and joining jobs.
Suppose you want to count the number of books for each author.
Mappers take books from the shelves and bring them to shufflers; shufflers make piles of books, one pile per author; reducers compute the size of each pile.
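A rough Python sketch of that pipeline, with invented titles and authors:

    from collections import defaultdict

    # Invented shelves of (title, author) pairs.
    shelves = [
        [("Book A", "Austen"), ("Book B", "Tolstoy")],
        [("Book C", "Austen"), ("Book D", "Austen")],
    ]

    # Map: emit (author, 1) for every book on every shelf.
    mapped = [(author, 1) for shelf in shelves for _, author in shelf]

    # Shuffle: pile the books up, one pile per author.
    piles = defaultdict(list)
    for author, one in mapped:
        piles[author].append(one)

    # Reduce: the size of each pile.
    print({author: len(pile) for author, pile in piles.items()})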
The most valuable idea in that article is to ask for permission twice before launching into an overcomplicated, bad analogy to explain something your wife doesn't really care about.
There is no value in explaining a solution to someone who does not yet care about the problem.
In this case, someone asking "what is MapReduce?" is just looking for an explanation of why these words are being used outside their common meaning, not for the whole algorithm.
I think the analogy was "meh" at best. For those of us with any technical background, the Wikipedia explanation is not bad:
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
This is nice. It's always been tough to explain quickly to friends what MapReduce does and how it extends to things like Hadoop, without getting into the abstractions with keys and values and the like.
I use an explanation based on elections, where cities, counties, states, etc. (whatever applies in your $country) count and aggregate their tallies upwards until it's clear who has won at the national level. :)
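A rough Python sketch of that idea, with invented precinct tallies:

    # Invented precinct tallies, aggregated upwards level by level.
    counties = {
        "county_a": [{"red": 120, "blue": 90}, {"red": 40, "blue": 75}],
        "county_b": [{"red": 300, "blue": 310}],
    }

    def merge(tallies):
        total = {}
        for t in tallies:
            for candidate, votes in t.items():
                total[candidate] = total.get(candidate, 0) + votes
        return total

    # Each county reduces its precincts, then the country reduces the counties.
    county_totals = {name: merge(ps) for name, ps in counties.items()}
    national = merge(county_totals.values())
    print(max(national, key=national.get))  # the national winner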
It's not notable that he was able to explain it to his wife because she's a woman, it's notable because she's not in the field. This is really pretty obvious (and explicitly stated in the article).
We want to count all the books in the library. You count up shelf #1, I count up shelf #2. That's map. The more people we get, the faster it goes.
Now we get together and add our individual counts. That's reduce.
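The whole thing in a few lines of Python, with invented shelf contents:

    # Map: each of us counts one shelf (these could run in parallel).
    shelves = [["b1", "b2", "b3"], ["b4", "b5"]]
    counts = [len(shelf) for shelf in shelves]

    # Reduce: get together and add our individual counts.
    print(sum(counts))  # 5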
[edit: created a blog post containing this explanation. Maybe it will be useful after this HN conversation vanishes. http://crazybear.posterous.com/mapreduce-explained-in-41-wor... ]