MacGyver’s MapReduce in Python! Part 1: Theory
Scope of This Article
Beginner / Intermediate. Assumes some knowledge of basic computer science. If you’re looking for some serious meat and potatoes, come back next week. If you’re looking for an article that isn’t about MapReduce because you’re sick of hearing about it, come back next month.
Introductory B.S.
TCNJ has about 20 Linux and 30 Unix machines in computer labs, but a HUGE chunk of their time is spent idling. The labs, for the most part, are pretty sparsely used.
I’ve been looking for a good professor-excused reason to do some major cluster computing, and finally my time has come! Another student and I are doing cryptography research to design a new hash algorithm (blog post to follow after Third Quarter 2008!).
Guess what cryptographic hash algorithms need? Lots and lots and lots of statistical testing of millions of separate hashes. It would take my laptop an unfortunate amount of time to process dozens of tests with millions of records even if we could parse 100mBps of data with our design. God help us when we want to change something: all of the testing needs to begin all over again. Yuck!
This situation sounds ideal for MapReduce. Why should I use my own machine when I can make other machines do it? I am a geek, purveyor of the computer lab, master and commander of all that lay before me. Let’s put those things to work!
While I’m on the “Looking For An Excuse” train, I’ve played around with Python extensively in the past and never produced anything substantial. I’ve always been impressed with the speed of which I’ve been able to make things in Python, and in this situation, “Batteries Included” will help me immensely.
So what’s the catch?
I have absolutely no control over what software is on any of the machines. I get to decide between Java 1.5, Python 2.4, and gcc/g++ 3.4.1. I’m not allowed to run as root on any of the machines. Any task needs to be run as the user, so I need to be logged in to the machines when the commands are run (if I’m not right about that, I’d love to hear it, but every way I’ve found so far to run a job in the future requires the user to be logged in).
The consolation prize is that they are all pretty beefy, with decent hard drive space (most empty!), 2GB of memory, and dual-core AMD processors.
System Architecture And Design
The machines are behind a central server (Named “TomHanks” for the sake of discussion… this is not the actual name of the computer) that also doesn’t do much. The machines don’t have any contact through the outside world except through TomHanks. They are not all necessarily in the same room, and they are not necessarily running the same operating system.
All of the computers are on the same architecture, and do have access to the same shared drive. However, my space restrictions on the shared drive are tiny, making any kind of shared storage impractical in the general case. However, this does provide interesting opportunities, as my school-given website is accessible through here, making report-generation a possibility. On the other hand, the important part of my semester is the research I’m doing, not the actual MapReduce client, so I’m going to try to keep it as simple as possible.
Abstract Design of MapReduce
The idea of MapReduce is beautifully simple, and takes simple ideas from functional paradigms and applies them to the world of distributed computing. Let’s say that we have a list of data that we want to hash. Picking an arbitrary list, we’ll use [1, 2, 3, 4, ... n].
First, we apply the map() function to every single value of the array. Assuming our function is named f, our result is [f(1), f(2), f(3), ..., f(n)]
Aside: What is “map”?
When we “map” something, all we are doing is applying the same function to every element of a list, and storing the results in another list. For example, if we have the list [1,2,3,4], and we apply the function f(x)=2x to this list, we get the list [2,4,6,8].
In this example, it’s obvious how individual elements were transformed, but that is not always the case.
What are the constraints on the mappings? For my purpose, it works out best if the map is not always one-to-one, meaning that each time we evaluate the function, we can choose to return nothing, one value, or more values. It *can* be one-to-one, and for some tasks, it is necessary.
Then, we take the resulting array and evaluate a “reduce” function.
Aside: What is “reduce”?
“Reduce” is perhaps the least obvious name to pick. Other names for this idea are “fold” and “accumulate”.When we “reduce”, we are just combining all of the values in a list to a single value by applying some function.
How do we reduce the list [1,2,3,4] using addition? 1+2+3+4 = 10. We could also have converted it into a CSV string: “1,2,3,4″ or “4,3,2,1″. There are, of course, subtleties that I am skipping. If you would like more of them, please go here.
And there you have it, MapReduce!
That’s It?
As you would expect, we need to make alterations (either above the hood or under the hood) in order to better apply this to the real world.
For our real model, we will borrow from Google’s design. Before I read Google’s paper on MapReduce, I drew out how I thought that the model should work, and what I called PreReduce they called a Combiner, which I like better. It allows us, in certain cases, to do some of the heavy lifting of the Reduce on the satellite machines instead of the client machine.
For example, if we were trying to add the numbers from 1 to 100 trillion, why would we bother passing back the numbers [1, 2, 3, 4, ... n] for the host machine to add together? Addition of integers is commutative and associative, so we can add them on the satellite machine and then send back one large integer instead of an arbitrary number of small integers.
What is our high-level algorithm?
Ideally, what do we want a client to do?
- Get Work
- Map()
- Combine()
- Return Work.
- Reply to “Are you alive?” pings.
- Securely send results (as I’m sure my friends will try to send LoLcat messages to home base).
What do we want a server to do?
- Start all satellite machines.
- Respond to requests for data.
- Perform Reduce
- Occasionally check to see which machines are alive.
- Generate progress reports (optional).
- Track responses and keep a “Shit List” of computers that give a few bad responses (throwing the computer in jail for a few minutes and then trusting it again is probably the most practical).
So what can we do at a high level to get all of this rolling?
The MapReduce Process:
- Start a central server that will dole out work tasks.
- The central server starts the satellite machines. Fortunately, all of the machines share a network drive, so this is a relatively trivial task involving a little SSH action.
- The satellite machines phone home for work. Why do we do it this way instead of letting the central server give work with the start commands? In short, a friend of mine thinks it’s hysterical to fork-bomb a machine every now and then (and sometimes war erupts during class), so I REALLY can’t guarantee that ANY of the machines will be working at any given point. If the machines phone home for work, we are more likely to have work units parsed out sooner.
- The satellite machines perform map() on every input value. It is obvious that the satellite machines should be doing the heavy lifting of the map() function. There is no law that says that the resulting map has to be stored in one place, and in this case, it will be stored in 50 places for the time being.
- The satellite machines perform combine() on every input value. As mentioned above, this lets us partition off some of the work of the Reduce function to the satellite machines.
- The satellite machines return the results to the central machine.
- The central machine pings satellite machines that are long in responding.
- Repeat Steps 3-6 until all work units are done.
- The central server performs Reduce.
What’s next?
Next week we will look into how we will organize the SSH connections, how we will login to the machines, and how we will trust “secure” answers when the satellites don’t have any kind of built in cryptography.
Popularity: 52% [?]


Reader Comments
Have you thought about building a cluster boot cd? If you could have the workstations boot from a CD, even as simple as DSL or knoppix, then download tasks from the server?
This would allow full control of the pc, using the 2 gig of ram as temporary storage, very little concern for problems with the cluster, and you could burn as many CD’s as you want. If you could boot 3 labs on campus on a saturday night and leave them running until monday morning that would probably be enough power for anything you are trying.
For example:
http://bccd.cs.uni.edu/
“The BCCD was created to facilitate instruction of parallel computing aspects and paradigms. Part of the difficulty instructors face is lack of dedicated resources to explore distributed computing aspects lack of time to preconfigure and test the supporting environment. The BCCD image addresses this problem by providing a non-destructive overlay way to run a full-fledged parallel computing environment on just about any workstation-class system…We’re happy to say that this now includes the MAC too!”
This would be very interesting to see your results, and the “super computing rating” you get when the cluster is running.
you could see if you can run it as a crontab ^.^ then you would not have to be logged in….
@Bradford,
Unfortunately, students and faculty are able to use the machines via physical access (as well as SSH) all weekend long, so this is not an option. I looked at a few options like this, and unfortunately, all of them weren’t good enough. Unfortunately, I’m stuck with “MacGyver’s MapReduce” because I’m trying to make a distributed computing environment with extreme limitations.
Ever consider using BOINC? You can get free CPU time on thousands of machines.
Plan:
1. Setup a BOINC server.
2. use your application (your new hash algorithm) to demonstrate BOINC to other researchers at your school
3. get a job at school administering the BOINC server and helping other researchers write their apps to run on BOINC
http://boinc.berkeley.edu
@rainfay: It looks like Crontab will work. Thanks!