<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>So Jake Says &#187; distributed computing</title>
	<atom:link href="http://www.jakevoytko.com/blog/tag/distributed-computing/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.jakevoytko.com/blog</link>
	<description>Ye Olde Computer Science Blogge</description>
	<lastBuildDate>Sun, 17 Jan 2010 15:16:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>MacGyver&#8217;s MapReduce in Python! Part 1: Theory</title>
		<link>http://www.jakevoytko.com/blog/2008/02/17/macgyvers-mapreduce-in-python-part-1-theory/</link>
		<comments>http://www.jakevoytko.com/blog/2008/02/17/macgyvers-mapreduce-in-python-part-1-theory/#comments</comments>
		<pubDate>Sun, 17 Feb 2008 19:21:22 +0000</pubDate>
		<dc:creator>Jake</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[Macgyver's MapReduce]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Security]]></category>

		<guid isPermaLink="false">http://www.jakevoytko.com/blog/2008/02/17/macgyvers-mapreduce-in-python-part-1-theory/</guid>
		<description><![CDATA[Scope of This Article Beginner / Intermediate. Assumes some knowledge of basic computer science. If you&#8217;re looking for some serious meat and potatoes, come back next week. If you&#8217;re looking for an article that isn&#8217;t about MapReduce because you&#8217;re sick of hearing about it, come back next month. Introductory B.S. TCNJ has about 20 Linux [...]]]></description>
			<content:encoded><![CDATA[<h2>Scope of This Article</h2>
<p>Beginner / Intermediate. Assumes some knowledge of basic computer science. If you&#8217;re looking for some serious meat and potatoes, come back next week. If you&#8217;re looking for an article that isn&#8217;t about <em>MapReduce </em>because you&#8217;re sick of hearing about it, come back next month.</p>
<h2>Introductory B.S.</h2>
<p>TCNJ has about 20 Linux and 30 Unix machines in computer labs, but a HUGE chunk of their time is spent idling. The labs, for the most part, are pretty sparsely used.</p>
<p>I&#8217;ve been looking for a good professor-excused reason to do some major cluster computing, and finally my time has come! Another student and I are doing cryptography research to design a <a href="http://csrc.nist.gov/groups/ST/hash/sha-3/index.html">new hash algorithm</a> (blog post to follow after Third Quarter 2008!).</p>
<p>Guess what cryptographic hash algorithms need? Lots and lots and lots of statistical testing of millions of separate hashes. It would take my laptop an unfortunate amount of time to process dozens of tests with millions of records even if we <em>could</em> parse 100mBps of data with our design. God help us when we want to change something: all of the testing needs to begin all over again. Yuck!</p>
<p>This situation sounds ideal for <em><a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a></em>. Why should I use my own machine when I can make other machines do it? I am a geek, purveyor of the computer lab, master and commander of all that lies before me. Let&#8217;s put those things to work!</p>
<p>While I&#8217;m on the &#8220;Looking For An Excuse&#8221; train, I&#8217;ve played around with Python extensively in the past and never produced anything substantial. I&#8217;ve always been impressed with the speed with which I&#8217;ve been able to make things in Python, and in this situation, &#8220;Batteries Included&#8221; will help me immensely.</p>
<h2>So what&#8217;s the catch?</h2>
<p>I have absolutely no control over what software is on any of the machines. I get to decide between Java 1.5, Python 2.4, and gcc/g++ 3.4.1. I&#8217;m not allowed to run as root on any of the machines. Any task needs to be run as the user, so I need to be logged in to the machines when the commands are run (if I&#8217;m not right about that, I&#8217;d love to hear it, but every way I&#8217;ve found so far to run a job in the future requires the user to be logged in).</p>
<p>The consolation prize is that they are all pretty beefy, with decent hard drive space (most empty!), 2GB of memory, and dual-core AMD processors.</p>
<h2>System Architecture And Design</h2>
<p>The machines are behind a central server (<em>named &#8220;TomHanks&#8221; for the sake of discussion&#8230; this is not the actual name of the computer</em>) that also doesn&#8217;t do much. The machines don&#8217;t have any contact with the outside world except through <em>TomHanks</em>. They are not all necessarily in the same room, and they are not necessarily running the same operating system.</p>
<p><a title="networkarch1.png" rel="attachment wp-att-53" href="http://www.jakevoytko.com/blog/2008/02/17/macgyvers-mapreduce-in-python-part-1-theory/attachment/53/"><img src="http://www.jakevoytko.com/blog/wp-content/uploads/2008/02/networkarch1.png" alt="networkarch1.png" /></a></p>
<p>All of the computers share the same architecture and have access to the same shared drive. My space quota on the shared drive is tiny, though, which makes any kind of shared storage impractical in the general case. The drive does provide one interesting opportunity: my school-given website is accessible through it, making report generation a possibility. On the other hand, the important part of my semester is the research I&#8217;m doing, not the MapReduce implementation itself, so I&#8217;m going to try to keep it as simple as possible.</p>
<h2>Abstract Design of <em>MapReduce</em></h2>
<p>The idea of <em>MapReduce</em> is beautifully simple: it takes a couple of ideas from functional programming and applies them to the world of distributed computing. Let&#8217;s say that we have a list of data that we want to hash. Picking an arbitrary list, we&#8217;ll use<em> [1, 2, 3, 4, ... n].</em></p>
<p>First, we use<em> map()</em> to apply a function to every single value of the list. Assuming our function is named <em>f</em>, our result is <em>[f(1), f(2), f(3), ..., f(n)]</em>.</p>
<blockquote><p><strong>Aside: What is &#8220;map&#8221;?<br />
</strong></p>
<p>When we &#8220;map&#8221; something, all we are doing is applying the same function to every element of a list, and storing the results in another list. For example, if we have the list <em>[1,2,3,4]</em>, and we apply the function<em> f(x)=2x </em>to this list, we get the list<em> [2,4,6,8]</em>.</p>
<p>In this example, it&#8217;s obvious how individual elements were transformed, but that is not always the case.</p>
<p>What are the constraints on the mappings? For my purpose, it works out best if the map is not always one-to-one, meaning that each time we evaluate the function, we can choose to return nothing, one value, or several values. It <em>can</em> be one-to-one, and for some tasks, it is necessary.</p></blockquote>
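<p>A minimal sketch of the aside above, written for Python 3 (an assumption; the lab machines only offer Python 2.4). The names <em>double</em>, <em>expand</em> and <em>flat_map</em> are hypothetical helpers made up for illustration. The first call is the one-to-one case from the aside; the second shows a mapper that returns nothing, one value, or several values per input.</p>

```python
from itertools import chain


def double(x):
    """The aside's example function, f(x) = 2x."""
    return 2 * x


def expand(x):
    """Hypothetical mapper that is not one-to-one.

    It returns a list holding (x % 3) copies of x, so an input can
    produce nothing, one value, or several values.
    """
    return [x] * (x % 3)


def flat_map(func, values):
    """Apply func to every value and concatenate the lists it returns."""
    return list(chain.from_iterable(func(v) for v in values))


doubled = list(map(double, [1, 2, 3, 4]))
expanded = flat_map(expand, [1, 2, 3, 4])

print(doubled)   # [2, 4, 6, 8]
print(expanded)  # [1, 2, 2, 4]
```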
<p>Then, we take the resulting array and evaluate a &#8220;reduce&#8221; function.</p>
<blockquote><p><strong>Aside: What is &#8220;reduce&#8221;?<br />
</strong><br />
&#8220;Reduce&#8221; is perhaps the least obvious name to pick. Other names for this idea are &#8220;fold&#8221; and &#8220;accumulate&#8221;.</p>
<p>When we &#8220;reduce&#8221;, we are just combining all of the values in a list to a single value by applying some function.</p>
<p>How do we reduce the list <em>[1,2,3,4]</em> using addition? <em>1+2+3+4 = 10</em>. We could also have converted it into a CSV string: <em>&#8220;1,2,3,4&#8221;</em> or <em>&#8220;4,3,2,1&#8221;</em>. There are, of course, subtleties that I am skipping. If you would like more of them, please go <a href="http://en.wikipedia.org/wiki/Reduce_%28higher-order_function%29">here</a>.</p></blockquote>
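<p>The three reductions from the aside can be sketched with <em>functools.reduce</em> from the standard library (Python 3 assumed). The lambdas are illustrative only; in everyday code, <em>sum()</em> and <em>&#8220;,&#8221;.join()</em> would do the same jobs more directly.</p>

```python
from functools import reduce
import operator

values = [1, 2, 3, 4]

# Fold with addition: ((1 + 2) + 3) + 4
total = reduce(operator.add, values)

# Fold into a CSV string, appending each new value on the right.
csv_forward = reduce(lambda acc, x: acc + "," + str(x),
                     values[1:], str(values[0]))

# Same fold, but prepending each new value on the left.
csv_reversed = reduce(lambda acc, x: str(x) + "," + acc,
                      values[1:], str(values[0]))

print(total)         # 10
print(csv_forward)   # 1,2,3,4
print(csv_reversed)  # 4,3,2,1
```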
<p>And there you have it, <em>MapReduce</em>!</p>
<h2>That&#8217;s It?</h2>
<p>As you would expect, we need to make alterations (either above the hood or under the hood) in order to better apply this to the real world.</p>
<p>For our real model, we will borrow from Google&#8217;s design. Before I read <a href="http://labs.google.com/papers/mapreduce.html">Google&#8217;s paper on MapReduce</a>, I drew out how I thought that the model should work, and what I called <em>PreReduce</em> they called a <em>Combiner</em>, which I like better. It allows us, in certain cases, to do some of the heavy lifting of the <em>Reduce</em> on the satellite machines instead of on the central server.</p>
<p>For example, if we were trying to add the numbers from 1 to 100 trillion, why would we bother passing back the numbers <em>[1, 2, 3, 4, ... n]</em> for the central server to add together? Addition of integers is commutative and associative, so we can add them on the satellite machine and then send back one large integer instead of an arbitrary number of small integers.</p>
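<p>Here is a small sketch of that idea in Python 3, scaled down to the numbers 1 to 1000 so it runs instantly. The helpers <em>split_range</em> and <em>combine</em> are hypothetical names, and the &#8220;satellites&#8221; are simulated by a plain loop. Each work unit is collapsed to one partial sum before anything is sent back, so the central server only has to add seven integers.</p>

```python
def split_range(start, stop, chunks):
    """Split the integers start..stop (inclusive) into contiguous work units."""
    size, extra = divmod(stop - start + 1, chunks)
    units, low = [], start
    for i in range(chunks):
        # The first `extra` units each take one leftover value.
        high = low + size - 1 + (1 if extra > i else 0)
        units.append((low, high))
        low = high + 1
    return units


def combine(unit):
    """Runs on a satellite: collapse one work unit into a single integer."""
    low, high = unit
    return sum(range(low, high + 1))


units = split_range(1, 1000, 7)
partials = [combine(unit) for unit in units]  # one integer per satellite
total = sum(partials)                         # the final Reduce on the server

print(len(partials))  # 7
print(total)          # 500500
```

<p>This only works because integer addition is commutative and associative: the partial sums can be computed in any grouping and arrive in any order without changing the total.</p>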
<h2>What is our high-level algorithm?</h2>
<p>Ideally, what do we want a client (a satellite machine) to do?</p>
<ul>
<li>Get Work</li>
<li>Map()</li>
<li>Combine()</li>
<li>Return Work.</li>
<li>Reply to &#8220;Are you alive?&#8221; pings.</li>
<li>Securely send results (as I&#8217;m sure my friends will try to send LoLcat messages to home base).</li>
</ul>
<p>What do we want a server to do?</p>
<ul>
<li>Start all satellite machines.</li>
<li>Respond to requests for data.</li>
<li>Perform <em>Reduce</em></li>
<li>Occasionally check to see which machines are alive.</li>
<li>Generate progress reports (optional).</li>
<li>Track responses and keep a &#8220;Shit List&#8221; of computers that give a few bad responses (throwing the computer in jail for a few minutes and then trusting it again is probably the most practical).</li>
</ul>
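<p>The last server item, jailing a machine for a few minutes and then trusting it again, could be sketched roughly as follows (Python 3 assumed). Everything here is hypothetical: the <em>JailList</em> class, the strike limit, the jail length and the host name are illustrative choices, not part of the actual design. The clock is injectable so the behaviour can be checked without waiting.</p>

```python
import time


class JailList:
    """Track bad responses per host and distrust repeat offenders for a while."""

    def __init__(self, max_strikes=3, jail_seconds=300, clock=time.time):
        self.max_strikes = max_strikes
        self.jail_seconds = jail_seconds
        self.clock = clock        # injectable so tests can fake the time
        self.strikes = {}         # host -> bad responses since last jailing
        self.released_at = {}     # host -> time at which trust is restored

    def report_bad(self, host):
        """Record one bad response; jail the host once it hits the limit."""
        self.strikes[host] = self.strikes.get(host, 0) + 1
        if self.strikes[host] >= self.max_strikes:
            self.released_at[host] = self.clock() + self.jail_seconds
            self.strikes[host] = 0

    def is_trusted(self, host):
        """Return True unless the host is currently serving jail time."""
        release = self.released_at.get(host)
        if release is None:
            return True
        if self.clock() >= release:
            del self.released_at[host]   # sentence served, trust again
            return True
        return False


# Usage with a fake clock (a one-element list used as a mutable counter).
now = [0]
jail = JailList(max_strikes=2, jail_seconds=60, clock=lambda: now[0])
jail.report_bad("lab07")
jail.report_bad("lab07")
print(jail.is_trusted("lab07"))  # False
now[0] = 61
print(jail.is_trusted("lab07"))  # True
```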
<p>So what can we do at a high level to get all of this rolling?</p>
<p><strong>The <em>MapReduce</em> Process:</strong></p>
<ol>
<li><strong>Start a central server that will dole out work tasks</strong>.</li>
<li><strong>The central server starts the satellite machines</strong>. Fortunately, all of the machines share a network drive, so this is a relatively trivial task involving a little SSH action.</li>
<li><strong>The satellite machines phone home for work</strong>. Why do we do it this way instead of letting the central server give work with the start commands? In short, a friend of mine thinks it&#8217;s hysterical to fork-bomb a machine every now and then (and sometimes war erupts during class), so I REALLY can&#8217;t guarantee that ANY of the machines will be working at any given point. If the machines phone home for work, we are more likely to have work units parceled out sooner.</li>
<li><strong>The satellite machines perform map() on every input value. </strong>It is obvious that the satellite machines should be doing the heavy lifting of the map() function. There is no law that says that the resulting map has to be stored in one place, and in this case, it will be stored in 50 places for the time being.</li>
<li><strong>The satellite machines perform combine() on the output of map().</strong> As mentioned above, this lets us partition off some of the work of the <em>Reduce</em> function to the satellite machines.</li>
<li><strong>The satellite machines return the results to the central machine.</strong></li>
<li><strong>The central machine pings satellite machines that are long in responding.</strong></li>
<li><strong>Repeat Steps 3-6 until all work units are done.</strong></li>
<li><strong>The central server performs <em>Reduce.</em></strong></li>
</ol>
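<p>The steps above can be sketched as a single-process simulation in Python 3. There is no networking or SSH here, and the function <em>run_job</em>, the satellite names and the <em>dead</em> set are all hypothetical stand-ins. A satellite listed in <em>dead</em> never answers, so its work unit goes back on the queue, which loosely models steps 7 and 8.</p>

```python
from collections import deque
from functools import reduce
import operator


def run_job(work_units, satellites, map_fn, combine_fn, reduce_fn, dead=()):
    """Simulate the process: hand out units, map, combine, then reduce."""
    if not [name for name in satellites if name not in dead]:
        raise RuntimeError("no live satellites to do the work")

    pending = deque(work_units)      # step 1: the server holds the work
    partials = []
    while pending:                   # step 8: repeat until nothing is left
        for name in satellites:      # step 3: each satellite phones home
            if not pending:
                break
            unit = pending.popleft()
            if name in dead:         # step 7: no answer, so requeue the unit
                pending.append(unit)
                continue
            mapped = [map_fn(v) for v in unit]    # step 4: map()
            partials.append(combine_fn(mapped))   # steps 5-6: combine, return
    return reduce(reduce_fn, partials)            # step 9: Reduce on the server


# Sum of the squares of 1..100, split into ten work units of ten numbers.
work_units = [list(range(i, i + 10)) for i in range(1, 101, 10)]
result = run_job(
    work_units,
    satellites=["lab01", "lab02", "lab03"],
    map_fn=lambda x: x * x,
    combine_fn=sum,
    reduce_fn=operator.add,
    dead={"lab02"},
)
print(result)  # 338350
```

<p>A real version would replace the inner loop with satellites that connect over the network and a server that times out slow responders, but the flow of data through map, combine and reduce stays the same.</p>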
<h2>What&#8217;s next?</h2>
<p>Next week we will look into how we will organize the SSH connections, how we will log in to the machines, and how we will trust &#8220;secure&#8221; answers when the satellites don&#8217;t have any kind of built-in cryptography.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jakevoytko.com/blog/2008/02/17/macgyvers-mapreduce-in-python-part-1-theory/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
