<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>So Jake Says &#187; Macgyver&#8217;s MapReduce</title>
	<atom:link href="http://www.jakevoytko.com/blog/tag/macgyvers-mapreduce/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.jakevoytko.com/blog</link>
	<description>Ye Olde Computer Science Blogge</description>
	<lastBuildDate>Sun, 17 Jan 2010 15:16:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>MacGyver&#8217;s MapReduce in Python! Part 2: Local Security</title>
		<link>http://www.jakevoytko.com/blog/2008/02/24/macgyvers-mapreduce-in-python-part-2-local-security/</link>
		<comments>http://www.jakevoytko.com/blog/2008/02/24/macgyvers-mapreduce-in-python-part-2-local-security/#comments</comments>
		<pubDate>Sun, 24 Feb 2008 22:00:42 +0000</pubDate>
		<dc:creator>Jake</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[cryptographic hash functions]]></category>
		<category><![CDATA[hash functions]]></category>
		<category><![CDATA[Macgyver's MapReduce]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Security]]></category>

		<guid isPermaLink="false">http://www.jakevoytko.com/blog/2008/02/24/macgyvers-mapreduce-in-python-part-2-local-security/</guid>
		<description><![CDATA[Scope of the Article Intermediate. This is a security discussion, but Python will be used. A basic overview of cryptographic hash functions will also be given. Note that the lowest version of Python is 2.3 (and I have no power to change this), so I may be restricted from performing currently optimal solutions. Feel free [...]]]></description>
			<content:encoded><![CDATA[<h2>Scope of the Article</h2>
<p>Intermediate. This is a security discussion, but Python will be used. A basic overview of cryptographic hash functions will also be given.</p>
<p>Note that the lowest version of Python installed on the machines is 2.3 (and I have no power to change this), so I may be restricted from using the currently optimal solutions. Feel free to make suggestions anyway.</p>
<p>For a refresher on the <em>MapReduce</em> process, check out <a href="http://www.jakevoytko.com/blog/2008/02/17/macgyvers-mapreduce-in-python-part-1-theory/">Part 1</a>.</p>
<h2>What I&#8217;m Up Against Storing Data Locally</h2>
<p><strong>Worker computers will be attacked.</strong> The people I share the lab with are my friends, and they will therefore do everything they can to mess my day up. Forkbombs and unplugging are probably not out of the question. My friends&#8217; skills run the gamut from *nix familiarity to mastery of college-level undergrad mathematics and cryptography courses.</p>
<p><strong>Attackers have physical access to the machines.</strong> All of the machines are in computer labs. Enough said.</p>
<p><strong>Very small shared drive between computers.</strong> The shared drive is large enough to store the final results, but not intermediate calculations.</p>
<p><strong>All computers have different environments</strong>, almost none of which I have control over. My personal laptop is a 1.6 GHz Intel Core 2 Duo, and I do have full control of what software it runs. The head machine in the network is a Blade server running Solaris, and it has Python 2.4. The worker machines are Linux machines running Python 2.3.</p>
<p>For the purposes of this exercise, I am considering my own area of the shared network drive secure. Anything that takes place within the computer&#8217;s memory or within my own area of the shared network drive will be considered &#8220;secure&#8221;, the truth be damned.</p>
<h2>Storing Data on Worker Machines</h2>
<p>As I mentioned above, there is not enough room for me to store intermediate data in the shared network drive, so I need to take advantage of the storage capacity of the worker machines. All of the worker machines are sitting right in a computer lab. These labs see mild-to-moderate use every day, so I can&#8217;t just use a live CD to start a distributed computing program.</p>
<p>Since the local drives of the machines are open for general use, I need to make a strategy to protect my data. First, let&#8217;s look at the naive way to store <em>MapReduce</em> data:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">def</span> do_work<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
    <span style="color: #808080; font-style: italic;"># The only commas are guaranteed to be separators.</span>
    <span style="color: #808080; font-style: italic;"># The work splitting is simplified for this example.</span>
    todo = <span style="color: #008000;">self</span>.<span style="color: black;">work</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">','</span><span style="color: black;">&#41;</span>
    temp_file = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">tmpfile</span>, <span style="color: #483d8b;">'w'</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> item <span style="color: #ff7700;font-weight:bold;">in</span> todo:
        result = <span style="color: #008000;">str</span><span style="color: black;">&#40;</span>mapreduce.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>item<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        temp_file.<span style="color: black;">write</span><span style="color: black;">&#40;</span>result + <span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
    temp_file.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>

<p>This gets the job done, but it could give me some serious headaches. What&#8217;s to stop someone from altering, inserting, or erasing messages (permissions aside)? We need <em>Cryptographic Hash Functions</em> for added security.</p>
<h2>Cryptographic Hash Functions</h2>
<p>A hash function, in general, takes an arbitrary length message and converts it to a <em>digest</em> of a fixed size. There is no guarantee as to whether or not an attacker can determine the original message, or create a second message with the same digest.</p>
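<p>As a quick illustration of the fixed-size digest, here is a minimal sketch. It assumes a modern Python 3 interpreter and its <em>hashlib</em> module rather than the Python 2.3 machines described in this post, and the helper name <em>sha1_hex</em> is made up for the example.</p>

```python
import hashlib

def sha1_hex(message):
    """Return the SHA-1 digest of a byte string as 40 hex characters."""
    return hashlib.sha1(message).hexdigest()

short_digest = sha1_hex(b"a")
long_digest = sha1_hex(b"a" * 1000000)

# Both digests are 160 bits (40 hex characters), whatever the input length.
print(len(short_digest), len(long_digest))

# A one-character change in the input gives an unrelated-looking digest.
print(sha1_hex(b"MapReduce"))
print(sha1_hex(b"MapReducf"))
```

<p>The one-byte input and the million-byte input both come back as 40 hex characters.</p>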
<p>A <em>Cryptographic Hash Function</em> is a hash function with the following properties:</p>
<p><strong>Preimage Resistance</strong>: Let <em>hash(message_a) = digest_a</em>. If the attacker is given only <em>digest_a</em>, it should be hard to determine <em>message_a</em>. If a hash digest is <em>n</em> bits, a preimage attack is expected to take about <img src='/blog/wp-content/plugins/latexrender/pictures/9aa0ec0374c89d2f7f3d9cd2e05a4bc5_1.0pt.gif' title='2^{n}' alt='2^{n}'  style="vertical-align:-1.0pt;" > operations.</p>
<p><strong>Second Preimage Resistance:</strong> If the attacker has a message, <em>message_a</em>, it should be hard to find a different <em>message_b</em> so that <em>hash(message_a)=hash(message_b)</em>. If a hash digest is <em>n</em> bits, a second preimage attack is expected to take about <img src='/blog/wp-content/plugins/latexrender/pictures/9aa0ec0374c89d2f7f3d9cd2e05a4bc5_1.0pt.gif' title='2^{n}' alt='2^{n}'  style="vertical-align:-1.0pt;" > operations.</p>
<p><strong>Collision Resistance:</strong> It should be hard for the attacker to make any two different messages, <em>message_a</em> and <em>message_b</em>, so that <em>hash(message_a) = hash(message_b)</em>. Because the attacker gets to choose both messages, the birthday paradox applies, and this is expected to take only about <img src='/blog/wp-content/plugins/latexrender/pictures/66512b246461090416f308a84b08c768_1.0pt.gif' title='2^{n/2}' alt='2^{n/2}'  style="vertical-align:-1.0pt;" > operations.</p>
<p><strong>Message extension resistance:</strong> It should be hard for the attacker, when given only <em>hash(message_a)</em>, to calculate the hash of any extension, <em>hash(message_a || message_b)</em>, where &#8220;||&#8221; is concatenation.</p>
<p>A few examples of cryptographic hash functions:</p>
<ul>
<li>MD2, MD4, MD5 (Designed by Ron Rivest)</li>
<li>Whirlpool (Designed by Vincent Rijmen, one of the makers of AES, and Paulo Barreto)</li>
<li>SHA-0, SHA-1, SHA-224/256, SHA-384/512 (Designed by the NSA, published by NIST)</li>
<li>RIPEMD-160 (Designed by Dobbertin, Bosselaers, and Preneel)</li>
</ul>
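<p>The gap between the preimage and collision figures above can be seen on a deliberately weakened hash. The sketch below is an illustration only: it assumes Python 3, and the helpers <em>truncated_hash</em> and <em>find_collision</em> are made-up names. It keeps just the first 24 bits of SHA-1, so a collision should turn up after a few thousand attempts (on the order of 2<sup>12</sup>) rather than the 2<sup>24</sup> a preimage search would need.</p>

```python
import hashlib

def truncated_hash(message, n_bits):
    """Keep only the first n_bits of SHA-1, to get a deliberately weak hash."""
    digest = int.from_bytes(hashlib.sha1(message).digest(), "big")
    return digest >> (160 - n_bits)

def find_collision(n_bits):
    """Hash distinct messages until two of them share a truncated digest."""
    seen = {}
    counter = 0
    while True:
        message = str(counter).encode("ascii")
        digest = truncated_hash(message, n_bits)
        if digest in seen:
            # Two different messages, same digest: a collision.
            return seen[digest], message, counter + 1
        seen[digest] = message
        counter += 1

first, second, attempts = find_collision(24)
print(first, second, attempts)
```

<p>Raising the bit count by two roughly doubles the number of attempts, which is the square-root behaviour the collision bound describes.</p>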
<p>It should be noted that the MD series is currently considered broken, and it is now considered &#8220;practical&#8221; to find collisions in SHA-0 with beefy resources. SHA-1 has a theoretical collision attack, published by a group of Chinese researchers (Wang, Yin, Yu).</p>
<h2>Better Data Storage on Worker Machines</h2>
<p>I will use Python 2.3&#8217;s built-in <em>sha</em> module, which implements SHA-1. &#8220;But Jake,&#8221; I hear you say, &#8220;you said that it was insecure! Someone might alter one of your messages!&#8221; Let&#8217;s look at the tradeoff here: the Wang, Yin, Yu attack theoretically takes about 2<sup>69</sup> operations to find a collision, which is well below the 2<sup>80</sup> that a 160-bit digest is supposed to provide. A determined attacker with a large cluster is not out of the realm of possibility.</p>
<p>However, all of my data results are going to exist on local machines for a few hours, max, before the computation completes and work is sent home. Even if someone wanted to devote the time and energy to find a collision, they would be altering a single data point, when there could potentially be millions. The attackers would either be creating a completely bad data point that can be sanity checked, or they will be spending all of the computation power to throw off my result by a fraction. Needless to say, I&#8217;m more worried about losing large sets of data than an attacker adding random messages.</p>
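<p>I will be using a method of hashing called &#8220;salting&#8221;, where an extra value is mixed into the hash input to help further disguise the message. This is a common method for storing passwords. Here the salt has to stay secret, so it is kept only in the places I am treating as secure (memory and my area of the shared drive); an attacker who edits a line cannot compute the matching digest without it. As long as the salt isn&#8217;t short, the attacker should be unable to guess it or create a collision in reasonable time. I will be using a large &#8220;nonce&#8221; (or <strong>n</strong>umber used <strong>once</strong>).</p>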
<p><strong>Note:</strong> I could use encryption for this task, especially with the freely available <em>crypto</em> libraries for Python, but the headache of getting them installed on all of the computers is not what my research is all about. Since it is irrelevant if attackers read the information, I think that it can be shown that encryption and my scheme are both equivalent to finding the encryption key / hash salt of these one-way functions. If you don&#8217;t agree, I&#8217;d really like to know why.</p>

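<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;"># Python 2 only: the sha module implements SHA-1.</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sha</span>

<span style="color: #ff7700;font-weight:bold;">def</span> do_work<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>: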
    <span style="color: #808080; font-style: italic;"># The only commas are guaranteed to be separators.</span>
    <span style="color: #808080; font-style: italic;"># The work splitting is simplified for this example.</span>
    todo = <span style="color: #008000;">self</span>.<span style="color: black;">work</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">','</span><span style="color: black;">&#41;</span>
    temp_file = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">tmpfile</span>, <span style="color: #483d8b;">'w'</span><span style="color: black;">&#41;</span>
    i = <span style="color: #ff4500;">0</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> item <span style="color: #ff7700;font-weight:bold;">in</span> todo:
        result = <span style="color: #008000;">str</span><span style="color: black;">&#40;</span>mapreduce.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>item<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        sha_obj = <span style="color: #dc143c;">sha</span>.<span style="color: #dc143c;">new</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">salt</span>+result+<span style="color: #008000;">str</span><span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        t = <span style="color: #008000;">str</span><span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span>+<span style="color: #483d8b;">','</span>+result+<span style="color: #483d8b;">','</span>+sha_obj.<span style="color: black;">hexdigest</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
        temp_file.<span style="color: black;">write</span><span style="color: black;">&#40;</span>t + <span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
        i+=<span style="color: #ff4500;">1</span>
    temp_file.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>

<p>The validation code is also simple:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">def</span> extract_value<span style="color: black;">&#40;</span>line, salt<span style="color: black;">&#41;</span>:
    values = line.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">','</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>values<span style="color: black;">&#41;</span> <span style="color: #66cc66;">!</span>= <span style="color: #ff4500;">3</span>:
        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #ff4500;">0</span>
    t = salt+values<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>+values<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    sha_obj = <span style="color: #dc143c;">sha</span>.<span style="color: #dc143c;">new</span><span style="color: black;">&#40;</span>t<span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">if</span> sha_obj.<span style="color: black;">hexdigest</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #66cc66;">!</span>= values<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;Hax&quot;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #ff4500;">0</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> values<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span></pre></div></div>
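<p>One possible refinement, offered as a sketch rather than as what this project actually runs: the scheme above hashes the secret salt as a prefix of the data, and SHA-1 on its own does not provide the message extension resistance listed earlier. HMAC is the standard construction for keyed digests, and as far as I know the <em>hmac</em> module has shipped with Python since 2.2. The version below assumes Python 3 and <em>hashlib</em>; the key value and the <em>sign_line</em> helper are made up for the example, and it returns <em>None</em> instead of <em>0</em> on failure so that a rejected line cannot be mistaken for a result.</p>

```python
import hashlib
import hmac

def sign_line(key, index, result):
    """Build 'index,result,tag', where tag is an HMAC over index and result."""
    body = "%d,%s" % (index, result)
    tag = hmac.new(key, body.encode("utf-8"), hashlib.sha1).hexdigest()
    return body + "," + tag

def extract_value(line, key):
    """Return the result string if the tag verifies, otherwise None."""
    values = line.strip().split(",")
    if len(values) != 3:
        return None
    body = values[0] + "," + values[1]
    expected = hmac.new(key, body.encode("utf-8"), hashlib.sha1).hexdigest()
    # Compare as bytes, in constant time, so odd characters cannot raise.
    if not hmac.compare_digest(expected.encode("ascii"), values[2].encode("utf-8")):
        return None
    return values[1]

# Hypothetical key; in practice it must never be written to the worker's disk.
key = b"hypothetical secret key"
line = sign_line(key, 0, "42")
tampered = "0,43," + line.split(",")[2]

print(extract_value(line, key))      # the untouched line verifies
print(extract_value(tampered, key))  # the altered line is rejected
```

<p>The separator between the index and the result is part of what gets signed, so a result of &#8220;1&#8221; at index 23 and a result of &#8220;12&#8221; at index 3 no longer produce the same input.</p>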

<h2>What’s next?</h2>
<p>Next week we will look into network transmissions of the data. Yes, that post is likely to be a lot like this one. Despite my wishes to the contrary, I need to spend more time researching than blogging.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jakevoytko.com/blog/2008/02/24/macgyvers-mapreduce-in-python-part-2-local-security/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MacGyver&#8217;s MapReduce in Python! Part 1: Theory</title>
		<link>http://www.jakevoytko.com/blog/2008/02/17/macgyvers-mapreduce-in-python-part-1-theory/</link>
		<comments>http://www.jakevoytko.com/blog/2008/02/17/macgyvers-mapreduce-in-python-part-1-theory/#comments</comments>
		<pubDate>Sun, 17 Feb 2008 19:21:22 +0000</pubDate>
		<dc:creator>Jake</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[Macgyver's MapReduce]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Security]]></category>

		<guid isPermaLink="false">http://www.jakevoytko.com/blog/2008/02/17/macgyvers-mapreduce-in-python-part-1-theory/</guid>
		<description><![CDATA[Scope of This Article Beginner / Intermediate. Assumes some knowledge of basic computer science. If you&#8217;re looking for some serious meat and potatoes, come back next week. If you&#8217;re looking for an article that isn&#8217;t about MapReduce because you&#8217;re sick of hearing about it, come back next month. Introductory B.S. TCNJ has about 20 Linux [...]]]></description>
			<content:encoded><![CDATA[<h2>Scope of This Article</h2>
<p>Beginner / Intermediate. Assumes some knowledge of basic computer science. If you&#8217;re looking for some serious meat and potatoes, come back next week. If you&#8217;re looking for an article that isn&#8217;t about <em>MapReduce </em>because you&#8217;re sick of hearing about it, come back next month.</p>
<h2>Introductory B.S.</h2>
<p>TCNJ has about 20 Linux and 30 Unix machines in computer labs, but a HUGE chunk of their time is spent idling. The labs, for the most part, are pretty sparsely used.</p>
<p>I&#8217;ve been looking for a good professor-excused reason to do some major cluster computing, and finally my time has come! Another student and I are doing cryptography research to design a <a href="http://csrc.nist.gov/groups/ST/hash/sha-3/index.html">new hash algorithm</a> (blog post to follow after Third Quarter 2008!).</p>
<p>Guess what cryptographic hash algorithms need? Lots and lots and lots of statistical testing of millions of separate hashes. It would take my laptop an unfortunate amount of time to process dozens of tests with millions of records even if we <em>could</em> parse 100 MB/s of data with our design. God help us when we want to change something: all of the testing needs to begin all over again. Yuck!</p>
<p>This situation sounds ideal for <em><a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a></em>. Why should I use my own machine when I can make other machines do it? I am a geek, purveyor of the computer lab, master and commander of all that lay before me. Let&#8217;s put those things to work!</p>
<p>While I&#8217;m on the &#8220;Looking For An Excuse&#8221; train, I&#8217;ve played around with Python extensively in the past and never produced anything substantial. I&#8217;ve always been impressed with the speed with which I&#8217;ve been able to make things in Python, and in this situation, &#8220;Batteries Included&#8221; will help me immensely.</p>
<h2>So what&#8217;s the catch?</h2>
<p>I have absolutely no control over what software is on any of the machines. I get to decide between Java 1.5, Python 2.4, and gcc/g++ 3.4.1. I&#8217;m not allowed to run as root on any of the machines. Any task needs to be run as the user, so I need to be logged in to the machines when the commands are run (if I&#8217;m not right about that, I&#8217;d love to hear it, but every way I&#8217;ve found so far to run a job in the future requires the user to be logged in).</p>
<p>The consolation prize is that they are all pretty beefy, with decent hard drive space (most empty!), 2GB of memory, and dual-core AMD processors.</p>
<h2>System Architecture And Design</h2>
<p>The machines are behind a central server (<em>named &#8220;TomHanks&#8221; for the sake of discussion&#8230; this is not the actual name of the computer</em>) that also doesn&#8217;t do much. The machines don&#8217;t have any contact with the outside world except through <em>TomHanks</em>. They are not all necessarily in the same room, and they are not necessarily running the same operating system.</p>
<p><a title="networkarch1.png" rel="attachment wp-att-53" href="http://www.jakevoytko.com/blog/2008/02/17/macgyvers-mapreduce-in-python-part-1-theory/attachment/53/"><img src="http://www.jakevoytko.com/blog/wp-content/uploads/2008/02/networkarch1.png" alt="networkarch1.png" /></a></p>
<p>All of the computers are on the same architecture and have access to the same shared drive. My space on the shared drive is tiny, though, which makes any kind of shared storage impractical in the general case. The shared drive does provide interesting opportunities, as my school-given website is accessible through it, making report generation a possibility. On the other hand, the important part of my semester is the research I&#8217;m doing, not the actual MapReduce client, so I&#8217;m going to try to keep it as simple as possible.</p>
<h2>Abstract Design of <em>MapReduce</em></h2>
<p>The idea of <em>MapReduce</em> is beautifully simple, and takes simple ideas from functional paradigms and applies them to the world of distributed computing. Let&#8217;s say that we have a list of data that we want to hash. Picking an arbitrary list, we&#8217;ll use<em> [1, 2, 3, 4, ... n].</em></p>
<p>First, we apply the <em>map()</em> function to every single value of the array. Assuming our function is named <em>f</em>, our result is <em>[f(1), f(2), f(3), ..., f(n)]</em>.</p>
<blockquote><p><strong>Aside: What is &#8220;map&#8221;?<br />
</strong></p>
<p>When we &#8220;map&#8221; something, all we are doing is applying the same function to every element of a list, and storing the results in another list. For example, if we have the list <em>[1,2,3,4]</em>, and we apply the function<em> f(x)=2x </em>to this list, we get the list<em> [2,4,6,8]</em>.</p>
<p>In this example, it&#8217;s obvious how individual elements were transformed, but that is not always the case.</p>
<p>What are the constraints on the mappings? For my purpose, it works out best if the map is not always one-to-one, meaning that each time we evaluate the function, we can choose to return nothing, one value, or more values. It *can* be one-to-one, and for some tasks, it is necessary.</p></blockquote>
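<p>The &#8220;map&#8221; aside can be written out in a few lines of Python. This is a sketch in Python 3, where <em>map()</em> returns an iterator and needs <em>list()</em> around it; the function names <em>double</em> and <em>odd_and_square</em> are made up for the example. The second function shows a mapping that is not one-to-one.</p>

```python
def double(x):
    """The example function f(x) = 2x from the aside."""
    return 2 * x

def odd_and_square(x):
    """Not one-to-one: emit nothing for even x, two values for odd x."""
    if x % 2 == 0:
        return []
    return [x, x * x]

values = [1, 2, 3, 4]

doubled = list(map(double, values))
print(doubled)  # [2, 4, 6, 8]

# Flatten the per-item output lists into a single result list.
flattened = [out for x in values for out in odd_and_square(x)]
print(flattened)  # [1, 1, 3, 9]
```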
<p>Then, we take the resulting array and evaluate a &#8220;reduce&#8221; function.</p>
<blockquote><p><strong>Aside: What is &#8220;reduce&#8221;?<br />
</strong><br />
&#8220;Reduce&#8221; is perhaps the least obvious name to pick. Other names for this idea are &#8220;fold&#8221; and &#8220;accumulate&#8221;.</p>
<p>When we &#8220;reduce&#8221;, we are just combining all of the values in a list to a single value by applying some function.</p>
<p>How do we reduce the list <em>[1,2,3,4]</em> using addition? <em>1+2+3+4 = 10</em>. We could also have converted it into a CSV string: <em>&#8220;1,2,3,4&#8221;</em> or <em>&#8220;4,3,2,1&#8221;</em>. There are, of course, subtleties that I am skipping. If you would like more of them, please go <a href="http://en.wikipedia.org/wiki/Reduce_%28higher-order_function%29">here</a>.</p></blockquote>
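<p>The same three reductions, sketched in Python 3. Note the assumption: <em>reduce</em> lives in <em>functools</em> in modern Python, whereas it was a builtin on the Python 2.3 and 2.4 machines discussed here.</p>

```python
from functools import reduce

values = [1, 2, 3, 4]

# Reduce with addition: ((1 + 2) + 3) + 4.
total = reduce(lambda acc, x: acc + x, values)
print(total)  # 10

# Reduce to a CSV string, appending each new item on the right.
csv_forward = reduce(lambda acc, x: acc + "," + str(x), values[1:], str(values[0]))
print(csv_forward)  # 1,2,3,4

# Same list, but each new item goes on the left, reversing the order.
csv_backward = reduce(lambda acc, x: str(x) + "," + acc, values[1:], str(values[0]))
print(csv_backward)  # 4,3,2,1
```

<p>The two CSV results differ only in how the reducing function combines the accumulator with the next item, which is one of the subtleties: the order of combination matters unless the function is commutative and associative.</p>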
<p>And there you have it, <em>MapReduce</em>!</p>
<h2>That&#8217;s It?</h2>
<p>As you would expect, we need to make alterations (either above the hood or under the hood) in order to better apply this to the real world.</p>
<p>For our real model, we will borrow from Google&#8217;s design. Before I read <a href="http://labs.google.com/papers/mapreduce.html">Google&#8217;s paper on MapReduce</a>, I drew out how I thought that the model should work, and what I called <em>PreReduce</em> they called a <em>Combiner</em>, which I like better. It allows us, in certain cases, to do some of the heavy lifting of the <em>Reduce</em> on the satellite machines instead of on the central server.</p>
<p>For example, if we were trying to add the numbers from 1 to 100 trillion, why would we bother passing back the numbers <em>[1, 2, 3, 4, ... n]</em> for the host machine to add together? Addition of integers is commutative and associative, so we can add them on the satellite machine and then send back one large integer instead of an arbitrary number of small integers.</p>
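<p>A scaled-down sketch of that idea, summing 1 to 100 instead of 1 to 100 trillion. It assumes Python 3, and the helper names <em>combine</em> and <em>split_work</em> and the work-unit size of 30 are arbitrary choices for the example.</p>

```python
from functools import reduce

def combine(chunk):
    """Runs on a satellite: collapse one work unit into a partial sum."""
    return sum(chunk)

def split_work(numbers, chunk_size):
    """Cut the input into work units of at most chunk_size items."""
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]

numbers = list(range(1, 101))
work_units = split_work(numbers, 30)   # four units: 30, 30, 30 and 10 items

# Each satellite sends home one integer instead of its whole work unit.
partial_sums = [combine(unit) for unit in work_units]
print(partial_sums)  # [465, 1365, 2265, 955]

# The host machine only has to reduce the partial sums.
grand_total = reduce(lambda a, b: a + b, partial_sums)
print(grand_total)  # 5050
```

<p>This only works because integer addition is commutative and associative; a reduce function without those properties cannot be split up this freely.</p>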
<h2>What is our high-level algorithm?</h2>
<p>Ideally, what do we want a client to do?</p>
<ul>
<li>Get Work</li>
<li>Map()</li>
<li>Combine()</li>
<li>Return Work</li>
<li>Reply to &#8220;Are you alive?&#8221; pings.</li>
<li>Securely send results (as I&#8217;m sure my friends will try to send LoLcat messages to home base).</li>
</ul>
<p>What do we want a server to do?</p>
<ul>
<li>Start all satellite machines.</li>
<li>Respond to requests for data.</li>
<li>Perform <em>Reduce</em></li>
<li>Occasionally check to see which machines are alive.</li>
<li>Generate progress reports (optional).</li>
<li>Track responses and keep a &#8220;Shit List&#8221; of computers that give a few bad responses (throwing the computer in jail for a few minutes and then trusting it again is probably the most practical).</li>
</ul>
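<p>The last server item, jailing a misbehaving computer for a few minutes, could look something like the sketch below. Everything in it is an assumption made for illustration: the class name <em>SuspectList</em>, the strike limit, the jail time, and the machine name <em>lab-07</em>. The current time is passed in explicitly so the behaviour is easy to check.</p>

```python
class SuspectList:
    """Track bad responses per machine and jail repeat offenders for a while."""

    def __init__(self, max_strikes=3, jail_seconds=300):
        self.max_strikes = max_strikes
        self.jail_seconds = jail_seconds
        self.strikes = {}       # machine name -> bad responses since last jailing
        self.jailed_until = {}  # machine name -> time at which trust resumes

    def report_bad(self, machine, now):
        """Record one bad response; jail the machine once it hits the limit."""
        self.strikes[machine] = self.strikes.get(machine, 0) + 1
        if self.strikes[machine] >= self.max_strikes:
            self.jailed_until[machine] = now + self.jail_seconds
            self.strikes[machine] = 0

    def is_trusted(self, machine, now):
        """A machine is trusted unless it is still serving jail time."""
        return now >= self.jailed_until.get(machine, now)

suspects = SuspectList(max_strikes=2, jail_seconds=60)
suspects.report_bad("lab-07", now=0)
suspects.report_bad("lab-07", now=5)            # second strike: jailed until t=65
print(suspects.is_trusted("lab-07", now=30))    # False
print(suspects.is_trusted("lab-07", now=65))    # True
```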
<p>So what can we do at a high level to get all of this rolling?</p>
<p><strong>The <em>MapReduce</em> Process:</strong></p>
<ol>
<li><strong>Start a central server that will dole out work tasks</strong>.</li>
<li><strong>The central server starts the satellite machines</strong>. Fortunately, all of the machines share a network drive, so this is a relatively trivial task involving a little SSH action.</li>
<li><strong>The satellite machines phone home for work</strong>. Why do we do it this way instead of letting the central server give work with the start commands? In short, a friend of mine thinks it&#8217;s hysterical to fork-bomb a machine every now and then (and sometimes war erupts during class), so I REALLY can&#8217;t guarantee that ANY of the machines will be working at any given point. If the machines phone home for work, we are more likely to have work units parceled out sooner.</li>
<li><strong>The satellite machines perform map() on every input value. </strong>It is obvious that the satellite machines should be doing the heavy lifting of the map() function. There is no law that says that the resulting map has to be stored in one place, and in this case, it will be stored in 50 places for the time being.</li>
<li><strong>The satellite machines perform combine() on the mapped values.</strong> As mentioned above, this lets us partition off some of the work of the <em>Reduce</em> function to the satellite machines.</li>
<li><strong>The satellite machines return the results to the central machine.</strong></li>
<li><strong>The central machine pings satellite machines that are long in responding.</strong></li>
<li><strong>Repeat Steps 3-6 until all work units are done.</strong></li>
<li><strong>The central server performs <em>Reduce.</em></strong></li>
</ol>
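<p>The core of the process above can be simulated in a single Python process. This is only a sketch under heavy assumptions: it uses Python 3, the &#8220;satellites&#8221; are plain function calls rather than machines reached over SSH, and starting machines, pinging, and failure handling are left out. The names <em>map_fn</em>, <em>combine</em>, <em>satellite</em>, and <em>run_job</em> are made up, and squaring stands in for the real hashing work.</p>

```python
from collections import deque
from functools import reduce

def map_fn(x):
    """Stand-in for the real per-item work; here, just square the input."""
    return x * x

def combine(mapped):
    """Satellite-side partial reduce: sum one work unit's mapped values."""
    return sum(mapped)

def satellite(work_queue):
    """Phone home for one work unit, then map and combine it (steps 3 to 5)."""
    unit = work_queue.popleft()
    return combine([map_fn(x) for x in unit])

def run_job(data, unit_size, n_satellites):
    """Central server: dole out units, collect partial results, then reduce."""
    work_queue = deque(data[i:i + unit_size] for i in range(0, len(data), unit_size))
    partials = []
    while work_queue:                                   # step 8: until all units are done
        for _ in range(n_satellites):
            if not work_queue:
                break
            partials.append(satellite(work_queue))      # step 6: results come home
    return reduce(lambda a, b: a + b, partials, 0)      # step 9: final Reduce

print(run_job(list(range(1, 11)), unit_size=3, n_satellites=2))  # 385
```

<p>Because the satellites pull work from the queue rather than being assigned it up front, a machine that never asks simply never receives a unit, which is the point of having them phone home.</p>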
<h2>What&#8217;s next?</h2>
<p>Next week we will look into how we will organize the SSH connections, how we will log in to the machines, and how we will trust &#8220;secure&#8221; answers when the satellites don&#8217;t have any kind of built-in cryptography.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jakevoytko.com/blog/2008/02/17/macgyvers-mapreduce-in-python-part-1-theory/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
