MacGyver’s MapReduce in Python! Part 2: Local Security

This post was written by Jake on February 24, 2008
Posted Under: Computer Science,Programming,Software Development

Scope of the Article

Intermediate. This is a security discussion, but Python will be used. A basic overview of cryptographic hash functions will also be given.

Note that the lowest version of Python is 2.3 (and I have no power to change this), so I may be restricted from performing currently optimal solutions. Feel free to make suggestions anyways.

For a refresher on the MapReduce process, check out Part 1.

What I’m Up Against Storing Data Locally

Worker computers will be attacked. The people who I share the lab with are my friends, and they will therefore try to do everything they can to mess my day up. Forkbombs and unplugging are probably not out of the question. My friends skills run the gamut from *nix familiarity to mastery of college-level undergrad mathematics and cryptography courses.

Attackers have physical access to the machines. All of the machines are in computer labs. Enough said.

Very small shared drive between computers. The shared drive is large enough to store the final results, but not intermediate calculations.

All computers have different environments, almost none of which I have control over. My personal laptop is an Intel Core 2 Duo with 1.6mHz, and I do have full control of what software it runs. The head machine in the network is a Blade server running Solaris, and it has Python 2.4. The worker machines are Linux machines running Python 2.3.

For the purposes of this exercise, I am considering my own area of the shared network drive secure. Anything that is not taking place within the computer’s memory and within my own shared network drive will be considered “secure”, the truth be damned.

Storing Data on Worker Machines

As I mentioned above, there is not enough room for me to store intermediate data in the shared network drive, so I need to take advantage of the storage capacity of the worker machines. All of the worker machines are sitting right in a computer lab. These labs see mild-to-moderate use every day, so I can’t just use a live CD to start a distributed computing program.

Since the local drives of the machines are open for general use, I need to make a strategy to protect my data. First, let’s look at the naive way to store MapReduce data:

def do_work(self):
    # The only commas are guaranteed to be separators.
    # The work splitting is simplified for this example.
    todo = self.work.split(',')
    temp_file = open(self.tmpfile, 'w')
    for item in todo:
         result = str(mapreduce.map(item))
         temp_file.write(result + '\n')
    temp_file.close()

This gets the job done, but this could give me some serious headaches. What’s to stop someone from potentially altering, inserting, or erasing messages (permissions aside)? We need something called Cryptographic Hash Functions for added security.

Cryptographic Hash Functions

A hash function, in general, takes an arbitrary length message and converts it to a digest of a fixed size. There is no guarantee as to whether or not an attacker can determine the original message, or create a second message with the same digest.

A Cryptographic Hash Function is a hash function with the following properties:

Preimage Resistance: Let hash(message_a) = digest_a. If the attacker is given digest_a, can message_a be determined? If a hash digest is n bits, the preimage resistance is expected to be 2^{n}.

Second Preimage Resistance: If the attacker has a message, message_a, it should be hard to find message_b so that hash(message_a)=hash(message_b). If a hash digest is n bits, the preimage resistance is expected to be 2^{n}.

Collision Resistance: It should be hard for the attacker to make message_a and message_b so that hash(message_a) = hash(message_b). This is expected to have security of 2^{n/2}.

Message extension resistance: It should be hard for the attacker, when given hash(message_a), to easily calculate any possible extension hash(message_a || message_b), where “||” is concatenation.
A few examples of cryptographic hash functions:

  • MD2, MD4, MD5 (Designed by Ron Rivest)
  • Whirlpool (Designed by the makers of AES)
  • SHA-0, SHA-1, SHA-224/256, SHA-384/512 (Designed by NiST)
  • Elliptic Curve Cryptography (Designed by Lenstra, et. al)

It should be noted that the MD series is currently considered broken, it is now considered “practical” to find collisions of SHA-0 with beefy resources. SHA-1 has a theoretical collision attack by a group of Chinese researchers (Wang, Yin, Yu).

Better Data Storage on Worker Machines

I will use Python 2.3′s built-in SHA-0 digest. “But Jake,” I hear you say, “you said that it was insecure! Someone might alter one of your messages!” Let’s look at the tradeoff here: The Wang, Yin, Yu attack takes theoretically 2^{33} operations to complete on average. It’s not out of the realm of possibility to find a collision on an overpowered home computer in a few weeks, or on a cluster within a few days.

However, all of my data results are going to exist on local machines for a few hours, max, before the computation completes and work is sent home. Even if someone wanted to devote the time and energy to find a collision, they would be altering a single data point, when there could potentially be millions. The attackers would either be creating a completely bad data point that can be sanity checked, or they will be spending all of the computation power to throw off my result by a fraction. Needless to say, I’m more worried about losing large sets of data than an attacker adding random messages.

I will be using a method of hashing called “salting” where an extra value is added to the hash to help further disguise the message. This is a common method for storing passwords. As long as the salt isn’t short, the attacker should be unable to create a collision in reasonable time. I will be using a large “nonce”, (or number used once).

Note: I could use encryption for this task, especially with the freely available crypto libraries for Python, but the headache of getting them installed on all of the computers is not what my research is all about. Since it is irrelevant if attackers read the information, I think that it can be shown that encryption and my scheme are both equivalent to finding the encryption key / hash salt of these one-way functions. If you don’t agree, I’d really like to know why.

def do_work(self):
    # The only commas are guaranteed to be separators.
    # The work splitting is simplified for this example.
    todo = self.work.split(',')
    temp_file = open(self.tmpfile, 'w')
    i = 0
    for item in todo:
         result = str(mapreduce.map(item))
         sha_obj = sha.new(self.salt+result+str(i))
         t = str(i)+','+result+','+sha_obj.hexdigest()
         temp_file.write(t + '\n')
         i+=1
    temp_file.close()

The validation code is also simple:

def extract_value(line, salt):
    values = line.split(',')
    if len(values) != 3:
        return 0
    t = salt+values[1].strip()+values[0].strip()
    sha_obj = sha.new(t)
    if sha_obj.hexdigest() != values[2].strip():
        print "Hax"
        return 0
    return values[1]

What’s next?

Next week we will look into network transmissions of the data. Yes, that post is likely to be a lot like this one. Despite my wishes to the contrary, I need to spend more time researching than blogging.

Popularity: 18% [?]

Add a Comment

required, use real name
required, will not be published
optional, your blog address