<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>So Jake Says &#187; String Searching</title>
	<atom:link href="http://www.jakevoytko.com/blog/tag/string-searching/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.jakevoytko.com/blog</link>
	<description>Ye Olde Computer Science Blogge</description>
	<lastBuildDate>Sun, 17 Jan 2010 15:16:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Fun With String Searching</title>
		<link>http://www.jakevoytko.com/blog/2007/12/11/fun-with-string-searching/</link>
		<comments>http://www.jakevoytko.com/blog/2007/12/11/fun-with-string-searching/#comments</comments>
		<pubDate>Wed, 12 Dec 2007 05:41:53 +0000</pubDate>
		<dc:creator>Jake</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[Boyer-Moore]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[KMP]]></category>
		<category><![CDATA[Knuth]]></category>
		<category><![CDATA[Morris]]></category>
		<category><![CDATA[Pratt]]></category>
		<category><![CDATA[Rabin-Karp]]></category>
		<category><![CDATA[String Searching]]></category>

		<guid isPermaLink="false">http://www.jakevoytko.com/blog/2007/12/11/fun-with-string-searching/</guid>
		<description><![CDATA[Source of the program used. Almost every major program that works with text performs string searches. There are some juicy algorithms in this area, each with their own tradeoffs. Choosing the wrong algorithm for the wrong task produces awful results, and we need to carefully weigh the consequences of each algorithm. I&#8217;m going to take [...]]]></description>
			<content:encoded><![CDATA[<p><em><a title="Zip file of source" href="http://www.jakevoytko.com/blog/wp-content/uploads/2007/12/stringsearch.zip">Source</a> of the program used.</em></p>
<p>Almost every major program that works with text performs string searches. There are some juicy algorithms in this area, each with its own tradeoffs. Choosing the wrong algorithm for a task produces awful results, so we need to carefully weigh the consequences of each one. I&#8217;m going to take a look at just three of them: brute force, Boyer-Moore, and Rabin-Karp.</p>
<p>If you want to cut straight to the chase, feel free to scroll to the bottom to see the results section.</p>
<p>Why these three? Each of these algorithms offers a completely different perspective on the idea of finding strings. There are other algorithms that perform specialized tasks (Knuth-Morris-Pratt for repetitive source strings, for example), but we&#8217;re not interested in these today.</p>
<h3>Brute Force: Simple, and (Sometimes) Slow</h3>
<p>As we see in the image below, the brute force algorithm tries to generate a match from every single starting position.</p>
<p>It should be obvious to the reader that the algorithm can&#8217;t possibly miss a match, as every single position is tried. If the pattern exists in the text, brute force search will find it.</p>
<p><em>Note:  Green represents a single character match, and Red represents a single character mismatch.</em><br />
<a title="Brute Force image" href="http://www.jakevoytko.com/blog/wp-content/uploads/2007/12/bruteforcestring.png"><img src="http://www.jakevoytko.com/blog/wp-content/uploads/2007/12/bruteforcestring.png" alt="Brute Force image" /></a></p>
<p>The string &#8220;This is some Test text&#8221; isn&#8217;t very tough on the algorithm, as there is only one mismatch before we find the correct position of the word &#8220;Text&#8221;. However, it is very easy to construct pathological cases for this algorithm. It doesn&#8217;t perform well on binary data or on any highly repetitive pattern. A pattern made of a long run of <em>A</em>s followed by a single <em>B</em> is a good example: if we look for <em>AAA..AAB</em> in the string <em>AAA&#8230;&#8230;&#8230;.AAAB</em>, the algorithm matches nearly the whole pattern at every single position before hitting the mismatch, so it is taken down a costly blind alley each time.</p>
<p><strong>Implementation in C++</strong></p>
<p>To check to see if we have a match at any position, all we have to do is check to see if each character of the string is a match. If it is not, we can return <em>false</em>.</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">bool</span> brute_force_match<span style="color: #008000;">&#40;</span><span style="color: #0000ff;">const</span> string<span style="color: #000040;">&amp;</span> text,
                       <span style="color: #0000ff;">const</span> string<span style="color: #000040;">&amp;</span> pattern, <span style="color: #0000ff;">int</span> pos<span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    <span style="color: #666666;">// If the source text is shorter than the</span>
    <span style="color: #666666;">// pattern + the position, we can't possibly</span>
    <span style="color: #666666;">// have a match.</span>
    <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span>pos <span style="color: #000040;">+</span> pattern.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">&gt;</span> text.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span>
        <span style="color: #0000ff;">return</span> <span style="color: #0000ff;">false</span><span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #0000ff;">int</span> pattern_size <span style="color: #000080;">=</span> pattern.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #666666;">// Check every single character for a mismatch.</span>
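    <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> i<span style="color: #000080;">=</span><span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> i<span style="color: #000040;">&lt;</span>pattern_size<span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i<span style="color: #008000;">&#41;</span>
    <span style="color: #008000;">&#123;</span>
        <span style="color: #666666;">// The feed cuts the listing off here; the rest</span>
        <span style="color: #666666;">// is assumed from the description above.</span>
        <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span>text<span style="color: #008000;">&#91;</span>pos <span style="color: #000040;">+</span> i<span style="color: #008000;">&#93;</span> <span style="color: #000040;">!</span><span style="color: #000080;">=</span> pattern<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span>
            <span style="color: #0000ff;">return</span> <span style="color: #0000ff;">false</span><span style="color: #008080;">;</span>
    <span style="color: #008000;">&#125;</span>
&nbsp;
    <span style="color: #0000ff;">return</span> <span style="color: #0000ff;">true</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>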

<p>To search every position, we use the following snippet:</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">int</span> brute_force<span style="color: #008000;">&#40;</span><span style="color: #0000ff;">const</span> string<span style="color: #000040;">&amp;</span> text, <span style="color: #0000ff;">const</span> string<span style="color: #000040;">&amp;</span> pattern<span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    <span style="color: #0000ff;">int</span> search_size <span style="color: #000080;">=</span> text.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">-</span> pattern.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> i<span style="color: #000080;">=</span><span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> i <span style="color: #000040;">&lt;</span><span style="color: #000080;">=</span> search_size<span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i<span style="color: #008000;">&#41;</span>
    <span style="color: #008000;">&#123;</span>
        <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span>brute_force_match<span style="color: #008000;">&#40;</span>text, pattern, i<span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span>
        <span style="color: #008000;">&#123;</span>
            <span style="color: #0000ff;">return</span> i<span style="color: #008080;">;</span>
        <span style="color: #008000;">&#125;</span>
    <span style="color: #008000;">&#125;</span>
    <span style="color: #0000ff;">return</span> <span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

<p>Brutally simple!</p>
<p><strong>Complexity and Potential Problems</strong></p>
<p>In certain document types, brute force is a valid solution for finding a string. Looking for strings in natural language files is still somewhat fast, as the algorithm is likely to run into mismatches pretty quickly. There are better algorithms (like Boyer-Moore, described below), but the short implementation time of brute force is not to be taken lightly.</p>
<p>When we don&#8217;t have a source string with a large number of &#8220;false starts&#8221;, we can expect the runtime of the algorithm to be roughly linear with the size of the file. It will have a runtime of roughly <em>M+N</em>, for <em>M</em> = text size, and <em>N</em> = pattern size. However, for searching in binary strings, or in strings with many repetitions of close matches to our pattern, the runtime is closer to <em>M*N</em>.</p>
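<p>To put rough numbers on the two cases, here is a minimal sketch that counts character comparisons instead of measuring time. <code>count_brute_force_comparisons</code> is a hypothetical helper written for this illustration; it is not part of the downloadable program.</p>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original program): counts how many
// character comparisons a brute-force search makes before it finds the
// pattern or runs out of starting positions.
std::size_t count_brute_force_comparisons(const std::string& text,
                                          const std::string& pattern)
{
    std::size_t comparisons = 0;
    if (pattern.size() > text.size())
        return comparisons;  // the pattern cannot fit anywhere

    for (std::size_t pos = 0; pos + pattern.size() <= text.size(); ++pos)
    {
        std::size_t i = 0;
        while (i < pattern.size())
        {
            ++comparisons;
            if (text[pos + i] != pattern[i])
                break;  // mismatch: give up on this starting position
            ++i;
        }
        if (i == pattern.size())
            return comparisons;  // every character matched
    }
    return comparisons;
}
```

<p>With this counter, finding &#8220;Test&#8221; in &#8220;This is some Test text&#8221; takes 18 comparisons for 22 characters of text. Finding &#8220;AAAB&#8221; in twenty <em>A</em>s followed by a <em>B</em> takes 72 comparisons for 21 characters: four at each of the 18 starting positions.</p>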
<p>It takes less than 1/10 of a second to find a small string at the end of &#8220;Moby Dick&#8221; (see below)! I wasn&#8217;t joking when I said that brute force is a fast algorithm for certain inputs. Treat this algorithm like Bubblesort: use it for the plainest and smallest inputs.</p>
<h3>Boyer-Moore: Right-to-Left Trickery</h3>
<p>One of the assumptions that we made for the brute force algorithm is the left-to-right search. The direction of the search isn&#8217;t a requirement. In fact, we can search through the text however we please!</p>
<p>Since we are given the whole pattern string, we have a tremendous amount of information at our disposal. We just need to know what to look for. For this algorithm, when we come across a mismatch, there might be information from the mismatch that helps us speed up the search.</p>
<p>Let&#8217;s see an example of how we can use right-to-left searching combined with mismatches to search faster:</p>
<p><em>Note:  Green represents a single character match, and Red represents a single character mismatch.</em><br />
<a title="Boyer-Moore img" href="http://www.jakevoytko.com/blog/wp-content/uploads/2007/12/boyermoorestring.png"><img src="http://www.jakevoytko.com/blog/wp-content/uploads/2007/12/boyermoorestring.png" alt="Boyer-Moore img" /></a></p>
<p>We see that on step 1, the &#8216;s&#8217; doesn&#8217;t match the &#8216;t&#8217;, so we slide the pattern forward 1. Please note that this will line the &#8216;s&#8217; in &#8220;Test&#8221; up with the &#8216;s&#8217; in &#8220;This&#8221;.</p>
<p>Step 2 demonstrates the magic of the algorithm: there isn&#8217;t a &lt;space&gt; at all in &#8220;Test&#8221;, so we can completely skip over the letter in the source text wholesale. We see this happen again in step 4 with the &#8216;o&#8217;. There is no &#8216;o&#8217; in &#8220;Test&#8221;, so why waste time trying to match it?</p>
<p>The Boyer-Moore algorithm takes advantage of these jumps. At each position we compare the pattern against the text from right to left, looking for the first mismatch. When we find one, we compute how far we need to slide the pattern ahead to line the mismatched text character up with a matching character in the pattern. A lookup table, built once from the pattern, turns that calculation into a single array access.</p>
<p><strong>Implementation in C++</strong></p>
<p>The lookup table for a pattern of <em>N</em> characters holds 256 * <em>N</em> entries for an 8-bit character set, which is cake on modern computers. For larger alphabets, such as Unicode, you need an associative data container to keep the table small enough to hold in memory.</p>
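<p>As a rough sketch of the associative-container idea, the snippet below stores only the characters that actually occur in the pattern. It is a simplification: it keeps a single last-occurrence table rather than one row per pattern position, and <code>last_occurrence</code> is a hypothetical name. It assumes C++11 for <code>std::u32string</code> and <code>std::unordered_map</code>.</p>

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: maps each character of the pattern to the index of
// its last occurrence. Characters missing from the map do not occur in the
// pattern at all, so the table grows with the pattern, not the alphabet.
std::unordered_map<char32_t, std::size_t>
last_occurrence(const std::u32string& pattern)
{
    std::unordered_map<char32_t, std::size_t> table;
    for (std::size_t i = 0; i < pattern.size(); ++i)
        table[pattern[i]] = i;  // later occurrences overwrite earlier ones
    return table;
}
```

<p>For the pattern &#8220;testing&#8221; the map holds six entries, one per distinct letter, no matter how large the alphabet is.</p>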
<p>The algorithm itself is very easy to follow:</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">int</span> boyer_moore<span style="color: #008000;">&#40;</span><span style="color: #0000ff;">const</span> string<span style="color: #000040;">&amp;</span> text, <span style="color: #0000ff;">const</span> string<span style="color: #000040;">&amp;</span> pattern<span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
  <span style="color: #0000ff;">int</span> search_size<span style="color: #000080;">=</span> text.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">-</span> pattern.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
  <span style="color: #0000ff;">int</span> other_ind, word_start<span style="color: #008080;">;</span>
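  vector<span style="color: #000040;">&lt;</span>vector<span style="color: #000040;">&lt;</span><span style="color: #0000ff;">int</span><span style="color: #000040;">&gt;</span> <span style="color: #000040;">&gt;</span> skip_table<span style="color: #008080;">;</span>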
&nbsp;
  initialize_skip<span style="color: #008000;">&#40;</span>skip_table, pattern<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
  <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span>word_start<span style="color: #000080;">=</span><span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> word_start<span style="color: #000040;">&lt;</span><span style="color: #000080;">=</span>search_size<span style="color: #008080;">;</span> <span style="color: #000040;">++</span>word_start<span style="color: #008000;">&#41;</span>
  <span style="color: #008000;">&#123;</span>
    <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> i<span style="color: #000080;">=</span>pattern.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008080;">;</span> i<span style="color: #000040;">&gt;</span><span style="color: #000080;">=</span><span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> <span style="color: #000040;">--</span>i<span style="color: #008000;">&#41;</span>
    <span style="color: #008000;">&#123;</span>
      <span style="color: #666666;">// If there is a mismatch</span>
      <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span>text<span style="color: #008000;">&#91;</span>word_start <span style="color: #000040;">+</span> i<span style="color: #008000;">&#93;</span> <span style="color: #000040;">!</span><span style="color: #000080;">=</span> pattern<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span>
      <span style="color: #008000;">&#123;</span>
        other_ind <span style="color: #000080;">=</span> <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#40;</span>text<span style="color: #008000;">&#91;</span>word_start <span style="color: #000040;">+</span>i<span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
        <span style="color: #666666;">// Skip forward the proper number of steps.</span>
        <span style="color: #666666;">// Look this up in the lookup table.</span>
        word_start <span style="color: #000040;">+</span><span style="color: #000080;">=</span> skip_table<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span><span style="color: #008000;">&#91;</span>other_ind<span style="color: #008000;">&#93;</span><span style="color: #008080;">;</span>
&nbsp;
        <span style="color: #0000ff;">break</span><span style="color: #008080;">;</span>
      <span style="color: #008000;">&#125;</span>
&nbsp;
      <span style="color: #666666;">// If we get to the beginning of the string</span>
      <span style="color: #666666;">// and still haven't found a mismatch,</span>
      <span style="color: #666666;">// we have the position.</span>
      <span style="color: #0000ff;">else</span> <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span>i <span style="color: #000080;">==</span> <span style="color: #0000dd;">0</span><span style="color: #008000;">&#41;</span>
        <span style="color: #0000ff;">return</span> word_start<span style="color: #008080;">;</span>
    <span style="color: #008000;">&#125;</span>
  <span style="color: #008000;">&#125;</span>
&nbsp;
  <span style="color: #0000ff;">return</span> <span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

<p>As you see, once we&#8217;ve built the skip table, it&#8217;s very easy to look up how much the character pointer should skip forward at each step.</p>
<p>But how do we build the skip table? Let&#8217;s look at <code>initialize_skip()</code></p>
<p>Let&#8217;s say that we have the string &#8220;testing&#8221;. The skip table for the letter &#8216;i&#8217; will start with every single value initialized to the maximum skip, 5.</p>
<p>We then iterate backwards through the string, finding the first instance of each letter. If we find a &#8220;t&#8221; instead of an &#8220;i&#8221;, we can only skip 1. If we find &#8220;s&#8221; or &#8220;e&#8221;, we can skip 2 and 3 respectively.</p>
<p>Notice that we don&#8217;t assume that it might be the &#8220;t&#8221; that starts the word. If we did that, we might jump the pattern too far and miss the match. We only want the first match going from right to left.</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">void</span> initialize_skip<span style="color: #008000;">&#40;</span>vector<span style="color: #000040;">&lt;</span>vector<span style="color: #000040;">&lt;</span><span style="color: #0000ff;">int</span><span style="color: #000040;">&gt;</span> <span style="color: #000040;">&gt;</span><span style="color: #000040;">&amp;</span> skip_table,
                    <span style="color: #0000ff;">const</span> string<span style="color: #000040;">&amp;</span> pattern<span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
  <span style="color: #0000ff;">unsigned</span> <span style="color: #0000ff;">int</span> pattern_size <span style="color: #000080;">=</span> pattern.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
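  <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> i<span style="color: #000080;">=</span><span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> i<span style="color: #000040;">&lt;</span>pattern_size<span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i<span style="color: #008000;">&#41;</span>
  <span style="color: #008000;">&#123;</span>
    <span style="color: #666666;">// The feed cuts the listing off here. Assumed: a character that</span>
    <span style="color: #666666;">// never appears before index i stores a skip of i, and the</span>
    <span style="color: #666666;">// search loop's ++word_start supplies the final step.</span>
    <span style="color: #0000ff;">int</span> max_skip <span style="color: #000080;">=</span> i<span style="color: #008080;">;</span>
    vector<span style="color: #000040;">&lt;</span><span style="color: #0000ff;">int</span><span style="color: #000040;">&gt;</span> to_push<span style="color: #008000;">&#40;</span>ALPHABET_SIZE, max_skip<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>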
&nbsp;
    <span style="color: #666666;">// Have we seen a character yet?</span>
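    vector<span style="color: #000040;">&lt;</span><span style="color: #0000ff;">bool</span><span style="color: #000040;">&gt;</span> used<span style="color: #008000;">&#40;</span>ALPHABET_SIZE, <span style="color: #0000ff;">false</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>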
&nbsp;
    <span style="color: #0000ff;">int</span> counter <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> j<span style="color: #000080;">=</span>i<span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008080;">;</span> j<span style="color: #000040;">&gt;</span><span style="color: #000080;">=</span><span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> <span style="color: #000040;">--</span>j<span style="color: #008000;">&#41;</span>
    <span style="color: #008000;">&#123;</span>
      <span style="color: #666666;">// If we haven't used a character</span>
      <span style="color: #666666;">// yet, add its value in the lookup table</span>
      <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span><span style="color: #000040;">!</span>used<span style="color: #008000;">&#91;</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#40;</span>pattern<span style="color: #008000;">&#91;</span>j<span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span>
      <span style="color: #008000;">&#123;</span>
        to_push<span style="color: #008000;">&#91;</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#40;</span>pattern<span style="color: #008000;">&#91;</span>j<span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#93;</span><span style="color: #000080;">=</span>counter<span style="color: #008080;">;</span>
        used<span style="color: #008000;">&#91;</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#40;</span>pattern<span style="color: #008000;">&#91;</span>j<span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> <span style="color: #0000ff;">true</span><span style="color: #008080;">;</span>
      <span style="color: #008000;">&#125;</span>
&nbsp;
      counter<span style="color: #000040;">++</span><span style="color: #008080;">;</span>
    <span style="color: #008000;">&#125;</span>
    skip_table.<span style="color: #007788;">push_back</span><span style="color: #008000;">&#40;</span>to_push<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
  <span style="color: #008000;">&#125;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>
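<p>To check the &#8220;testing&#8221; example by hand, here is a small sketch that computes the total distance the pattern slides for one mismatch. <code>total_shift</code> is a hypothetical helper, not part of the original program. In the listing above, <code>counter</code> starts at 0 and the search loop&#8217;s <code>++word_start</code> adds one more step, so the stored table entry appears to be one less than the total computed here.</p>

```cpp
#include <string>

// Hypothetical helper: total number of positions the pattern slides when
// pattern[mismatch_index] fails to match the text character `seen`.
int total_shift(const std::string& pattern, int mismatch_index, char seen)
{
    // Walk right to left from just before the mismatch. The first
    // occurrence of `seen` we meet decides the shift.
    for (int j = mismatch_index - 1; j >= 0; --j)
    {
        if (pattern[j] == seen)
            return mismatch_index - j;
    }
    // `seen` never occurs before the mismatch: slide completely past it.
    return mismatch_index + 1;
}
```

<p>For a mismatch at the &#8216;i&#8217; of &#8220;testing&#8221; (index 4), this gives 1 for &#8216;t&#8217;, 2 for &#8216;s&#8217;, 3 for &#8216;e&#8217;, and 5 for any character that does not appear earlier in the pattern. The &#8216;t&#8217; at index 3 wins over the &#8216;t&#8217; at index 0, which is what keeps us from jumping past a match.</p>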

<p><strong>Complexity and Potential Problems</strong></p>
<p>The worst case runtime of this algorithm is still <em>N*M</em>, although the average case for many different kinds of text (including natural language) is roughly <em>M/N</em>.</p>
<p>Pathological cases are still easy to construct. Try coming up with one yourself, or scroll down to the test section to see a sample string that can be pathological!</p>
<p>Worst case aside, this algorithm is incredibly useful on many different kinds of inputs, <em>especially</em> natural text. Having an average runtime of roughly <em>M/N</em> is awesome, and is head and shoulders above brute force.</p>
<h3>Rabin-Karp: Clever Use of Hash Values</h3>
<p><em><strong>Extreme Math Warning</strong></em></p>
<p>The above algorithms both suffer from the same weakness: It is easy to construct pathological cases. If consistency is the goal of your program, these algorithms obviously would fall flat on their face for certain input types. It doesn&#8217;t matter if the algorithm runs fast in some cases: as soon as your user has to wait for your program to respond, they&#8217;re not happy.</p>
<p>Enter the Rabin-Karp string searching method. It treats a run of consecutive characters as the digits of one large number, and uses that number as a hash value, much like a key entered into a hash table.</p>
<p>For example, given the string &#8220;abba&#8221; and an alphabet size of 256, the hash value is calculated as follows:</p>
<p><img src='/blog/wp-content/plugins/latexrender/pictures/3e13f2edadcb426f4c474d93ce2d5d01_1.83333pt.gif' title='$a*256^{3} + b*256^{2} + b*256 + a$' alt='$a*256^{3} + b*256^{2} + b*256 + a$'  style="vertical-align:-1.83333pt;" ></p>
<p>For more on the subject, see my older posts on <a href="http://www.jakevoytko.com/blog/2007/09/16/number-theory-for-programmers-part-1/">number theory</a> and <a href="http://www.jakevoytko.com/blog/2007/09/30/number-theory-hash-tables-and-geometric-progressions/">hash tables</a>.</p>
<p>The algorithm is as follows: we first generate the hash of the pattern, and the hash of the first <em>N</em> characters of the text. If the two hash values are equal, great! If they&#8217;re not equal, we remove the highest term, multiply the value by the alphabet size, and then add the next term. Simple, right?</p>
<p>Actually, I didn&#8217;t even understand that. Let&#8217;s see it written out with TeX.</p>
<p>The source text string: <strong>aabba</strong><br />
The pattern to find: abba</p>
<p>The first four letters of the source, hashed:<br />
<img src='/blog/wp-content/plugins/latexrender/pictures/d39dfac15d8869018bde48727f775e81_3.5pt.gif' title='$a*256^{3} + a*256^{2} + b*256 + b (mod\ p)$' alt='$a*256^{3} + a*256^{2} + b*256 + b (mod\ p)$'  style="vertical-align:-3.5pt;" ></p>
<p>This is extraordinarily unlikely to produce a hash value that matches the hash of the pattern &#8220;abba&#8221;. We first remove the highest term, as we no longer want it in the hash.<br />
<img src='/blog/wp-content/plugins/latexrender/pictures/f38d74121339a949467463911d1b8dfb.gif' title='$a*256^{3} + a*256^{2} + b*256 + b - a*256^{3}$' alt='$a*256^{3} + a*256^{2} + b*256 + b - a*256^{3}$'  align=absmiddle><br />
<img src='/blog/wp-content/plugins/latexrender/pictures/f6debaf04f6998f67d2060bf0930133f_3.5pt.gif' title='$\ =\ a*256^{2} + b*256 + b(mod\ p)$' alt='$\ =\ a*256^{2} + b*256 + b(mod\ p)$'  style="vertical-align:-3.5pt;" ></p>
<p>We then multiply the value by the alphabet size, 256:<br />
<img src='/blog/wp-content/plugins/latexrender/pictures/0b44e720378ca560f6c53aaa0bab33ca_3.5pt.gif' title='$256 * (a*256^{2} + b*256 + b)$' alt='$256 * (a*256^{2} + b*256 + b)$'  style="vertical-align:-3.5pt;" ><br />
<img src='/blog/wp-content/plugins/latexrender/pictures/32a9a9b8e11a4f8235b8edc2d6fb2103_3.5pt.gif' title='$\ =\ a*256^{3} + b*256^{2} + b*256(mod\ p)$' alt='$\ =\ a*256^{3} + b*256^{2} + b*256(mod\ p)$'  style="vertical-align:-3.5pt;" ></p>
<p>And add the next character in the sequence, a:<br />
<img src='/blog/wp-content/plugins/latexrender/pictures/153a9472031d88b785993e1453fd4471_3.5pt.gif' title='$a*256^{3} + b*256^{2} + b*256 + a(mod\ p)$' alt='$a*256^{3} + b*256^{2} + b*256 + a(mod\ p)$'  style="vertical-align:-3.5pt;" ></p>
<p>Which obviously DOES hash to the value of the search string &#8220;abba&#8221;, as it&#8217;s the exact same thing we had above. Brilliant!</p>
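<p>The same walkthrough can be checked in code. This is a minimal sketch under a few assumptions: <code>window_hash</code>, <code>highest_power</code>, and <code>roll</code> are hypothetical helpers, <code>PRIME</code> is simply the modulus taken from the listing below, and <code>unsigned int</code> is assumed to be at least 32 bits wide.</p>

```cpp
#include <cstddef>
#include <string>

const unsigned int ALPHABET_SIZE = 256;
const unsigned int PRIME = 5800079;  // modulus taken from the post's listing

// Hash of the `length` characters of `s` that start at index `start`.
unsigned int window_hash(const std::string& s, std::size_t start,
                         std::size_t length)
{
    unsigned int h = 0;
    for (std::size_t i = 0; i < length; ++i)
        h = (h * ALPHABET_SIZE + static_cast<unsigned char>(s[start + i])) % PRIME;
    return h;
}

// ALPHABET_SIZE^(length-1) mod PRIME: the weight of the leftmost character.
unsigned int highest_power(std::size_t length)
{
    unsigned int p = 1;
    for (std::size_t i = 1; i < length; ++i)
        p = (p * ALPHABET_SIZE) % PRIME;
    return p;
}

// Slide the window one character to the right.
unsigned int roll(unsigned int h, unsigned char outgoing,
                  unsigned char incoming, unsigned int top)
{
    // Remove the highest term. Adding ALPHABET_SIZE * PRIME first keeps the
    // unsigned subtraction from wrapping around.
    h = (h + ALPHABET_SIZE * PRIME - outgoing * top) % PRIME;
    // Multiply by the alphabet size and add the next character.
    return (h * ALPHABET_SIZE + incoming) % PRIME;
}
```

<p>Hashing the first four characters of &#8220;aabba&#8221; and then calling <code>roll</code> with the outgoing &#8216;a&#8217; and the incoming &#8216;a&#8217; should give the same value as hashing &#8220;abba&#8221; directly.</p>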
<p>There <em>is</em> a small chance of a collision, where a stretch of text that doesn&#8217;t match our pattern still hashes to the same value. We still need to compare the characters in place to make sure that we have a real match, and not one that coincidentally shares the hash.</p>
<p><strong>Implementation in C++</strong></p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">int</span> rabin_karp<span style="color: #008000;">&#40;</span><span style="color: #0000ff;">const</span> string<span style="color: #000040;">&amp;</span> text, <span style="color: #0000ff;">const</span> string<span style="color: #000040;">&amp;</span> pattern<span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    <span style="color: #0000ff;">int</span> search_size<span style="color: #000080;">=</span> text.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">-</span> pattern.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">int</span> pattern_size <span style="color: #000080;">=</span> pattern.<span style="color: #007788;">size</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #666666;">// A large prime such that prime &lt; (2^32)/(ALPHABET_SIZE+1)</span>
    <span style="color: #666666;">//</span>
    <span style="color: #666666;">// This is our hash modulus</span>
    <span style="color: #0000ff;">unsigned</span> <span style="color: #0000ff;">int</span> prime <span style="color: #000080;">=</span> <span style="color: #0000dd;">5800079</span><span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #0000ff;">unsigned</span> <span style="color: #0000ff;">int</span> alphabetM<span style="color: #000080;">=</span><span style="color: #0000dd;">1</span><span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #666666;">// Less likely to overflow than powmod()</span>
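    <span style="color: #666666;">// (The feed truncates the listing here; the lines down to the</span>
    <span style="color: #666666;">// search loop are an assumed reconstruction.)</span>
    <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> i<span style="color: #000080;">=</span><span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> i<span style="color: #000040;">&lt;</span>pattern_size<span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i<span style="color: #008000;">&#41;</span>
        alphabetM <span style="color: #000080;">=</span> <span style="color: #008000;">&#40;</span>alphabetM<span style="color: #000040;">*</span>ALPHABET_SIZE<span style="color: #008000;">&#41;</span> <span style="color: #000040;">%</span> prime<span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #666666;">// The pattern cannot fit in a shorter text.</span>
    <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span>search_size <span style="color: #000040;">&lt;</span> <span style="color: #0000dd;">0</span><span style="color: #008000;">&#41;</span>
        <span style="color: #0000ff;">return</span> <span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008080;">;</span>
&nbsp;
    <span style="color: #666666;">// Hash the pattern and the first pattern_size characters of the text.</span>
    <span style="color: #0000ff;">unsigned</span> <span style="color: #0000ff;">int</span> patternHash <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span>, textHash <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> i<span style="color: #000080;">=</span><span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> i<span style="color: #000040;">&lt;</span>pattern_size<span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i<span style="color: #008000;">&#41;</span>
    <span style="color: #008000;">&#123;</span>
        patternHash <span style="color: #000080;">=</span> <span style="color: #008000;">&#40;</span>patternHash<span style="color: #000040;">*</span>ALPHABET_SIZE <span style="color: #000040;">+</span> <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#40;</span>pattern<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">%</span> prime<span style="color: #008080;">;</span>
        textHash <span style="color: #000080;">=</span> <span style="color: #008000;">&#40;</span>textHash<span style="color: #000040;">*</span>ALPHABET_SIZE <span style="color: #000040;">+</span> <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#40;</span>text<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">%</span> prime<span style="color: #008080;">;</span>
    <span style="color: #008000;">&#125;</span>
&nbsp;
    <span style="color: #0000ff;">for</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> i<span style="color: #000080;">=</span><span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> i<span style="color: #000040;">&lt;</span><span style="color: #000080;">=</span>search_size<span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i<span style="color: #008000;">&#41;</span>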
    <span style="color: #008000;">&#123;</span>
        <span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span>patternHash <span style="color: #000080;">==</span> textHash <span style="color: #000040;">&amp;&amp;</span> brute_force_match<span style="color: #008000;">&#40;</span>text, pattern, i<span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span>
            <span style="color: #0000ff;">return</span> i<span style="color: #008080;">;</span>
&nbsp;
        textHash <span style="color: #000080;">=</span> <span style="color: #008000;">&#40;</span>textHash <span style="color: #000040;">+</span> ALPHABET_SIZE<span style="color: #000040;">*</span>prime
                        <span style="color: #000040;">-</span> <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#40;</span>text<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #000040;">*</span>alphabetM<span style="color: #008000;">&#41;</span> <span style="color: #000040;">%</span> prime<span style="color: #008080;">;</span>
&nbsp;
        textHash <span style="color: #000080;">=</span> <span style="color: #008000;">&#40;</span>textHash<span style="color: #000040;">*</span>ALPHABET_SIZE
                        <span style="color: #000040;">+</span> <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#40;</span>text<span style="color: #008000;">&#91;</span>i<span style="color: #000040;">+</span>pattern_size<span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">%</span> prime<span style="color: #008080;">;</span>
    <span style="color: #008000;">&#125;</span>
    <span style="color: #0000ff;">return</span> <span style="color: #000040;">-</span><span style="color: #0000dd;">1</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

<p><strong>Complexity and Potential Problems</strong></p>
<p>The worst case of this algorithm is <em>M*N</em>: that happens when every location in the document hashes to the same value as the pattern, so the brute-force check runs at every position. On ordinary text this is <em>extremely</em> unlikely, though someone who knows the prime and the alphabet size could deliberately construct colliding input. My test of &#8220;Moby Dick&#8221; didn&#8217;t come up with a single collision in the entire text of the novel, and the book is pretty damn big.</p>
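<p>To make the collision claim concrete, here is a minimal sketch that counts spurious hits: windows whose hash equals the pattern hash even though the characters differ. The name <code>count_spurious_hits</code>, the base of 256, and the 64-bit arithmetic are assumptions made for illustration; this is not the code from the program above.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: counts windows of `text` whose rolling hash equals the
// hash of `pattern` although the window itself is different from the pattern.
std::size_t count_spurious_hits(const std::string& text,
                                const std::string& pattern,
                                std::uint64_t prime)
{
    const std::uint64_t base = 256;  // assumed alphabet size
    const std::size_t m = pattern.size();
    // Keeping prime below 2^32 means no product below can overflow 64 bits.
    assert(m > 0 && prime > 1 && prime < (std::uint64_t{1} << 32));
    if (text.size() < m) return 0;

    std::uint64_t high = 1;  // base^(m-1) % prime
    for (std::size_t i = 1; i < m; ++i) high = (high * base) % prime;

    std::uint64_t pattern_hash = 0, text_hash = 0;
    for (std::size_t i = 0; i < m; ++i) {
        pattern_hash = (pattern_hash * base + static_cast<unsigned char>(pattern[i])) % prime;
        text_hash = (text_hash * base + static_cast<unsigned char>(text[i])) % prime;
    }

    std::size_t spurious = 0;
    for (std::size_t i = 0; ; ++i) {
        if (text_hash == pattern_hash && text.compare(i, m, pattern) != 0)
            ++spurious;
        if (i + m >= text.size()) break;  // that was the last window
        // Remove the leading character, then shift in the next one.
        const std::uint64_t lead = (static_cast<unsigned char>(text[i]) * high) % prime;
        text_hash = (text_hash + prime - lead) % prime;
        text_hash = (text_hash * base + static_cast<unsigned char>(text[i + m])) % prime;
    }
    return spurious;
}
```

<p>With a tiny modulus such as 2, <code>count_spurious_hits(std::string(10, 'a'), "ca", 2)</code> reports 9, because every window collides. With a large prime such as 1000000007, the same call reports 0.</p>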
<h3>Running the Algorithms</h3>
<p>Let&#8217;s take a look at how the algorithms fare when they compete to find the string &#8220;devious-cruising Rachel&#8221; in the text of &#8220;Moby Dick&#8221;. Aside from being a hysterical phrase, &#8220;devious-cruising Rachel&#8221; appears a single time in the book: in the very last paragraph.</p>
<p><em>The program was compiled with -O2</em></p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;">Loading Moby Dick
&nbsp;
Text length is <span style="color: #0000dd;">1201110</span> characters
&nbsp;
Position of <span style="color: #FF0000;">&quot;devious-cruising Rachel&quot;</span><span style="color: #008080;">:</span> <span style="color: #0000dd;">1201001</span>
Brute force took <span style="color:#800080;">0.007865</span> seconds
&nbsp;
Position of <span style="color: #FF0000;">&quot;devious-cruising Rachel&quot;</span><span style="color: #008080;">:</span> <span style="color: #0000dd;">1201001</span>
Boyer<span style="color: #000040;">-</span>Moore took <span style="color:#800080;">0.001762</span> seconds
&nbsp;
Position of <span style="color: #FF0000;">&quot;devious-cruising Rachel&quot;</span><span style="color: #008080;">:</span> <span style="color: #0000dd;">1201001</span>
Rabin<span style="color: #000040;">-</span>Karp took <span style="color:#800080;">0.025224</span> seconds</pre></div></div>

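<p>And Boyer-Moore is the winner by a long shot: it finishes roughly four and a half times faster than brute force.</p>
<p>Rabin-Karp is much slower on this run, taking about three times as long as brute force. However, its cost per character barely depends on what the text looks like, so we are going to get consistent performance from it. Don&#8217;t count it out yet.</p>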
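<p>Timings like the ones above can be collected with a small wrapper around <code>std::chrono</code>. This is only a sketch: <code>time_search</code> and <code>find_with_std</code> are hypothetical names, and the stand-in search uses <code>std::string::find</code> rather than any of the algorithms in this post.</p>

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>

// Hypothetical harness: runs one search and returns {position, seconds}.
template <typename Search>
std::pair<long, double> time_search(Search search,
                                    const std::string& text,
                                    const std::string& pattern)
{
    assert(!pattern.empty());
    const auto start = std::chrono::steady_clock::now();
    const long position = search(text, pattern);
    const auto stop = std::chrono::steady_clock::now();
    return {position, std::chrono::duration<double>(stop - start).count()};
}

// Stand-in search built on std::string::find; returns -1 when absent.
long find_with_std(const std::string& text, const std::string& pattern)
{
    const std::string::size_type pos = text.find(pattern);
    return pos == std::string::npos ? -1 : static_cast<long>(pos);
}
```

<p>Any function that takes the text and the pattern and returns a position can be passed in place of <code>find_with_std</code>. A single run is noisy, so repeating the measurement a few times gives a fairer comparison.</p>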
<p>Let&#8217;s see how the various methods deal with the worst case for the brute force algorithm:</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;">Loading Brute Force torture test
&nbsp;
Text length is <span style="color: #0000dd;">11015500</span> characters
&nbsp;
Position of <span style="color: #FF0000;">&quot;aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab&quot;</span>
        <span style="color: #008080;">:</span> <span style="color: #0000dd;">110113</span>
Brute force took <span style="color:#800080;">0.010162</span> seconds
&nbsp;
Position of <span style="color: #FF0000;">&quot;aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab&quot;</span>
        <span style="color: #008080;">:</span> <span style="color: #0000dd;">110113</span>
Boyer<span style="color: #000040;">-</span>Moore took <span style="color:#800080;">0.001673</span> seconds
&nbsp;
Position of <span style="color: #FF0000;">&quot;aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab&quot;</span>
        <span style="color: #008080;">:</span> <span style="color: #0000dd;">110113</span>
Rabin<span style="color: #000040;">-</span>Karp took <span style="color:#800080;">0.00218</span> seconds</pre></div></div>

<p>Rabin-Karp has pulled into second place for this particular run, while brute force falls to last. Again, it&#8217;s a consistent bugger.</p>
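<p>Why this input hurts brute force is easier to see by counting character comparisons instead of seconds. The sketch below is a plain left-to-right scan with a counter added. <code>BruteForceRun</code> and <code>brute_force_counted</code> are hypothetical names, and this is not the implementation that was timed above.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct BruteForceRun {
    long position;            // index of the first match, or -1
    std::size_t comparisons;  // character comparisons performed
};

// Brute-force search that also reports how many characters it compared.
BruteForceRun brute_force_counted(const std::string& text, const std::string& pattern)
{
    assert(!pattern.empty());
    BruteForceRun run{-1, 0};
    const std::size_t n = text.size(), m = pattern.size();
    for (std::size_t i = 0; i + m <= n; ++i) {
        std::size_t j = 0;
        while (j < m) {
            ++run.comparisons;
            if (text[i + j] != pattern[j]) break;
            ++j;
        }
        if (j == m) {
            run.position = static_cast<long>(i);
            return run;
        }
    }
    return run;
}
```

<p>For 100 a&#8217;s followed by a b, searching for <code>aaaab</code> costs 485 comparisons, which is close to five per position. Every alignment matches almost the whole pattern before the final character fails.</p>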
<p>And now let&#8217;s look at a worst-case file for the Boyer-Moore algorithm:</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;">Loading Boyer<span style="color: #000040;">-</span>Moore torture test
&nbsp;
Text length is <span style="color: #0000dd;">11015556</span> characters
&nbsp;
Position of <span style="color: #FF0000;">&quot;baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&quot;</span>
        <span style="color: #008080;">:</span> <span style="color: #0000dd;">110154</span>
Brute force took <span style="color:#800080;">0.000656</span> seconds
&nbsp;
Position of <span style="color: #FF0000;">&quot;baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&quot;</span>
        <span style="color: #008080;">:</span> <span style="color: #0000dd;">110154</span>
Boyer<span style="color: #000040;">-</span>Moore took <span style="color:#800080;">0.010248</span> seconds
&nbsp;
Position of <span style="color: #FF0000;">&quot;baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&quot;</span>
        <span style="color: #008080;">:</span> <span style="color: #0000dd;">110154</span>
Rabin<span style="color: #000040;">-</span>Karp took <span style="color:#800080;">0.00221</span> seconds</pre></div></div>

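<p>We see that the Boyer-Moore algorithm has an awful runtime on this input. It compares from the end of the pattern, so at almost every position it matches the whole run of a&#8217;s before failing on the leading b, and all of that re-scanning adds up. Again, the Rabin-Karp algorithm is looking pretty consistent!</p>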
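<p>The same effect shows up when counting comparisons. The sketch below is a simplified Boyer-Moore that uses only the bad-character rule; whether the timed implementation also omits the good-suffix rule is an assumption on my part. <code>BadCharRun</code> and <code>bad_character_counted</code> are hypothetical names.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

struct BadCharRun {
    long position;            // index of the first match, or -1
    std::size_t comparisons;  // character comparisons performed
};

// Right-to-left matching with the bad-character shift only (no good-suffix rule).
BadCharRun bad_character_counted(const std::string& text, const std::string& pattern)
{
    assert(!pattern.empty());
    const std::size_t n = text.size(), m = pattern.size();

    // last[c] = index of the rightmost occurrence of c in the pattern, or -1.
    std::array<long, 256> last;
    last.fill(-1);
    for (std::size_t j = 0; j < m; ++j)
        last[static_cast<unsigned char>(pattern[j])] = static_cast<long>(j);

    BadCharRun run{-1, 0};
    std::size_t i = 0;  // current alignment of the pattern in the text
    while (i + m <= n) {
        long j = static_cast<long>(m) - 1;
        while (j >= 0) {
            ++run.comparisons;
            if (text[i + j] != pattern[j]) break;
            --j;
        }
        if (j < 0) {
            run.position = static_cast<long>(i);
            return run;
        }
        // Line the mismatched text character up with its last occurrence in
        // the pattern; if that would move backwards, advance by one instead.
        const unsigned char bad = static_cast<unsigned char>(text[i + j]);
        const long shift = j - last[bad];
        i += shift > 0 ? static_cast<std::size_t>(shift) : 1;
    }
    return run;
}
```

<p>Searching for <code>baaaa</code> in 100 a&#8217;s followed by <code>baaaa</code> costs 486 comparisons with this sketch, where a left-to-right brute-force scan needs only 105. The mismatch always lands on the first pattern character, so the bad-character rule can never suggest a shift larger than one.</p>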
<h3>Where Do I Go Next?</h3>
<p>The algorithms described above were written to find the first occurrence of a single pattern in a string of text. We still haven&#8217;t considered the problem of finding all of the matches for a certain string in a document. There is also the Knuth-Morris-Pratt algorithm, which never backs up in the text: it works well for strings that are highly repetitive in nature, and is an improvement upon brute force otherwise.</p>
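<p>As a starting point for both ideas, here is a textbook-style sketch of Knuth-Morris-Pratt that reports every match, including overlapping ones. The names <code>build_failure_table</code> and <code>kmp_find_all</code> are hypothetical, and nothing here comes from the program used in this post.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// failure[j] = length of the longest proper prefix of pattern[0..j]
// that is also a suffix of pattern[0..j].
std::vector<std::size_t> build_failure_table(const std::string& pattern)
{
    std::vector<std::size_t> failure(pattern.size(), 0);
    std::size_t k = 0;
    for (std::size_t j = 1; j < pattern.size(); ++j) {
        while (k > 0 && pattern[j] != pattern[k]) k = failure[k - 1];
        if (pattern[j] == pattern[k]) ++k;
        failure[j] = k;
    }
    return failure;
}

// Returns the starting index of every occurrence of pattern in text.
std::vector<std::size_t> kmp_find_all(const std::string& text, const std::string& pattern)
{
    assert(!pattern.empty());
    const std::vector<std::size_t> failure = build_failure_table(pattern);
    std::vector<std::size_t> matches;
    std::size_t k = 0;  // number of pattern characters currently matched
    for (std::size_t i = 0; i < text.size(); ++i) {
        // On a mismatch, fall back in the pattern; the text index never moves back.
        while (k > 0 && text[i] != pattern[k]) k = failure[k - 1];
        if (text[i] == pattern[k]) ++k;
        if (k == pattern.size()) {
            matches.push_back(i + 1 - k);
            k = failure[k - 1];  // continue, so overlapping matches are found too
        }
    }
    return matches;
}
```

<p>For example, <code>kmp_find_all("aaaa", "aa")</code> returns the positions 0, 1, and 2. Each text character is visited once and the fallback steps are bounded by the earlier advances, so the scan is linear in the length of the text.</p>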
]]></content:encoded>
			<wfw:commentRss>http://www.jakevoytko.com/blog/2007/12/11/fun-with-string-searching/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>
