Fun With String Searching

This post was written by Jake on December 11, 2007
Posted Under: Computer Science,Programming,Software Development

Source of the program used.

Almost every major program that works with text performs string searches. There are some juicy algorithms in this area, each with their own tradeoffs. Choosing the wrong algorithm for the wrong task produces awful results, and we need to carefully weigh the consequences of each algorithm. I’m going to take a look at just three of them: brute force, Boyer-Moore, and Rabin-Karp.

If you want to cut straight to the chase, feel free to scroll to the bottom to see the results section.

Why these three? Each of these algorithms offers a completely different perspective on the idea of finding strings. There are other algorithms that perform specialized tasks (Knuth-Morris-Pratt for repetitive source strings, for example), but we’re not interested in these today.

Brute Force: Simple, and (Sometimes) Slow

As we see in the image below, the brute force algorithm tries to generate a match from every single starting position.

It should be obvious to the reader that the algorithm can’t possibly miss a match, as every single position is tried. If the pattern exists in the text, brute force search will find it.

Note: Green represents a single character match, and Red represents a single character mismatch.
Brute Force image

The string “This is some Test text” isn’t very tough on the algorithm, as there is only one mismatch before we find the correct position of the word “Text”. However, it is very easy to construct pathological cases for this algorithm. It doesn’t perform well for binary data, or any highly-repetitive pattern. If it has to match the pattern of a whole bunch of As with a B, it could very easily perform poorly. If we consider finding AAA..AAB in the string AAA……….AAAB it’s obvious that the algorithm will be taken down a costly blind alley at every single position

Implementation in C++

To check to see if we have a match at any position, all we have to do is check to see if each character of the string is a match. If it is not, we can return false.

bool brute_force_match(const string& text,
                       const string& pattern, int pos)
{
    // If the source text is shorter than the
    // pattern + the position, we can't possibly
    // have a match.
    if(pos + pattern.size() > text.size())
        return false;
 
    int pattern_size = pattern.size();
 
    // Check every single character for a mismatch.
    for(int i=0; i

To search every position, we use the following snippet:

int brute_force(const string& text, const string& pattern)
{
    int search_size = text.size() - pattern.size();
 
    for(int i=0; i <= search_size; ++i)
    {
        if(brute_force_match(text, pattern, i))
        {
            return i;
        }
    }
    return -1;
}

Brutally simple!

Complexity and Potential Problems

In certain document types, brute force is a valid solution for finding a string. Looking for strings in natural language files is still somewhat fast, as the algorithm is likely to into mismatches pretty fast. There are better algorithms (like Boyer-Moore, as written below), but the implementation time of brute force is not to be taken lightly.

When we don’t have a source string with a large number of “false starts”, we can expect the runtime of the algorithm to be roughly linear with the size of the file. It will have a runtime of roughly M+N, for M = text size, and N = pattern size. However, for searching in binary strings, or in strings with many repetitions of close matches to our pattern, the runtime is closer to M*N.

It takes less than 1/10 of a second to find a small string at the end of “Moby Dick” (see below)! I wasn’t joking when I said that brute force is a fast algorithm for certain inputs. Treat this algorithm like Bubblesort: use it for the plainest and smallest inputs.

Boyer-Moore: Right-to-Left Trickery

One of the assumptions that we made for the brute force algorithm is the left-to-right search. The direction of the search isn’t a requirement. In fact, we can search through the text however we please!

Since we are given the whole pattern string, we have a tremendous amount of information at our disposal. We just need to know what to look for. For this algorithm, when we come across a mismatch, there might be information from the mismatch that helps us speed up the search.

Let’s see an example of how we can use right-to-left searching combined with mismatches to search faster:

Note: Green represents a single character match, and Red represents a single character mismatch.
Boyer-Moore img

We see that on step 1, the ‘s’ doesn’t match the ‘t’, so we slide the pattern forward 1. Please note that this will line the ‘s’ in “Test” up with the ‘s’ in “This”.

Step 2 demonstrates the magic of the algorithm: there isn’t a <space> at all in “Test”, so we can completely skip over the letter in the source text wholesale. We see this happen again in step 4 with the ‘o’. There is no ‘o’ in “Test”, so why waste time trying to match it?

The Boyer-Moore algorithm takes advantage of these jumps. We search from left to right, looking for the first mismatch. When we find one, we can compute how far we’ll need to slide our search pattern ahead in order to match the character. We can also use a lookup table to speed up the calculation of how far we need to jump.

Implementation in C++

The size of the lookup table for a pattern of N characters is 256 * N bytes for ASCII alphabets, which is cake on modern computers. For larger alphabets, such as Unicode, you need an associative data container in order to be able to implement the algorithm in memory.

The algorithm itself is very easy to follow:

int boyer_moore(const string&amp; text, const string&amp; pattern)
{
  int search_size= text.size() - pattern.size();
  int other_ind, word_start;
  vector &gt; skip_table;
 
  initialize_skip(skip_table, pattern);
 
  for(word_start=0; word_start&lt;=search_size; ++word_start)
  {
    for(int i=pattern.size()-1; i&gt;=0; --i)
    {
      // If there is a mismatch
      if(text[word_start + i] != pattern[i])
      {
        other_ind = (int)(text[word_start +i]);
 
        // Skip forward the proper number of steps.
        // Look this up in the lookup table.
        word_start += skip_table[i][other_ind];
 
        break;
      }
 
      // If we get to the beginning of the string
      // and still haven't found a mismatch,
      // we have the position.
      else if(i == 0)
        return word_start;
    }
  }
 
  return -1;
}

As you see, once we’ve built the skip table, it’s very easy to look up how much the character pointer should skip forward at each step.

But how do we build the skip table? Let’s look at initialize_skip()

Let’s say that we have the string “testing”. The skip table for the letter ‘i’ will start with every single value initialized to the maximum skip, 5.

We then iterate backwards through the string, finding the first instance of each letter. If we find a “t” instead of an “i”, we can only skip 1. If we find “s” or “e”, we can skip 2 and 3 respectively.

Notice that we don’t assume that it might be the “t” that starts the word. If we did that, we might jump the pattern too far and miss the match. We only want the first match going from right to left.

void initialize_skip(vector &gt;&amp; skip_table,
                    const string&amp; pattern)
{
  unsigned int pattern_size = pattern.size();
 
  for(int i=0; i
 to_push(ALPHABET_SIZE, max_skip);
 
    // Have we seen a character yet?
    vector used(ALPHABET_SIZE, false);
 
    int counter = 0;
 
    for(int j=i-1; j&gt;=0; --j)
    {
      // If we haven't used a character
      // yet, add its value in the lookup table
      if(!used[(int)(pattern[j])])
      {
        to_push[(int)(pattern[j])]=counter;
        used[(int)(pattern[j])] = true;
      }
 
      counter++;
    }
    skip_table.push_back(to_push);
  }
}

Complexity and Potential Problems

The worst case runtime of this algorithm is still N*M, although the average case for many different kinds of text (including natural language), is roughly M/N.

Pathological cases are still easy to construct. Try coming up with one yourself, or scroll down to the test section to see a sample string that can be pathological!

Worst case aside, this algorithm is incredibly useful on many different kinds of inputs, especially natural text. Having an average of M/N runtime is awesome, and is heads and shoulders better than brute force.

Rabin-Karp: Clever Use of Hash Values

Extreme Math Warning

The above algorithms both suffer from the same weakness: It is easy to construct pathological cases. If consistency is the goal of your program, these algorithms obviously would fall flat on their face for certain input types. It doesn’t matter if the algorithm runs fast in some cases: as soon as your user has to wait for your program to respond, they’re not happy.

Enter the Rabin-Karp string searching method. It treats the string like it is a large number that has been entered into a hash table. We consider consecutive characters in a string all as the same number.

For example, have the string “abba”, with the alphabet size = 256, the hash value is calculated as follows:

$a*256^{3} + b*256^{2} + b*256 + a$

For more on the subject, see my older posts on number theory and hash tables.

The algorithm is as follows: we first generate the hash of the pattern, and the hash of the first N characters of the text. If the two hash values are equal, great! If they’re not equal, we remove the highest term, multiply the value by the alphabet size, and then add the next term. Simple, right?

Actually, I didn’t even understand that. Let’s see it written out with TeX.

The source text string: aabba
The pattern to find: abba

The first four letters of the source, hashed:
$a*256^{3} + a*256^{2} + b*256 + b (mod\ p)$

This is extraordinarily unlikely to produce a hash value that matches the hash of the text “abba”. We first remove the highest term, as we no longer want it in the hash.
$a*256^{3} + a*256^{2} + b*256 + b &#8211; a*256^{3}$
$\ =\ a*256^{2} + b*256 + b(mod\ p)$

We then multiply the value by the alphabet size, 256:
$256 * (a*256^{2} + b*256 + b)$
$\ =\ a*256^{3} + b*256^{2} + b*256(mod\ p)$

And add the next character in the sequence, a:
$a*256^{3} + b*256^{2} + b*256 + a(mod\ p)$

Which obviously DOES hash to the value of the search string “abba”, as it’s the exact same thing we had above. Brilliant!

There is a small chance that there will be a collision for text values that don’t match our source string. We still need to check the string in place in order to make sure that we have the correct string, and not one that coincidentally matches.

Implementation in C++

int rabin_karp(const string&amp; text, const string&amp; pattern)
{
    int search_size= text.size() - pattern.size();
    int pattern_size = pattern.size();
 
    // A large prime such that prime &lt; (2^32)/(ALPHABET_SIZE+1)
    //
    // This is our hash value
    unsigned int prime = 5800079;
 
    unsigned int alphabetM=1;
 
    // Less likely to overflow than powmod()
    for(int i=0; i
&lt;=search_size; ++i)
    {
        if(patternHash == textHash &amp;&amp; brute_force_match(text, pattern, i))
            return i;
 
        textHash = (textHash + ALPHABET_SIZE*prime
                        - (int)(text[i])*alphabetM) % prime;
 
        textHash = (textHash*ALPHABET_SIZE
                        + (int)(text[i+pattern_size])) % prime;
    }
    return -1;
}

Complexity and Potential Problems

The worst case of this algorithm is M*N, if you were to come up with a document where every single location hashes to the same value. However, this is extremely unlikely to ever happen, even if you’re trying to craft the worst-case document. My test of “Moby Dick” didn’t even come up with a single collision in the entire text of the novel, and the book is pretty damn big.

Running the Algorithms

Let’s take a look at how the algorithms fare when they compete to find the string “devious-cruising Rachel” in the text of “Moby Dick”. Aside from being a hysterical phrase, “devious-cruising Rachel” appears a single time in the book: in the very last paragraph.

The program was compiled with -O2

Loading Moby Dick
 
Text length is 1201110 characters
 
Position of "devious-cruising Rachel": 1201001
Brute force took 0.007865 seconds
 
Position of "devious-cruising Rachel": 1201001
Boyer-Moore took 0.001762 seconds
 
Position of "devious-cruising Rachel": 1201001
Rabin-Karp took 0.025224 seconds

And Boyer-Moore is the winner by a long shot!

Rabin-Karp is much slower on this run! However, we are going to get consistent performance on that algorithm, so don’t count it out yet.

Let’s see how the various methods deal with the worst-case for the brute force algorithm:

Loading Brute Force torture test
 
Text length is 11015500 characters
 
Position of "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"
        : 110113
Brute force took 0.010162 seconds
 
Position of "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"
        : 110113
Boyer-Moore took 0.001673 seconds
 
Position of "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"
        : 110113
Rabin-Karp took 0.00218 seconds

Rabin-Karp has pulled into second for this particular run. Again, it’s a consistent bugger.

And now let’s look at a worst-case file for the Boyer-Moore algorithm:

Loading Boyer-Moore torture test
 
Text length is 11015556 characters
 
Position of "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
        : 110154
Brute force took 0.000656 seconds
 
Position of "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
        : 110154
Boyer-Moore took 0.010248 seconds
 
Position of "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
        : 110154
Rabin-Karp took 0.00221 seconds

We see the Boyer-Moore algorithm has an awful runtime, from all of the backtracking that it is doing. Again, the Rabin-Karp algorithm is looking pretty consistent!

Where Do I Go Next?

The algorithms described above were written to find a single pattern in a string of text. We still haven’t considered the problem of finding all of the matches for a certain string in a document. There is also the Knuth-Morris-Pratt algorithm, an algorithm that works well for strings that are highly repetitive in nature, and is an improvement upon brute-force otherwise.

Popularity: 40% [?]

Reader Comments

That was a very helpful and informative post. I’ve been trying to read about string pattern matching algorithms, but I hadn’t previously been able to make much sense of them. I hope you write about more algorithms, like maybe KMP, which I’m still having trouble with, in the future!

#1 
Written By Daniel Ehrenberg on December 12th, 2007 @ 12:22 am

the third algorithm is very interesting! i wonder why it works…

#2 
Written By Chii on December 12th, 2007 @ 12:29 am

I’d be interested to know how your algorithms test if all your multiply operations are replaced with bitshifts in Rabin-Karp.

So,
ALPHABET_SIZE = 256;
x = … y

#3 
Written By Chief on December 12th, 2007 @ 12:58 am

Hi Jake, thanks for the post. I found it interesting and checked out the sources you provided to play a little bit. I found two bugs you might be interested in:

- You read in the files with getline(), which discards the newline. This has two effects: (a) the reported match position is not helpful to find the string in the file. (b) Boyre-Moore isn’t really “tortured”, as the match is already at the beginning of the file. You can see this from your posted results: match already at position 110,154 of 11,015,556 characters. Using read() to read the file avoids both problems and also speeds up reading the large files:
string result;
static const int buffer_size = 1

#4 
Written By Andreas Bernauer on December 12th, 2007 @ 6:02 am

I could improve Rabin-Karp by replacing the hash function with the simplest one: just adding all characters.

Of course, this results in a lot of hash equalities where there is no match, but the extra brute_force_match calls still cost less than the overhead for the real hash value.

#5 
Written By Andreas Bernauer on December 12th, 2007 @ 6:51 am

As written, Rabin-Karp is basically the linear search except you’re comparing 4 characters (sort of) at a time. Without bothering to read up on it, I imagine it could win out if you’re searching a fixed text for a large number of substrings. (Assuming you build an actual hash table.)

IIRC both Boyer-Moore and KMP optimize the other way, for finding one string in many texts or at least with a much larger haystack than needle. That’s the common case, but not the only case.

#6 
Written By josh on December 12th, 2007 @ 5:41 pm

A truly great post. Thanks for the good explanation. My Data Structures professor could not explain that topic properly. Since I am mostly using C#, I tend to avoid the C/C++ way of thinking, but after I read your post, I felt really comfortable.

Thanks once again

#7 
Written By Preslav Rachev on October 19th, 2008 @ 3:03 pm

excellent presentation !
The hardest in math is to explain things simply not to just understand them and in this you are a winner.
as said before and more relevant to my case, Rabin-Karp is the best for multi-pattern searches (see wikipedia)

#8 
Written By eran on June 16th, 2009 @ 6:57 am

Trackbacks

Add a Comment

required, use real name
required, will not be published
optional, your blog address