Basic Speed Optimizations for the Evil

Posted on October 7, 2007
Filed Under Programming |

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
~Donald Knuth

You’ve designed and written your application. You’ve carefully analyzed the algorithm, and it can’t possibly have a lower algorithmic complexity and still do its job. However, your program is still taking 11 seconds to iterate through all data, and that’s absolutely unacceptable. You may now begin to optimize your code.

At this point, inexperienced programmers fumble the ball here: they begin to try to figure out where the bottleneck is in their program. By themselves. By looking at their code.

Always profile your code when you want to improve it.

No exceptions. When optimizing for speed, you always need to know which 20% of the code is taking up 80% of the execution time. A profiler will easily (and always) tell you which functions are being called the most, and how much time your program spends executing them. Experience can only give you a guess as to what should be improved, but profilers always tell you where the code is spending its time being executed.

It’s not magic, it’s common sense.

Understand the Tradeoff of Optimization

How will we profile code?

I will be using gprof and C++ in order to produce examples, but the ideas presented below will be universally applicable between procedural and object-oriented languages. I’ll let those who are experts with functional languages to talk about functional languages, and in my experience, they will.

I am not teaching you how to use gprof (though it is rather easy). I am showing you how to do some of the optimizations I most frequently find myself doing, and the lessons to be learned from them.

Beware of Hidden Complexity!

Assume we are trying to optimize a small code snippet, as follows:

vector<int> a(vect_size), 
  b(vect_size), 
  result(vect_size);
 
/*
  Snip: Initialize data, perform initial calculations, etc.
*/
 
for(unsigned int i = 0; i < a.size(); ++i)
{
        result[j] = a[i] * b[i];
}

And you might find that the output for gprof would be something like this:

 %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
15.66      0.06     0.03  4000000     0.00     0.00 __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::operator+(long const&) const
12.53      0.08     0.02  2000002     0.00     0.00 __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >::base() const
9.40       0.09     0.02  4000000     0.00     0.00  std::vector<int, std::allocator<int> >::begin()
6.27       0.10     0.01  8000000     0.00     0.00 __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::__normal_iterator(int* const&)
6.27       0.11     0.01  4000000     0.00     0.00 __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::operator*() const
6.27       0.12     0.01  4000000     0.00     0.00  std::vector<int, std::allocator<int> >::operator[](unsigned long)
6.27       0.13     0.01  1000001     0.00     0.00  std::vector<int, std::allocator<int> >::end() const
6.27       0.14     0.01  1000001     0.00     0.00  std::vector<int, std::allocator<int> >::size() const
6.27       0.15     0.01  1000001     0.00     0.00  std::vector<int, std::allocator<int> >::begin() const
6.27       0.16     0.01        1    10.03   160.41  calculate(int)

These charts can be hard to read, but they extremely telling. The type of tasks that are taking up our processor time are: iterator manipulation.

But wait, we’re not using iterators!

Yes we are. Our calls to vector::operator[] are apparently taking up a lot more time than we ever could have imagined, and are using iterators in order to accomplish the task. Instead of using the index operator, let’s write the program the STL way and see what happens (Java, C, etc.. coders, take this as a general lesson instead of a language specific lesson!).

Maximize the Utility of Function/Method Calls

So we now replace all of the loop code with the following:

vector<int>::const_iterator a_iter = a.begin();
vector<int>::const_iterator b_iter = b.begin();
 
for(vector<int>::iterator r_iter = result.begin();
    r_iter != result.end();
    ++r_iter)
{
   *r_iter= (*a_iter) * (*b_iter);
   ++a_iter;
   ++b_iter;
}

and we get the following output from gprof:

20.88      0.06     0.03  1000001     0.00     0.00  std::vector<int, std::allocator<int> >::end()
16.70      0.08     0.02  1000001     0.00     0.00  bool __gnu_cxx::operator!=<int*, std::vector<int, std::allocator<int> > >(__gnu_cxx::__normal_iterator <int*, std::vector<int, std::allocator<int> > > const&, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > const&)
 8.35      0.09     0.01  3000004     0.00     0.00 __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::__normal_iterator(int* const&)
 8.35      0.10     0.01  2000000     0.00     0.00 __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >::operator++()
 8.35      0.11     0.01  1000001     0.00     0.00  std::vector<int, std::allocator<int> >::size() const
 8.35      0.12     0.01  1000000     0.00     0.00  std::vector<int, std::allocator<int> >::operator[](unsigned long)
 4.18      0.12     0.01  2000004     0.00     0.00 __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::base() const

For those who are too lazy to scroll across (myself included), the function taking up 20% of the CPU time is… vector::end()! What’s happening is that we’re constantly constructing and returning an iterator to the last element in r_iter, and it’s hurting us, time-wise. Since the end element never changes, it makes no sense to constantly recalculate it. We should store it for future generations of the loop to enjoy, much as one might put the most recent copy of People Magazine in a time capsule for their children to enjoy the tales of celebrity compound-name relationships.

So, we rewrite our loop as follows:

vector<int>::const_iterator a_iter = a.begin();
vector<int>::const_iterator b_iter = b.begin();
 
vector<int>::iterator r_iter = result.begin();
vector<int>::iterator r_end = result.end();
 
for(; r_iter != r_end; ++r_iter)
{
   *r_iter= (*a_iter) * (*b_iter);
   ++a_iter;
   ++b_iter;
}

and we observe another speed increase.

How much of a speed increase? The test program that I wrote in order to get these outputs took 5.11 seconds to run unoptimized. After profiling, it runs in 3.70 seconds. Not bad for a day’s work!

This hints on a broader lesson: reduce the crap in your loops as much as possible. Move it all as far outside as possible, and calculate it as few times as you can.

Be Nice to the Cache

While profiling can give us extraordinary speed increases solely by using innocuous code, it is not perfect. There is a much better way to write the above code that allows it to perform the same calculation with fewer cache misses. This rewrite would never ever be noticed solely through gprof.

This article is an excellent overview of how modern CPUs implement the cache, and it certainly taught this Jake new tricks. I recommend that you at least glance through it in order to truly understand how it works.

So how do we write the above loop to better use the cache?

struct calc_info
{
  int data1, data2, result;
};
 
/*
  snip
*/
 
vector<calc_info> data;
 
vector<calc_info>::iterator d_iter = data.begin();
vector<calc_info>::iterator d_end = data.end();
 
calc_info* a;
 
for(; d_iter != d_end; ++d_iter)
{
   a = &(*d_iter);
   a->result = a->data1 + a->data2;
}

The above code finishes in 2.31 seconds. Why does it finish so fast? In the previous version of the program, whenever it calculated one of the result residuals, it took a value from the first vector and added it to a value from the second vector. For sufficiently large vectors, the values being calculated wouldn’t be anywhere NEAR each other in the memory. They wouldn’t even be in the same galaxy. It might even need to pause and fetch data from the disk. That’s a HUGE speed killer. The same thing could happen when it stores the residual as a result. Cache misses galore!

The rewritten method puts all data that is being calculated next to each other. There is rarely the need for the cache to load new data, because all of the data was already transmitted to the cache, and is sitting in memory next to each other. The profiler never would have turned it up.

So should we avoid profiling?

No. We got some excellent speed increases without having to resort to unsafe arrays, and we did it in next to no time just by using a profiler!

Is profiling a complete substitute for experience?

No. We got a huge speed boost just by knowing how the cache works, and I’m 100% certain that I missed details that would let it perform in even less time.

Popularity: 23% [?]

Comments

2 Responses to “Basic Speed Optimizations for the Evil”

  1. Bob Miller on October 8th, 2007 10:21 am

    What platform are you using? I get roughly the opposite of your result: the last version is slowest, and the first version is fastest. gprof doesn’t show any million-call functions; they’ve all been inlined.

    The first version is also a huge win in readability.

    I’m using GCC 4.1.2 on Linux on a Core 2 Duo.

    Here’s how to reproduce my results.

    $ wget http://chezgeek.euglug.net/~kbob/jake.c++
    $ c++ -O2 jake.c++
    $ ./a.out
    add1: 3.02607 seconds
    add2: 3.0398 seconds
    add3: 5.85639 seconds
    $

    The whole test program is at that URL and should be self-explanatory.

  2. Jake on October 8th, 2007 5:59 pm

    If you run the test without optimization, you will see why I listed it as an improvement. I’ve been training for the ACM New York Regional Collegiate Programming Contest, and they don’t traditionally run any optimizations. This article was based on my experiences of training with that, and it didn’t occur to me to try to run it with optimization (since we’re not allowed, essentially), in which case, the method is certainly slower. Thank you for your time and effort.

Leave a Reply