Floating Point Foibles
Posted on February 2, 2008
Filed Under Computer Science, Programming, Software Development |
At the end of the day, calculation is one of the most important parts of programming. It’s why computers were invented: a computer can work 24 hours a day on any problem with extremely high accuracy. Humans are slower, less accurate, and far more likely to complain about not being fed or having breaks.
The tool of the trade is the floating point number, and it is important to know their strengths and weaknesses. To get a better handle on errors, we need to know how they are represented.
Please note that all examples will be done with the float data type, and all errors presented below are also applicable to doubles.
Representing floating point numbers
Numbers can be expressed a multitude ways. Any programmer worth his salt can quickly tick off four ways that the number 16 (base 10) can be represented in different bases.
There are more ways to represent different numbers than in different bases. In addition to number systems, it is possible to generalize any number to the following:
A General Representation of Numbers
Scientific Notation is stored in this fashion. To use an example,
-.000314 = -3.14 * 
This has a sign, (it’s negative), a mantissa (3.14), a base (10), and an exponent (-4).
Computers represent information in a very similar manner, but with a few modifications: the base of all floating point numbers is fixed to 2, and the mantissa can only be between zero and one. So our floating point model now looks something like this:
Simplified Computer Floating Point Model
Where the mantissa is in the interval [0, 1)
But there is a detail that we need to take care of. The mantissa can be between 0 and 1 (without being 1), and we need to be able to represent very small quantities as well as very large quantities with the exponent. We could store the exponent as the "twos-complement" number, but then in order to compare exponents, we would need to do extra work than direct comparison. So we use a method called "Biasing" to offset the problem, so that the number is always stored as a direct integer.
Better Computer Floating Point Model
Where the mantissa is in the interval [0, 1), and the sign is stored as one bit, exponent as 8 bits, and the mantissa is 23 bits.
In order to always produce unique numbers, we need to define that the highest order bit of the mantissa is always one. Since it will always be one, there is no need to store the number, giving us an extra bit of precision.
Loss of Precision Errors
Suppose we are given the task of calculating the following function:
This might cause you grief:
Aside from being careful at x=0, this seems like its a perfectly decent calculation, right?
To see where this calculation might go wrong, let’s consider an easier problem with an easier number system:
Calculation in a simple number system
Our number system will just be a single decimal place and three fixed-point values. When we have extra precision, we will truncate instead of round. One such number is 1.234. Let’s look at the problem
.
For some values of x, such as 2.000, we get
However, let’s look at our value for x = 1.001:
Whether or not our number system rounds or truncates, we just accumulated an error. After only one calculation, we were off by .001. If we did a thousand calculations, we could be off by 1!
So what happened? As the name of the section promises, we lost precision. At every step, we needed to truncate the result to x.xxx in order to match our results, and by the end the errors accumulated to be significant enough to cause us problems. This mainly occurs when we are subtracting two values that are very close to each other in value (or adding two values that are of opposite signs, where the absolute value is small). In the simple example, I exacerbated the situation by also dividing by a value close to zero. Yes, I’m a jerk, but you have to admit that the equation looked pretty innocent.
When subtracting, it is important to rearrange equations so that you are not subtracting values that are very close to each other. Otherwise, you will introduce error into your calculations. For example,
can be rewritten as
and reduce the error introduced.
Magnitude Error
Let’s do a common task: compute the following sum:
This might cause you grief:
There are a bunch of ways that you can compute this, but I will pick two (contrived, but possible) versions that will help me prove my point. Let’s say that we want to sum all of the numbers from 1 to 7000. We will do it from 1 to 7000 and from 7000 to 1. We will compare the answers with the result of the closed-form evaluation
.
From 7000 to 1: 24505464 From 1 to 7000: 24502896 Actual: 24503500
Download the code for the example
Ouch, there is a difference! We’re just adding numbers, dammit! Why is this different?
It is different because floating point numbers are only able to correctly display precision to a certain degree. If we’re adding 1,000,000,000 and .0000000000001, the result is 1,000,000,000.0000000000001, and it’s easy to see that the truncation is going to occur at some point before the least significant digit of information. There is going to be a decision that is made whenever you need to truncate whether or not to round, and in this case, it appears that the extra rounding is giving us a positive error.
We should also note that there is less error summing from 1 to 7000 than there is from 7000 to 1, and that will always hold true in general. When you sum from smallest to largest, you will need to round less than when you sum from largest to smallest.
In order to reduce your floating point error when summing, always sum your inputs from smallest to largest.
Other Errors
There are some errors that are more obvious that can occur, so I will list them here for sake of completeness.
- Overflow and Underflow: Dividing by a number that is very close to zero, for example, may make your result become larger than what floating point numbers can represent. The floating point number has overflowed, and will be represented by a special value that means that the number is at or “near” infinity. Likewise, one might multiply by a number that is so small that the floating point number can’t be represented. The number will then be represented by either a very small value, or by zero.
- Approximation Error: Some numbers can’t be represented exactly in floating point, so they are stored as approximations.
- Output Error: Conversion from binary to decimal can introduce small errors (attempt to correctly display numbers that can’t be quantized, for example) that can be compounded if the conversion is frequent, or happens a lot over time.
Popularity: 21% [?]
Comments
One Response to “Floating Point Foibles”
Leave a Reply





.


[...] bookmarks tagged numbers Floating Point Foibles saved by 5 others kleptodog bookmarked on 02/03/08 | [...]