Floating Point Limitations

It's common knowledge that floating point math is not perfect. In this article I hope to explain the issues you are most likely going to encounter. In a previous article I went over Floating Point Basics, if you haven't read it then I suggest you do (or read any of the many other resources on IEEE 754 floating point.)

Unrepresentable numbers

These are numbers that cannot be represented in base two regardless of precision. All integers can be represented so we will ignore those.

In the previous article we defined the floating point format as 3 parts: A sign, an exponent and a mantissa. Neither sign nor exponent will come into play in this section so we will be focusing solely on the mantissa which is essentially a binary number in decimal notation.

Decimal notation is a horrible term here, do not be confused by the implication that it is base 10. It refers to the existence of a decimal point ($a.bdc$) and has no necessary relation to base 10.

Decimal Notation as Fractions

Before we begin it is important that we understand that the set of numbers that can be represented by fractions is a superset of the set of numbers that can be represented in decimal notation. That is to say that any number that can be represented in decimal notation can be represented as a fraction but the opposite is not necessarily true.

We do not consider numbers that have a never ending repeating pattern to be representable in decimal notation ( e.g. $0.33333...$ )

Demonstrating this is fairly simple: If we set the denominator to the base to the power of the number of digits after the decimal point and set the numerator to the original number multiplied by the denominator we get a fraction. For the specific example $1.234$ we set our denominator to $10^3$ and our numerator to $1.234 * 10 ^ 3 = 1234$ which gives us $1234/1000$.

This leads us to our first set of unrepresentable numbers. If we can't represent a number in a fraction then we definitely can't represent it as a decimal point. This set of numbers is referred to as irrational numbers and includes numbers like $\pi$ and $e$.

Further explanation of irrational numbers is beyond the scope of this article. They are the least common of the unrepresentable numbers and, honestly, I don't know enough about them to do them justice.

Unrepresentable fractions in base 10 decimal notation

With those out of the way we'll look into what numbers can be represented as fractions but not as numbers in decimal notation. We'll first consider examples in base 10 as that's what most people are used to thinking in. To be precise the examples we'll look at now are:

$$1/3 = 0.333333...$$ $$1/10 = 0.1$$ $$1/2 = 0.5$$

One of these is not like the other. $1/3$ results in a repeating pattern which means that, for any finite precision, decimal format can not represent it. Why is that?

In order to answer that we'll first look at why $1/10$ representable? It's representable because decimal notation is in fact a summation of values multiplied by powers of the base. By that I mean for a decimal number $a.bcd$ the value of that number is:
$$(a * 10 ^ 0) + (b * 10 ^ {-1}) + (c * 10 ^ {-2}) + (d * 10 ^ {-3})$$

If we set $b = 1$ and everything else to $0$ the equation is reduced to
$$ 1 * 10 ^ {-1} = 10 ^ {-1} = 1/10 = 0.1$$

Moving on to our other representable number we're presented with a slightly more complicated example. The denominator for $1/2$ is not a power of $10$ but, the factors for the denominator ($2$) is a subset of the factors of $10$. Using this knowledge we just multiply both our denominator and our numerator by the missing factors in order to represent the fraction as one with a power of $10$ as the denominator. In this example, the factors for $10$ are $2$ and $5$, out denominator of $2$ is missing $5$ as a factor so we multiply both numerator and denominator by that:

$$ 1/2 * 5/5 = 5/10 = 0.5 $$

$5/5$ is of course just $1$ but we list is explicitly here to demonstrate our logic

What this means is that any fraction can be represented as in decimal notation as long as it can be represented by a fraction with a power of the base as the denominator. This occurs when:

  • The denominator is already a power of the base
  • The denominator's factors are a subset of the factors of a power of the base.

A good example of this is $3/15$. $15$'s factors are $3$ and $5$, these are not a subset of the factors of any power of the base ($10$). However $3/15$ can also be represented as $1/5$ which has a denominator whose factor ($5$) is a subset of the factors of the base so we multiply the numerator and denominator by the missing factors.

$$3/15 = 1/5 = 1/5 * 2/2 = 2/10 = 0.2$$

Application to Binary numbers

Returning to the world of IEEE 754 floating point means returning to a base $2$ world. Everything is simpler here because our base has only one factor (excluding $1$), itself. Quickly going through the numbers above:

IEEE 754 floating point actually includes base 10 variants but we will conveniently ignore these as they are very rarely seen.

  • $1/3$: the factors of the denominator are $1$ and $3$, this isn't a subset of the factors of our base so we know this can't be represented
  • $1/2$: the factors of the denominator are $1$ and $2$, this is the set of the factors of our base so we know this can be represented
  • $1/10$: the factors of the denominator are $1$, $2$ and $5$ this is a superset of the factors of our base so we know this cannot be represented.

Now, what does this mean for these numbers when we use them in code? Well, for $1/2$ it means very little, this can be represented so math using it will be accurate unless we run into precision problems (covered later). For $1/3$ and $1/10$ though, it means that when we try to use these numbers, we're instead going to be using numbers close to but not exactly them. Over time this could cause errors.

float foo = 1.1f; /* As we know from above .1f can not be represented in binary */
bool correct = (foo + foo ) == 2.2f; /* The number closest to 1.1f is actually half the number closest to 2.2f */
bool incorrect = (foo + foo + foo) == 3.3f; /* Here's where it gets tricky, the number closest to 3.3f is actually less than the number closest to 1.1f summed with itself thrice*/

It shouldn't be surprising that $2.2$ and $1.1 + 1.1$ end up being equal. $2.2$ estimation would be exactly the same estimation as $1.1$ except with the exponent one higher.

The trick here is to almost never use equality on floats, always check to see if they are close to each other instead of straight equality.

The last note I'll say on unrepresentable numbers is just how often they occur in code. The factors for $10$ are $1$, $2$ and $5$. If we look at any decimal value as a numerator $N$ and a denominator $D$ where $D$ is always a power of $10$ then $N$ needs to be divisible by $5^{\log D}$. In laymans terms, for every decimal point in base 10, the value after the decimal point needs to be divisible by another 5. $0.x$ needs only be divisible by 5 ( So $0.5$ is the only possible solution) whereas $0.xx$ needs to be divisible by $ 5 * 5 = 25$ ( so $0.25$, $0.50$ and $0.75$ are all possible ).