Floating Point Basics

I love IEEE 754 floating point. It's one of the most ingenious and crucial parts of programming. It's also one of the most misunderstood. Most people know how to use floats, but far fewer seem to know how they work, which is unfortunate considering how error-prone they can be. I'll do my best to explain the format in this post, focusing solely on 32-bit normalized floats.

When the ^ character is used in this article, it means "to the power of" and not "exclusive or".

I should note before we get into this that, outside of a classroom, most of my knowledge of floating point representation comes from Bruce Dawson's posts on the topic. I highly recommend reading them.

Scientific Notation

Floating point is essentially scientific notation for base 2. With that in mind let's have a very quick refresher on what scientific notation is.

Before we get started I should say there are many other resources on scientific notation, many (most?) of which will do a better job of explaining it than I will. This video from Khan Academy should do a pretty good job.

Scientific notation is simply another way to represent real numbers. It follows the format of:

( - ) M * 10 ^ E

where M stands for mantissa and E stands for Exponent.

The mantissa is a decimal number in [1, 10[. For example, 1.0, 2.73344 and 3.1415 are all acceptable mantissae; 0.5, 10.4312 and 100.100 are not acceptable mantissae.

[1, 10[ means any number from 1 to 10 including 1 but not including 10. It's often represented as [1, 10)

The Exponent is simply any integer.

This allows us to represent any nonzero real number. Numbers with a magnitude greater than or equal to 10 have a positive exponent (11, for instance, is 1.1 * 10 ^ 1), while numbers with a magnitude less than 1 have a negative exponent (0.25 is 2.5 * 10 ^ -1). Numbers whose magnitude is in the range [1, 10[ have an exponent of 0. The benefits of this format as it pertains to base 10 numbers are beyond the scope of this article.

You'll often see the * 10 ^ E portion replaced with just the character 'e' or 'E' followed by the number. So 3.14 * 10 ^ 2 becomes 3.14e2.
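To make the mantissa and exponent concrete, here is a minimal Python sketch that leans on the language's own 'e' formatting to do the normalization. The function name to_scientific and its digits parameter are my own inventions for illustration, and it assumes a nonzero input.

```python
def to_scientific(x, digits=6):
    """Split a nonzero number into (mantissa, exponent), |mantissa| in [1, 10)."""
    # The 'e' format already normalizes the mantissa into [1, 10).
    mantissa_text, exponent_text = f"{abs(x):.{digits}e}".split("e")
    sign = -1.0 if x < 0 else 1.0
    return sign * float(mantissa_text), int(exponent_text)


print(to_scientific(11))     # (1.1, 1)
print(to_scientific(0.25))   # (2.5, -1)
print(to_scientific(314.0))  # (3.14, 2)
```

Note that the mantissa gets rounded to the requested number of digits, so this is only a way to look at the notation, not a lossless conversion.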

Binary format

Let's transition this over to binary. If we take a look at our equation from before again with a small modification:

( -1 ) ^ ( S ) * M * 10 ^ E

( -1 ) ^ ( S ) is a convenient way of expressing "negative when set". If S is 0 then ( -1 ) ^ 0 == 1, if S is 1 then ( -1 ) ^ 1 == -1.

We see that there are in fact three parts. The first part, which was largely ignored in the Scientific Notation section, is the sign, which will be represented by a single bit. The second part is the exponent (E), which will be represented by 8 bits. That leaves us with 23 bits for the last part, the mantissa.

32 bits minus 1 for the sign and 8 for the exponent is 23

This gives us the binary format of:

s| |eeeeeeee| |mmmmmmmmmmmmmmmmmmmmmmm|
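If you want to see this layout for a real value, the following Python sketch pulls the three fields apart with shifts and masks. The helper name float_fields is hypothetical; struct is Python's standard module for reinterpreting bytes.

```python
import struct


def float_fields(x):
    """Return (sign, exponent_bits, mantissa_bits) of x stored as a 32-bit float."""
    # Pack as a big-endian 32-bit float, then reread the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                    # top bit
    exponent_bits = (bits >> 23) & 0xFF  # next 8 bits
    mantissa_bits = bits & 0x7FFFFF      # low 23 bits
    return sign, exponent_bits, mantissa_bits


sign, exponent_bits, mantissa_bits = float_fields(-6.5)
print(f"{sign:01b} {exponent_bits:08b} {mantissa_bits:023b}")
# 1 10000001 10100000000000000000000
```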

The equation itself is also going to change. Since we're moving from base 10 to base 2, we swap the 10 for a 2.

( -1 ) ^ ( S ) * M * 2 ^ E

Let's look at how the binary format translates to this equation.

Sign

When the sign bit is set the number is negative. This is the same as multiplying by -1.

Exponent

The exponent is determined by pulling out the 8 exponent bits, taking their unsigned value (from 0 to 255) and subtracting 127 from it. This offset is what allows the exponent to be negative.

Note: This is different from two's complement which is how signed integers are represented.
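As a rough illustration of the subtraction, this Python sketch reads the stored 8-bit value and removes the 127 offset. The names BIAS and unbiased_exponent are mine, and the result is only meaningful for normalized floats.

```python
import struct

BIAS = 127  # the offset subtracted from the stored exponent bits


def unbiased_exponent(x):
    """Return the power of two encoded in x when stored as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    stored = (bits >> 23) & 0xFF  # unsigned value, 0 to 255
    return stored - BIAS


for x in (0.25, 1.0, 11.0):
    print(x, unbiased_exponent(x))
# 0.25 -2
# 1.0 0
# 11.0 3
```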

Mantissa

If you recall from above, the mantissa is any number in [1, 10[. Now that we're in base 2 that becomes [1, 2[, which means that the first digit is always a 1. We use that to our advantage by having an implicit 1 at the beginning of the mantissa, giving us 24 bits of precision using 23 bits.
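Here is a small Python sketch of the implicit 1 in action: it puts the missing bit back in front of the 23 stored bits and scales the result into [1, 2[. The function name significand is hypothetical, and the sketch assumes a normalized float.

```python
import struct


def significand(x):
    """Return the mantissa of x (as a 32-bit float) with the implicit 1 restored."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    stored = bits & 0x7FFFFF         # the 23 bits that are actually stored
    with_implicit_one = 0x800000 | stored  # 24 bits once the leading 1 is added
    return with_implicit_one / 2**23       # scale into [1, 2[


print(significand(11.0))  # 1.375, since 11 == 1.375 * 2 ^ 3
print(significand(0.25))  # 1.0,   since 0.25 == 1.0 * 2 ^ -2
```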

Special Values

We're ignoring a few special values here. An exponent field of 0xFF gives us either infinity (if the mantissa bits are all 0) or one of the many values that represent NaN (if any mantissa bit is 1). An exponent field of 0x00 gives us either 0 (if the mantissa bits are all 0) or what is called a denormalized float, which is beyond the scope of this article (math operations on denormalized floats can be much slower, so denormalized floats are generally disabled in games).
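The rules above fit in a few lines of Python. This is just a sketch working on raw 32-bit patterns; classify is a made-up name, and it ignores the sign bit entirely.

```python
def classify(bits):
    """Classify a raw 32-bit pattern by its exponent and mantissa fields."""
    exponent_bits = (bits >> 23) & 0xFF
    mantissa_bits = bits & 0x7FFFFF
    if exponent_bits == 0xFF:
        return "infinity" if mantissa_bits == 0 else "NaN"
    if exponent_bits == 0x00:
        return "zero" if mantissa_bits == 0 else "denormalized"
    return "normalized"


print(classify(0x7F800000))  # infinity
print(classify(0x7FC00000))  # NaN
print(classify(0x00000000))  # zero
print(classify(0x00000001))  # denormalized
print(classify(0x3F800000))  # normalized
```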

Bringing it all together

Taking all the information from above (ignoring the special values), we get the formula for interpreting the IEEE 754 32-bit floating point format, where S is the sign bit, E is the unsigned value of the 8 exponent bits and M is the 23 mantissa bits.

value = ( ( -1 )^( S ) ) *
        1.M *
        (2 ^ ( E - 127));

1.M represents the implicit leading bit that comes from the [1, 2[ interval of acceptable values for our mantissa. Treating M as a 23-bit integer, another way of writing 1.M is (0x800000 + M) / 2 ^ 23.
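To check the formula, here is a Python sketch that evaluates it directly on a raw bit pattern and compares the result with what struct reports for the same float. The names decode_float32 and float_to_bits are my own, and like the rest of this post the sketch refuses the special exponent values rather than handling them.

```python
import struct


def decode_float32(bits):
    """Evaluate (-1)^S * 1.M * 2^(E - 127) for a normalized 32-bit pattern."""
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    if e == 0x00 or e == 0xFF:
        raise ValueError("special values are not handled in this sketch")
    one_point_m = (0x800000 + m) / 2**23  # restore the implicit 1, scale to [1, 2[
    return (-1) ** s * one_point_m * 2.0 ** (e - 127)


def float_to_bits(x):
    """Return the 32-bit pattern of x stored as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


for x in (1.0, -6.5, 0.15625):
    assert decode_float32(float_to_bits(x)) == x

print(decode_float32(0x40490FDB))  # the 32-bit float closest to pi
```

Python's own floats are 64 bits wide, so every normalized 32-bit value fits exactly and the comparison can use plain equality.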