<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Cleverly Named Programming Blog]]></title><description><![CDATA[Random and poorly written thoughts on the technical side of game development.]]></description><link>https://mattdavidbell.com/</link><generator>Ghost 0.8</generator><lastBuildDate>Wed, 06 May 2026 11:37:20 GMT</lastBuildDate><atom:link href="https://mattdavidbell.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Radix Sort]]></title><description><![CDATA[<p>Radix Sort has always been my favourite sort.  I learned it while infatuated with bitwise operations and the arcane tricks that I was learning at the time.  It also stands out as being a sort without any comparison between elements.</p>

<p>We'll explore using Radix sort with 3 different types of</p>]]></description><link>https://mattdavidbell.com/radix-sort/</link><guid isPermaLink="false">966906e5-3867-4d13-bdc6-36324364b657</guid><dc:creator><![CDATA[Matthew Bell]]></dc:creator><pubDate>Sun, 30 Jul 2017 02:25:32 GMT</pubDate><content:encoded><![CDATA[<p>Radix Sort has always been my favourite sort.  I learned it while infatuated with bitwise operations and the arcane tricks that I was learning at the time.  It also stands out as being a sort without any comparison between elements.</p>

<p>We'll explore using Radix sort with 3 different types of data: the super simple unsigned 32 bit integer, the slightly less simple signed 32 bit integer and the fun 32 bit floating point.  We'll discuss how Radix sort works and use it as an example of how knowing the format of the data you're working with can come in handy.</p>

<h3 id="thetheorybehindit">The theory behind it</h3>

<p>The Radix Sort implementation we'll be looking at is based on a simple idea, best described as sorting elements into buckets using one portion of each element at a time as the key.  The easiest way to demonstrate this is sorting base 10 numbers one digit at a time.  For base 10 digits this means we'll start with the rightmost digit first as it represents the Least Significant value; we'll discuss why later.</p>

<blockquote>
  <p>The term significance is going to be used a lot.  Significance describes how much of an effect a portion of an element has in relation to other portions.  In base 10, the digit in the 100s position has more significance than the digit in the 10s position as it contributes more to the value of the number as a whole.  Said another way, a 1 in the 100s position is worth more than a 1 in the 10s position.  It works the same for bits and bytes: values further "left" have greater significance.</p>
</blockquote>

<p>We start with the following values: 10, 45, 100, 9.  We then go through each and sort based on the rightmost digit, keeping the order consistent when values share the same digit (as is the case for 10 and 100, which both have 0 in the rightmost position).  We do this for each digit, treating any non-present digit as 0 (meaning, for instance, in 9 we treat the second digit from the right as 0):</p>

<p>010, 045, 100, 009 <br>
01<strong>0</strong>, 10<strong>0</strong>, 04<strong>5</strong>, 00<strong>9</strong> <br>
1<strong>0</strong>0, 0<strong>0</strong>9, 0<strong>1</strong>0, 0<strong>4</strong>5 <br>
<strong>0</strong>09, <strong>0</strong>10, <strong>0</strong>45, <strong>1</strong>00</p>

<p>We start with the Least Significant value because our algorithm relies on the <em>stable</em> nature of each pass of the sort.  What this means is that each pass will not change the relative order of elements that share the same digit.  If we had 46 in the above example, it would have been sorted after 45 in the first pass and the second and third passes would not have changed its position in relation to 45.  36, on the other hand, would start out sorted after 45 in the first pass but its position would be corrected in the second pass.  If you're unconvinced, I suggest taking the examples above and sorting by the leftmost digit first; you'll end up with 100, 010, 045, 009.</p>
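<p>If you want to see that stability requirement in code, here's a minimal sketch of the digit-at-a-time idea using <code>std::stable_sort</code> for each pass (the function name is mine, and this is not the bucket approach we build below):</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sort base-10 numbers one digit at a time, least significant digit first.
// Each pass must be stable, otherwise earlier passes would be undone.
void DigitSort(std::vector<unsigned>& values)
{
    for (unsigned divisor = 1; divisor <= 100; divisor *= 10)
    {
        std::stable_sort(values.begin(), values.end(),
            [divisor](unsigned a, unsigned b)
            { return (a / divisor) % 10 < (b / divisor) % 10; });
    }
}
```

<p>Swap <code>std::stable_sort</code> for <code>std::sort</code> (which is not stable) and the final order is no longer guaranteed.</p>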

<blockquote>
  <p>Our algorithm iterates over each byte and, within that iteration, iterates over each element twice.  I'll endeavour to refer to the internal element iterations as first and second "pass" or "per-byte pass".  What this breaks down as is a structure of <code>for(bytes) { for (elements) {} for (elements){} }</code></p>
</blockquote>

<p>Now all we need to do is extend that to the specific data that we have.</p>

<h3 id="unsignedint">Unsigned Int</h3>

<h4 id="whyunsignedfirst">Why unsigned first</h4>

<p>A lot of people dissuade programmers from using unsigned ints.  I don't agree with that advice; I'd say instead that you should never (unless you know exactly what you're doing) mix signed and unsigned ints.  When kept separate, however, they each have their own advantages.</p>

<p>The plus side to unsigned integers is that they behave how you expect when doing bitwise operations and, of the data types we're considering, they are the least abstracted from their raw data layout.  We'll discuss these differences more when we're dealing with signed integers.  It's this low level of abstraction that makes them the best data type to implement this with first.</p>

<p>We know how Radix Sort is supposed to work on base 10 numbers: going digit by digit.  In our implementation we're going to go byte by byte instead of digit by digit.  There's no reason you couldn't divide up your elements differently; in the above base 10 sample you could, for instance, do 2 digits at a time.  Bytes are nice in that it's easy for others to determine what we're doing (<code>x &amp; 0xffu</code> is a very common operation) and they are large enough to speed up the algorithm significantly.</p>

<blockquote>
  <p><code>x &amp; 0xffu</code> is a bitwise AND against the Least Significant Byte.  What this means is that it will keep whatever value is in that byte and set all other bytes to 0.  The <code>u</code> at the end of <code>0xffu</code> designates the literal as unsigned.</p>
</blockquote>

<p>There's also the matter of concrete implementation details.  The way we're going to approach this is to do two sub-passes per byte.  The first pass will go through and keep track of how many elements belong in each bucket.  We will then take this counts array and turn it into an array of starting indices.  With that in hand we will go through our elements again (pass two now) and place each one at the correct index based on its bucket, incrementing that bucket's index each time something is placed in it.</p>

<p>A good way to understand this two pass approach is to consider the case where we're sorting unsigned bytes instead of 32 bit unsigned integers.  In that case you could just count the 0s, 1s, 2s...255s and then store however many 0s there are, then however many 1s, and so on and so forth in your array.  That's essentially what we do on a per byte basis, except we also preserve extra information (the other 3 bytes of the 32 bit unsigned integer).</p>
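<p>As a concrete sketch of that single-byte case (my own illustrative code, not part of the implementation below), a counting sort over unsigned bytes looks like this:</p>

```cpp
#include <cassert>
#include <cstdint>

// Counting sort for unsigned bytes: tally each value, then rewrite the
// array in ascending order.  Radix Sort does this once per byte while
// also preserving the other bytes of each element.
void CountingSortBytes(uint8_t* arr, uint32_t n)
{
    uint32_t counts[256] = {};
    for (uint32_t i = 0; i < n; ++i)
        counts[arr[i]]++;

    uint32_t out = 0;
    for (uint32_t value = 0; value < 256; ++value)
        for (uint32_t c = 0; c < counts[value]; ++c)
            arr[out++] = static_cast<uint8_t>(value);
}
```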

<p>With all that in mind, here's the code:</p>

<pre><code>void RadixSort(u32* intArr, u32 numInts)
{
    // We will store the counts in here and then "accumulate" them in order to
    // turn them into iterators
    u32 iters[256];
    // Our implementation is not inplace so we need a swap buffer
    u32* swap = new u32[numInts];

    // 4 passes, one per byte
    for (u32 idx = 0; idx &lt; 4; ++idx)
    {
        // Set all iterators/counts to 0
        ClearCounts(iters);
        // Increment count for corresponding bucket
        for (u32 jdx = 0; jdx &lt; numInts; ++jdx)
        {
            u32 bucket = (intArr[jdx] &gt;&gt; (idx * 8)) &amp; 0xffu;
            iters[bucket]++;
        }

        // Translate counts into iterator
        AccumulateCounts(iters);
    // Place values at their proper spot based on their bucket's iterator
        for (u32 jdx = 0; jdx &lt; numInts; ++jdx)
        {
            u32 value = intArr[jdx];
            u32 bucket = (value &gt;&gt; (idx * 8)) &amp; 0xffu;

            u32 loc = iters[bucket]++;
            swap[loc] = value;
        }

        // Swap arrays to get ready for next pass
        u32* temp = intArr;
        intArr = swap;
        swap = temp;
    }

    // Clean up allocated memory
    delete[] swap;
}
</code></pre>

<p>Now, that's some pretty dense code so don't worry if you don't understand it right away.  Let's step through it slowly.</p>

<h4 id="externalsetup">External Setup</h4>

<p>First we have some simple setup.  Our algorithm keeps counts per byte value; there are 256 possible values for a single byte so we create an array of 256 unsigned integers to keep track of those counts.  The algorithm also isn't in place so we need to create a new buffer that we can write to while reading from our original array.</p>

<pre><code>// We will store the counts in here and then "accumulate" them in order to
// turn them into iterators
u32 iters[256];
// Our implementation is not inplace so we need a swap buffer
u32* swap = new u32[numInts];
</code></pre>

<p>Next we are going to do the sub-algorithm for each byte in the elements.  Our elements are unsigned 32 bit integers which have 4 bytes (a byte being 8 bits) each, therefore we need to iterate 4 times: once per byte.</p>

<pre><code>// 4 passes, one per byte
for (u32 idx = 0; idx &lt; 4; ++idx)
</code></pre>

<h4 id="firstperbytepass">First Per-byte Pass</h4>

<p>Our per-byte algorithm can be split into two iterations over the elements.  Before our first pass we clear out our counts buffer; this is simply iterating through and zeroing out every element in it (we don't show the code here due to its trivial nature).  Once done, we iterate through every element, grab the appropriate byte and increment the count for that byte.</p>

<pre><code>// Set all iterators/counts to 0
ClearCounts(iters);
// Increment count for corresponding bucket
for (u32 jdx = 0; jdx &lt; numInts; ++jdx)
{
    u32 bucket = (intArr[jdx] &gt;&gt; (idx * 8)) &amp; 0xffu;
    iters[bucket]++;
}
</code></pre>

<p>Before we go any further let's look at <code>u32 bucket = (intArr[jdx] &gt;&gt; (idx * 8)) &amp; 0xffu;</code> in a bit more depth.  This is how we're grabbing the appropriate byte.  In order of operations we have <code>idx * 8</code>, which simply turns the next step from a bitwise operation into a bytewise one: a byte is simply 8 bits.  Next is <code>intArr[jdx] &gt;&gt; (idx * 8)</code>: here we grab the element in question, <code>intArr[jdx]</code>, and shift it to the right by as many bytes as our outer loop index.  This means that each outer loop iteration will have the appropriate byte shifted into the Least Significant Byte position.  Which brings us to the last step: <code>&amp; 0xffu</code> masks out all but the Least Significant Byte, setting all other bytes to 0.</p>

<p>To summarize: We calculate the right shift necessary to bring the appropriate byte into the Least Significant Byte position, we execute that right shift and then we set all but the Least Significant Byte to 0.</p>
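<p>Here's that extraction as a tiny standalone helper (the name is mine, for illustration) applied to a known value:</p>

```cpp
#include <cassert>
#include <cstdint>

// Extract byte `idx` (0 = least significant) from a 32 bit value by
// shifting it into the low position and masking everything else off.
uint32_t ExtractByte(uint32_t value, uint32_t idx)
{
    return (value >> (idx * 8)) & 0xffu;
}
```

<p>Running this over <code>0x12345678</code> pulls out <code>0x78</code>, <code>0x56</code>, <code>0x34</code> and <code>0x12</code> for indices 0 through 3.</p>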

<h4 id="accumulation">Accumulation</h4>

<p>Next we accumulate the counts in such a way that instead of having counts in the counts buffer we have iterators.</p>

<pre><code>// Translate counts into iterator
AccumulateCounts(iters);
</code></pre>

<p>Unlike the <code>ClearCounts</code> function we will quickly go over the <code>AccumulateCounts</code> function.</p>

<pre><code>void AccumulateCounts(u32 counts[256])
{
    u32 increment = counts[0];
    counts[0] = 0;
    for (u32 idx = 1; idx &lt; 256; ++idx)
    {
        u32 nextIncrement = counts[idx];
        counts[idx] = counts[idx - 1] + increment;
        increment = nextIncrement;
    }
}
</code></pre>

<p>The purpose of this function is to convert counts into iterators.  We do this by setting every element to the sum of all the previous elements.  To achieve this we store the count at the index in question into a temporary variable, then set the value at the index to the previous iteration's result (or 0 in the case of the first element) plus the <code>increment</code> variable from the previous iteration, and finally store our temporary variable into the <code>increment</code> variable for the next iteration.</p>

<p>As an example let's look at our decimal numbers from before: 010, 045, 100, 009.  After the first pass we would have a counts array of 2 (010 and 100), 0, 0, 0, 0, 1 (045), 0, 0, 0, 1 (009).  After <code>AccumulateCounts</code> we would have a counts array of 0, 2, 2, 2, 2, 2, 3, 3, 3, 3.  This allows us to store 010 and 100 in the first 2 indices (0 and 1), 045 in the 3rd index (2) and 009 in the last index (3).  If you don't immediately understand this don't worry, the next bit of code we look at implements it.</p>
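<p>You can verify that trace directly.  Here's the same accumulation as a sketch generalized to any bucket count (the template wrapper is mine), which we can run over the 10 bucket decimal example:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same exclusive prefix sum as AccumulateCounts, over N buckets:
// each slot becomes the sum of all the counts before it.
template <std::size_t N>
void Accumulate(uint32_t (&counts)[N])
{
    uint32_t increment = counts[0];
    counts[0] = 0;
    for (std::size_t idx = 1; idx < N; ++idx)
    {
        uint32_t next = counts[idx];
        counts[idx] = counts[idx - 1] + increment;
        increment = next;
    }
}
```

<p>Feeding it the counts {2, 0, 0, 0, 0, 1, 0, 0, 0, 1} yields the iterators {0, 2, 2, 2, 2, 2, 3, 3, 3, 3} from the walkthrough above.</p>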

<h4 id="secondperbyteiteration">Second Per-byte iteration</h4>

<pre><code>// Place values at their proper spot based on their bucket's iterator
for (u32 jdx = 0; jdx &lt; numInts; ++jdx)
{
    u32 value = intArr[jdx];
    u32 bucket = (value &gt;&gt; (idx * 8)) &amp; 0xffu;

    u32 loc = iters[bucket]++;
    swap[loc] = value;
}
</code></pre>

<p>Above is the second pass of our per byte algorithm.  For each element we find which bucket it belongs to using the calculation you saw in the first pass, <code>(value &gt;&gt; (idx * 8)) &amp; 0xffu</code>; we use that bucket to index into the counts buffer, grab the current value and then increment it.  This is what we mean when we say the <code>counts</code> buffer becomes a buffer of iterators.  We use the value we just grabbed as an index into the <code>swap</code> buffer and set the value at that index to the element we're processing.</p>

<h4 id="setupfornextiterationandcleanup">Setup for next iteration and cleanup</h4>

<p>The last part of the per byte algorithm is simply to swap the buffers so that they are ready for the next iteration.</p>

<pre><code>// Swap arrays to get ready for next pass
u32* temp = intArr;
intArr = swap;
swap = temp;
</code></pre>

<p>And once we have processed each byte we free the memory we previously allocated at the beginning of the function.</p>

<pre><code>// Clean up allocated memory
delete[] swap;
</code></pre>

<h4 id="deeperdiveintoourdecimalexample">Deeper dive into our decimal example</h4>

<p>With all this in mind let's look at our decimal example again and go through the algorithm in a little more detail. We start with the same 4 numbers as before: 010, 045, 100, 009.</p>

<p>First, our counts buffer: in this case it's just an array of 10 as there are only 10 different digit values (as opposed to 256 byte values).  We go through and count the number of times each digit occurs, which gives us: <br>
2 (01<strong>0</strong> and 10<strong>0</strong>), 0, 0, 0, 0, 1 (04<strong>5</strong>), 0, 0, 0, 1 (00<strong>9</strong>).</p>

<p>Next we turn this into an accumulation buffer using the algorithm we went over above: <br>
0, 2, 2, 2, 2, 2, 3, 3, 3, 3</p>

<p>We now go through our elements again.</p>

<p>With 01<strong>0</strong> we look at the 0th element of the counts array, use that as our index into the swap buffer and then increment it.  Resulting in the following setup: <br>
Swap = {<strong>010</strong>, xxx, xxx, xxx} <br>
Counts = {<strong>1</strong>, 2, 2, 2, 2, 2, 3, 3, 3, 3}</p>

<p>With 04<strong>5</strong> we do the same with the 5th element, which is originally a 2: <br>
Swap = {010, xxx, <strong>045</strong>, xxx} <br>
Counts = {1, 2, 2, 2, 2, <strong>3</strong>, 3, 3, 3, 3}</p>

<p>10<strong>0</strong> is similar to 010 but the value at the 0th index is now 1 instead of 0: <br>
Swap = {010, <strong>100</strong>, 045, xxx} <br>
Counts = {<strong>2</strong>, 2, 2, 2, 2, 3, 3, 3, 3, 3}</p>

<p>Finally with 00<strong>9</strong>: <br>
Swap = {010, 100, 045, <strong>009</strong>} <br>
Counts = {2, 2, 2, 2, 2, 3, 3, 3, 3, <strong>4</strong>}</p>

<p>Then it's on to the next iteration, using the previous Swap buffer as our array and looking at the next digit.</p>

<p>Counts: 2 (1<strong>0</strong>0 and 0<strong>0</strong>9), 1(0<strong>1</strong>0), 0, 0, 1 (0<strong>4</strong>5), 0, 0, 0, 0, 0 <br>
After Accumulation: 0, 2, 3, 3, 3, 4, 4, 4, 4, 4</p>

<p>0<strong>1</strong>0 <br>
Swap = {xxx, xxx, <strong>010</strong>, xxx} <br>
Counts = {0, <strong>3</strong>, 3, 3, 3, 4, 4, 4, 4, 4}</p>

<p>1<strong>0</strong>0 <br>
Swap = {<strong>100</strong>, xxx, 010, xxx} <br>
Counts = {<strong>1</strong>, 3, 3, 3, 3, 4, 4, 4, 4, 4}</p>

<p>0<strong>4</strong>5 <br>
Swap = {100, xxx, 010, <strong>045</strong>} <br>
Counts = {1, 3, 3, 3, <strong>4</strong>, 4, 4, 4, 4, 4}</p>

<p>0<strong>0</strong>9 <br>
Swap = {100, <strong>009</strong>, 010, 045} <br>
Counts = {<strong>2</strong>, 3, 3, 3, 4, 4, 4, 4, 4, 4}</p>

<p>And the last digit <br>
Counts: 3 (<strong>0</strong>09 and <strong>0</strong>10 and <strong>0</strong>45), 1(<strong>1</strong>00), 0, 0, 0, 0, 0, 0, 0, 0 <br>
After Accumulation: 0, 3, 4, 4, 4, 4, 4, 4, 4, 4</p>

<p><strong>1</strong>00 <br>
Swap = {xxx, xxx, xxx, <strong>100</strong>} <br>
Counts = {0, <strong>4</strong>, 4, 4, 4, 4, 4, 4, 4, 4}</p>

<p><strong>0</strong>09 <br>
Swap = {<strong>009</strong>, xxx, xxx, 100} <br>
Counts = {<strong>1</strong>, 4, 4, 4, 4, 4, 4, 4, 4, 4}</p>

<p><strong>0</strong>10 <br>
Swap = {009, <strong>010</strong>, xxx, 100} <br>
Counts = {<strong>2</strong>, 4, 4, 4, 4, 4, 4, 4, 4, 4}</p>

<p><strong>0</strong>45 <br>
Swap = {009, 010, <strong>045</strong>, 100} <br>
Counts = {<strong>3</strong>, 4, 4, 4, 4, 4, 4, 4, 4, 4}</p>

<p>And there we go, a sorted array without a single comparison between the elements.  Now, let's look at how we can modify it to work with signed integers and, finally, floats. </p>

<h3 id="signedints">Signed ints</h3>

<p>We mentioned above that we'd use unsigned integers first because they are less of an abstraction away from the actual data than signed integers.  Specifically, an unsigned integer with a larger value in the Most Significant Byte is larger than one with a smaller value in the Most Significant Byte.  Signed integers, however, have a slight quirk at the Most Significant bit, let's look at that.</p>

<h4 id="twoscomplement">Two's Complement</h4>

<p>Signed integers come with the added complexity of needing to represent negative numbers.  The abstraction <code>int</code>s use is called Two's Complement and all it does is make the most significant bit's value negative.  So, if we had a three bit unsigned int we'd have bit values 4 2 1, but our three bit signed int would have bit values -4 2 1, giving us the values below:</p>

<p>bits = unsigned = signed <br>
000 = 0 = 0 <br>
001 = 1 = 1 <br>
010 = 2 = 2 <br>
011 = 3 = 3 <br>
100 = 4 = -4 <br>
101 = 5 = -3 <br>
110 = 6 = -2 <br>
111 = 7 = -1</p>

<p>This may seem weird but there are some benefits.  First and most obviously, positive numbers have the same representation as their unsigned variant (as you can see in the first four entries above).  Secondly, aside from the transition from 3 to -4, all values increase as the bits increase.  That's all that really matters for the sake of this article.</p>

<h4 id="thecode">The code</h4>

<p>Knowing that only the Most Significant bit behaves differently for signed integers, we can safely run the same algorithm as before over the Least Significant 3 bytes.</p>

<p>The difference is how we treat that last byte.</p>

<pre><code>void RadixSort(i32* signedIntArr, u32 numInts)
{
    // We will store the counts in here and then "accumulate" them in order to
    // turn them into iterators
    u32 iters[256];
    // Our implementation is not inplace so we need a swap buffer
    u32* swap = new u32[numInts];
    // We're going to treat our signedIntArr as an unsigned one
    u32* intArr = reinterpret_cast&lt;u32*&gt;(signedIntArr);

    // 3 passes, one per byte.  The last byte needs to be modified to account
    // for two's complement
    for (u32 idx = 0; idx &lt; 3; ++idx)
    {
        // Set all iterators/counts to 0
        ClearCounts(iters);
        // Increment count for corresponding bucket
        for (u32 jdx = 0; jdx &lt; numInts; ++jdx)
        {
            u32 bucket = (intArr[jdx] &gt;&gt; (idx * 8)) &amp; 0xffu;
            iters[bucket]++;
        }

        // Translate counts into iterator
        AccumulateCounts(iters);
        // Place values at their proper spot based on their bucket's iterator
        for (u32 jdx = 0; jdx &lt; numInts; ++jdx)
        {
            u32 value = intArr[jdx];
            u32 bucket = (value &gt;&gt; (idx * 8)) &amp; 0xffu;

            u32 loc = iters[bucket]++;
            swap[loc] = value;
        }

        // Swap arrays to get ready for next pass
        u32* temp = intArr;
        intArr = swap;
        swap = temp;
    }

    // We modify our last pass in order to account for two's complement.
    // What this boils down to is flipping the Most Significant bit of
    // every element when doing bucket calculations
    ClearCounts(iters);
    for (u32 jdx = 0; jdx &lt; numInts; ++jdx)
    {
        u32 bucket = (intArr[jdx] &gt;&gt; (24)) &amp; 0xffu;
        bucket ^= 0x80;
        iters[bucket]++;
    }

    AccumulateCounts(iters);
    for (u32 jdx = 0; jdx &lt; numInts; ++jdx)
    {
        u32 value = intArr[jdx];
        u32 bucket = (value &gt;&gt; 24) &amp; 0xffu;
        bucket ^= 0x80;

        u32 loc = iters[bucket]++;
        swap[loc] = value;
    }

    u32* temp = intArr;
    intArr = swap;
    swap = temp;

    // Clean up allocated memory
    delete[] swap;
}
</code></pre>

<p>The code is near identical; there is in fact only one real difference: how we treat that last byte.  In each pass we modify the "bucket" with the calculation <code>bucket ^= 0x80</code>.  All that does is flip the Most Significant bit of the byte which, in the case of the Most Significant Byte, is the Most Significant bit of the whole integer.  Keep in mind this calculation only affects our lookup; it does not modify the value that we actually store.</p>

<blockquote>
  <p>There are a lot more lines of code for "The code is near identical"; this is because we pulled one of our byte iterations out of the loop.  If you look, the code outside of the first <code>for</code> loop looks a lot like the code within it.</p>
</blockquote>

<p>Let's look at our 3 bit signed integer and see how this works.  Let's take the numbers -3, 2 and 0, or in binary 101, 010 and 000.  If we flip the most significant bit we get -3 = 001, 2 = 110 and 0 = 100, which matches the order we expect: -3 &lt; 0 &lt; 2 -&gt; 001 &lt; 100 &lt; 110.</p>
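<p>The same trick works at full width.  Here's a small sketch (the helper name is mine) that reinterprets a signed integer's bits, flips the sign bit and yields unsigned keys that sort in signed order:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Map a signed 32 bit integer to an unsigned key whose unsigned
// ordering matches the signed ordering, by flipping the sign bit.
uint32_t SignedKey(int32_t value)
{
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits ^ 0x80000000u;
}
```

<p>The sort itself only flips the bit during bucket calculation on the Most Significant Byte, which has the same effect on ordering without touching the stored values.</p>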

<p>Now, let's take it one step further.</p>

<h3 id="floatingpoints">Floating Points</h3>

<p>I like to think I have a decent couple of articles going over floating point; I would suggest you read <a href="http://mattdavidbell.com/floating-point-basics/">Floating Point Basics</a> before reading this section.  Alternatively, there are many (better) resources going over floating point, Bruce Dawson's articles being my favourite.</p>

<p>What we care about in regards to floating point representation is that it doesn't quite work the same way two's complement works.  Positive numbers can be sorted the same way (as signed or unsigned integers) because the exponent, which determines the size of a number more than the mantissa does, is in a more Significant position.  The issue is, unsurprisingly, with negative numbers.  Instead of subtracting the value of the Most Significant bit, floating point mirrors based on that bit.</p>

<p>1.0f is stored as 0x3f800000; in order to get -1.0f you just flip the most significant bit, which gives you 0xbf800000.  2.0f is stored as 0x40000000 and -2.0f is 0xc0000000.  As you can see, while -2.0f is less than -1.0f, its value when interpreted as an unsigned integer is greater (0xc0000000 &gt; 0xbf800000); this wasn't true of -2 and -1 as signed integers (0xfffffffe and 0xffffffff respectively).</p>
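<p>You can inspect those bit patterns yourself; <code>memcpy</code> is the well-defined way to grab a float's raw bits (this helper is mine, for illustration):</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Return the raw IEEE 754 bit pattern of a 32 bit float.
uint32_t FloatBits(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return bits;
}
```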

<p>All this means, however, is that we can use our signed integer approach and then do some quick post processing on the negative numbers.</p>

<pre><code>void RadixSort(float* floatArr, u32 numFloats)
{
    // We will store the counts in here and then "accumulate" them in order to
    // turn them into iterators
    u32 iters[256];
    // Our implementation is not inplace so we need a swap buffer
    u32* swap = new u32[numFloats];
    // We're going to treat our floatArr as an unsigned one
    u32* intArr = reinterpret_cast&lt;u32*&gt;(floatArr);

    // 3 passes, one per byte.  The last byte needs to be modified to account
    // for floating point representation
    for (u32 idx = 0; idx &lt; 3; ++idx)
    {
        // Set all iterators/counts to 0
        ClearCounts(iters);
        // Increment count for corresponding bucket
        for (u32 jdx = 0; jdx &lt; numFloats; ++jdx)
        {
            u32 bucket = (intArr[jdx] &gt;&gt; (idx * 8)) &amp; 0xffu;
            iters[bucket]++;
        }

        // Translate counts into iterator
        AccumulateCounts(iters);
        // Place values at their proper spot based on their bucket's iterator
        for (u32 jdx = 0; jdx &lt; numFloats; ++jdx)
        {
            u32 value = intArr[jdx];
            u32 bucket = (value &gt;&gt; (idx * 8)) &amp; 0xffu;

            u32 loc = iters[bucket]++;
            swap[loc] = value;
        }

        // Swap arrays to get ready for next pass
        u32* temp = intArr;
        intArr = swap;
        swap = temp;
    }

    // We modify our last pass in order to account for fp representation.
    // What this boils down to is flipping the Most Significant bit of
    // every element when doing bucket calculations.  This is the same
    // as two's complement; however, we will run more logic afterwards to
    // account for the differences between two's complement and fp representation
    ClearCounts(iters);
    for (u32 jdx = 0; jdx &lt; numFloats; ++jdx)
    {
        u32 bucket = (intArr[jdx] &gt;&gt; (24)) &amp; 0xffu;
        bucket ^= 0x80;
        iters[bucket]++;
    }
    AccumulateCounts(iters);
    for (u32 jdx = 0; jdx &lt; numFloats; ++jdx)
    {
        u32 value = intArr[jdx];
        u32 bucket = (value &gt;&gt; 24) &amp; 0xffu;
        bucket ^= 0x80;

        u32 loc = iters[bucket]++;
        swap[loc] = value;
    }

    u32* temp = intArr;
    intArr = swap;
    swap = temp;

    // Clean up allocated memory
    delete[] swap;

    // In order to account for negative floating points decreasing in
    // value (-2.0f &lt; -1.0f) while the bit representation increases 
    // (0xc0000000 &gt; 0xbf800000) we need to mirror our array over the mid
    // point
    u32 firstPositive = iters[0x7f];
    for (u32 idx = 0; idx &lt; firstPositive/2; ++idx)
    {
        u32 temp = intArr[idx];
        intArr[idx] = intArr[firstPositive - idx - 1];
        intArr[firstPositive - idx - 1] = temp;
    }
}
</code></pre>

<p>The important part is right at the end.  First we want to iterate only over negative numbers so we grab the iterator for the last negative byte value (0x7f, remember we flip the 0x80 bit in our algorithm meaning we're looking at values without the bit set).  A side effect of our algorithm is that the iterator of the last negative number will be the index of the first positive number.</p>

<p>We then iterate from the beginning to the halfway point, swapping the element at index <code>idx</code> with the element at index <code>firstPositive - 1 - idx</code>; <code>firstPositive - 1</code> being another way of saying "last negative".  And we're done.  Surprisingly, doing this postprocess is faster than calculating the correct index for negative numbers in the last pass.  Maybe we'll look into that in a future article.</p>
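<p>That mirroring step in isolation is just an in-place reversal of the first <code>firstPositive</code> elements.  As a sketch (the function name is mine):</p>

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Reverse arr[0 .. count-1] in place, as done to the block of negative
// values at the end of the float sort.
void MirrorFront(uint32_t* arr, uint32_t count)
{
    for (uint32_t idx = 0; idx < count / 2; ++idx)
        std::swap(arr[idx], arr[count - idx - 1]);
}
```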

<h3 id="wrapup">Wrap up</h3>

<p>What makes Radix Sort so interesting is its main pro and con: it's one of the best scaling sorts but it's not general purpose.  We were able to extend it to signed integers and floating point with relative ease, but arbitrary classes wouldn't be possible: you can't really provide a comparison function for a sort that doesn't compare.</p>

<p>In regards to scaling there isn't much to say: Radix Sort has no worst case and scales at <code>O(nk)</code> where k is the number of bytes.  In tests, Radix Sort ran about 4x as fast as Quicksort (which averages <code>O(n * log(n))</code>) on randomly generated sets.  Simulating Quicksort's worst case (<code>O(n^2)</code>), however, has it scaling at a much worse rate, running nearly 1000x as slow as Radix at 100000 elements and almost 2500x slower at 300000.</p>

<ul>
<li>Regular Case:
<ul><li>100000
<ul><li>Radix: 1.96ms </li>
<li>Quick : 8.40ms</li></ul></li>
<li>200000
<ul><li>Radix: 3.5ms</li>
<li>Quick: 17.3ms</li></ul></li>
<li>300000
<ul><li>Radix: 5.82ms</li>
<li>Quick: 26.71ms</li></ul></li>
<li>1000000
<ul><li>Radix: 19.34ms</li>
<li>Quick: 78.64ms</li></ul></li>
<li>10000000
<ul><li>Radix: 198.01ms</li>
<li>Quick: 795.10ms</li></ul></li>
<li>100000000
<ul><li>Radix: 1837.22ms</li>
<li>Quick: 7380.67ms</li></ul></li></ul></li>
<li>Worst Case
<ul><li>100000
<ul><li>Radix: 1.95ms </li>
<li>Quick : 1740.75ms</li></ul></li>
<li>200000
<ul><li>Radix: 3.7ms</li>
<li>Quick: 6535.66ms</li></ul></li>
<li>300000
<ul><li>Radix: 5.90ms</li>
<li>Quick: 14851.15ms</li></ul></li></ul></li>
</ul>

<p>These numbers are somewhat useless on their own but at least show a rough comparison between a Radix Sort and a Quick Sort.  Quick Sort has many merits of its own and would be a great topic for another time; it was chosen here because its worst case was very easy to simulate (an already sorted list) and because of its general prevalence.</p>

<p>I suppose an important question to answer is "When do I use Radix Sort?" and the answer is basically: whenever you can, for large enough lists.  Radix Sort has some fixed costs (clearing/accumulating the <code>iters</code> buffer) that make it not ideal for very small numbers of elements; however, that quickly becomes irrelevant.  The main downside of Radix Sort is its inflexibility: you won't always be able to use it.</p>

<p>So there we are, hopefully at this point you have a pretty good idea of how Radix Sort (or at least this implementation of it) works.  Also worth pointing out is that we were able to extend this algorithm to both signed integers and floats by understanding their bit layout.  While going through programming just accepting things work a specific way is definitely possible, there is always a benefit from spending time to understand the minutiae.</p>]]></content:encoded></item><item><title><![CDATA[Floating Point Limitations]]></title><description><![CDATA[<p>It's common knowledge that floating point math is not perfect.  In this article I hope to explain the issues you are most likely going to encounter.  In a previous article I went over <a href="http://mattdavidbell.com/floating-point-basics/">Floating Point Basics</a>, if you haven't read it then I suggest you do (or read any of</p>]]></description><link>https://mattdavidbell.com/floating-point-implications/</link><guid isPermaLink="false">0cd3ef35-1e77-495d-a62a-5491fcb47b44</guid><dc:creator><![CDATA[Matthew Bell]]></dc:creator><pubDate>Mon, 18 Jul 2016 06:12:14 GMT</pubDate><content:encoded><![CDATA[<p>It's common knowledge that floating point math is not perfect.  In this article I hope to explain the issues you are most likely going to encounter.  In a previous article I went over <a href="http://mattdavidbell.com/floating-point-basics/">Floating Point Basics</a>, if you haven't read it then I suggest you do (or read any of the many other resources on IEEE 754 floating point.)</p>

<h4 id="unrepresentablenumbers">Unrepresentable numbers</h4>

<p>These are numbers that cannot be represented in base two regardless of precision.  All integers can be represented (given enough bits) so we will ignore those.</p>

<p>In the previous article we defined the floating point format as 3 parts: A sign, an exponent and a mantissa.  Neither sign nor exponent will come into play in this section so we will be focusing solely on the mantissa which is essentially a binary number in decimal notation.</p>

<blockquote>
  <p>Decimal notation is a horrible term here; do not be confused by the implication that it is base 10.  It refers to the existence of a decimal point ($a.bcd$) and has no necessary relation to base 10.</p>
</blockquote>

<h5 id="decimalnotationasfractions">Decimal Notation as Fractions</h5>

<p>Before we begin it is important that we understand that the set of numbers that can be represented by fractions is a superset of the set of numbers that can be represented in decimal notation.  That is to say that any number that can be represented in decimal notation can be represented as a fraction but the opposite is not necessarily true.</p>

<blockquote>
  <p>We do not consider numbers that have a never ending repeating pattern to be representable in decimal notation ( e.g. $0.33333...$ )</p>
</blockquote>

<p>Demonstrating this is fairly simple: If we set the denominator to the base to the power of the number of digits after the decimal point and set the numerator to the original number multiplied by the denominator we get a fraction.  For the specific example $1.234$ we set our denominator to $10^3$ and our numerator to $1.234 * 10 ^ 3 = 1234$ which gives us $1234/1000$.</p>
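<p>The conversion above is mechanical enough to sketch in code.  The function name here is made up for illustration:</p>

<pre><code>#include &lt;cmath&gt;

/* Convert a value with 'digits' digits after the decimal point into the
   numerator of a fraction over 10^digits.  1.234 with 3 digits -&gt; 1234. */
long long NumeratorOverPowerOfTen( double value, int digits )
{
    /* llround absorbs the tiny binary representation error in 'value' */
    return std::llround( value * std::pow( 10.0, digits ) );
}
</code></pre>

<p>So <code>NumeratorOverPowerOfTen( 1.234, 3 )</code> gives <code>1234</code>, i.e. $1234/1000$.</p>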

<p>This leads us to our first set of unrepresentable numbers.  If we can't represent a number in a fraction then we definitely can't represent it as a decimal point.  This set of numbers is referred to as irrational numbers and includes numbers like $\pi$ and $e$.</p>

<blockquote>
  <p>Further explanation of irrational numbers is beyond the scope of this article.  They are the least common of the unrepresentable numbers and, honestly, I don't know enough about them to do them justice.</p>
</blockquote>

<h5 id="unrepresentablefractionsinbase10decimalnotation">Unrepresentable fractions in base 10 decimal notation</h5>

<p>With those out of the way we'll look into what numbers can be represented as fractions but not as numbers in decimal notation.  We'll first consider examples in base 10 as that's what most people are used to thinking in.  To be precise the examples we'll look at now are:</p>

<p>$$1/3 = 0.333333...$$
$$1/10 = 0.1$$
$$1/2 = 0.5$$</p>

<p>One of these is not like the others.  $1/3$ results in a repeating pattern, which means that, at any finite precision, decimal format cannot represent it.  Why is that?</p>

<p>In order to answer that we'll first look at why $1/10$ is representable.  It's representable because decimal notation is in fact a summation of values multiplied by powers of the base.  By that I mean for a decimal number $a.bcd$ the value of that number is: <br>
$$(a * 10 ^ 0) + (b * 10 ^ {-1}) + (c * 10 ^ {-2}) + (d * 10 ^ {-3})$$</p>

<p>If we set $b = 1$ and everything else to $0$ the equation is reduced to <br>
$$ 1 * 10 ^ {-1} = 10 ^ {-1} = 1/10 = 0.1$$</p>

<p>Moving on to our other representable number we're presented with a slightly more complicated example.  The denominator for $1/2$ is not a power of $10$ but the factors of the denominator ($2$) are a subset of the factors of $10$.  Using this knowledge we just multiply both our denominator and our numerator by the missing factors in order to represent the fraction as one with a power of $10$ as the denominator.  In this example, the factors of $10$ are $2$ and $5$; our denominator of $2$ is missing $5$ as a factor so we multiply both numerator and denominator by that:</p>

<p>$$ 1/2 * 5/5 = 5/10 = 0.5 $$</p>

<blockquote>
  <p>$5/5$ is of course just $1$ but we list it explicitly here to demonstrate our logic</p>
</blockquote>

<p>What this means is that any fraction can be represented in decimal notation as long as it can be represented by a fraction with a power of the base as the denominator.  This occurs when:</p>

<ul>
<li>The denominator is already a power of the base</li>
<li>The denominator's factors are a subset of the factors of a power of the base.</li>
</ul>

<p>A good example of this is $3/15$.  $15$'s factors are $3$ and $5$; these are not a subset of the factors of any power of the base ($10$).  However $3/15$ can also be represented as $1/5$, which has a denominator whose factor ($5$) is a subset of the factors of the base, so we multiply the numerator and denominator by the missing factor ($2$):</p>

<p>$$3/15 = 1/5 = 1/5 * 2/2 = 2/10 = 0.2$$</p>

<h5 id="applicationtobinarynumbers">Application to Binary numbers</h5>

<p>Returning to the world of IEEE 754 floating point means returning to a base $2$ world.  Everything is simpler here because our base has only one factor (excluding $1$), itself.  Quickly going through the numbers above:</p>

<blockquote>
  <p>IEEE 754 floating point actually includes base 10 variants but we will conveniently ignore these as they are very rarely seen.</p>
</blockquote>

<ul>
<li>$1/3$: the factors of the denominator are $1$ and $3$; this isn't a subset of the factors of our base so we know this can't be represented</li>
<li>$1/2$: the factors of the denominator are $1$ and $2$; this is exactly the set of factors of our base so we know this can be represented</li>
<li>$1/10$: the factors of the denominator are $1$, $2$ and $5$; the $5$ is not a factor of any power of our base so we know this cannot be represented.</li>
</ul>

<p>Now, what does this mean for these numbers when we use them in code?  For $1/2$ it means very little: it can be represented, so math using it will be accurate unless we run into precision problems (covered later).  For $1/3$ and $1/10$, though, it means that when we try to use these numbers we're actually using numbers close to, but not exactly, them.  Over time this can accumulate into noticeable error.</p>

<pre><code>float foo = 1.1f; /* As we know from above, 1.1f cannot be represented exactly in binary */
bool correct = ( foo + foo ) == 2.2f; /* true: the float nearest 1.1, doubled, is exactly the float nearest 2.2 */
bool incorrect = ( foo + foo + foo ) == 3.3f; /* false: adding the float nearest 1.1 three times rounds to a value just above the float nearest 3.3 */
</code></pre>

<blockquote>
  <p>It shouldn't be surprising that $2.2$ and $1.1 + 1.1$ end up being equal.  The approximation of $2.2$ is exactly the approximation of $1.1$ with the exponent one higher, so doubling introduces no new rounding error.</p>
</blockquote>

<p>The trick here is to almost never use equality on floats: always check whether they are close to each other instead of testing strict equality.</p>
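<p>A minimal sketch of what that looks like ( the helper name and the choice of epsilon are mine, and the right tolerance depends heavily on the magnitudes involved ):</p>

<pre><code>#include &lt;cmath&gt;

/* Relative comparison: equal when the difference is tiny compared to
   the larger of the two values.  1e-5f is an arbitrary example epsilon. */
bool NearlyEqual( float a, float b, float epsilon = 1e-5f )
{
    float diff = std::fabs( a - b );
    float largest = std::fmax( std::fabs( a ), std::fabs( b ) );
    return diff &lt;= largest * epsilon;
}

float foo = 1.1f;
bool stillFalse = ( foo + foo + foo ) == 3.3f;       /* strict equality fails */
bool nowTrue = NearlyEqual( foo + foo + foo, 3.3f ); /* tolerant check passes */
</code></pre>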

<p>The last note I'll say on unrepresentable numbers is just how often they occur in code.  The factors for $10$ are $1$, $2$ and $5$.  If we look at any decimal value as a numerator $N$ and a denominator $D$ where $D$ is always a power of $10$ then $N$ needs to be divisible by $5^{\log_{10} D}$.  In layman's terms, for every digit after the decimal point in base 10, the value needs to be divisible by another 5.  $0.x$ need only be divisible by 5 ( so $0.5$ is the only possible solution ) whereas $0.xx$ needs to be divisible by $5 * 5 = 25$ ( so $0.25$, $0.50$ and $0.75$ are all possible ).  </p>]]></content:encoded></item><item><title><![CDATA[Floating Point Basics]]></title><description><![CDATA[<p>I love IEEE 754 floating point.  It's one of the most ingenious and crucial parts of programming.  It's also one of the most misunderstood.  While people generally know how to use it, those who know how it works seem to be less common; unfortunate considering how error prone they can</p>]]></description><link>https://mattdavidbell.com/floating-point-basics/</link><guid isPermaLink="false">c1c6f961-4d98-4706-9913-02c976e76e8e</guid><dc:creator><![CDATA[Matthew Bell]]></dc:creator><pubDate>Tue, 08 Sep 2015 02:07:15 GMT</pubDate><content:encoded><![CDATA[<p>I love IEEE 754 floating point.  It's one of the most ingenious and crucial parts of programming.  It's also one of the most misunderstood.  While people generally know how to use it, those who know how it works seem to be less common; unfortunate considering how error prone they can be.  I'll do my best to explain the format in this post, focusing solely on 32 bit normalized floats.</p>

<blockquote>
  <p>When the ^ character is used in this article it refers to "To the power of" and not "Exclusive or"</p>
</blockquote>

<p>I should note before we get into this that, outside of a classroom, most of my knowledge of floating point representation comes from <a href="https://randomascii.wordpress.com/category/floating-point/">Bruce Dawson's Posts</a> on the topic.  I highly recommend reading them.</p>

<h5 id="scientificnotation">Scientific Notation</h5>

<p>Floating point is essentially scientific notation for base 2.  With that in mind let's have a very quick refresher on what scientific notation is.</p>

<p>Before we get started I should say there are many other resources on scientific notation, many (most?) of which will do a better job of explaining it than I will.  This <a href="https://www.khanacademy.org/math/pre-algebra/exponents-radicals/scientific-notation/v/scientific-notation-old">video</a> from Khan Academy should do a pretty good job.</p>

<p>Scientific notation is simply another way to represent real numbers.  It follows the format of:</p>

<pre><code>( - ) M * 10 ^ E
</code></pre>

<p>where M stands for mantissa and E stands for Exponent.</p>

<p>The mantissa is a decimal number <code>[1, 10[</code>.  For example <code>1.0, 2.73344, 3.1415</code> are all acceptable mantissae; <code>0.5, 10.4312, 100.100</code> are not acceptable mantissae.</p>

<blockquote>
  <p><code>[1, 10[</code> means any number from 1 to 10 including 1 but not including 10.  It's often represented as <code>[1, 10)</code></p>
</blockquote>

<p>The Exponent is simply any integer.</p>

<p>This allows us to represent any real number: numbers greater than or equal to 10 will have a positive Exponent ( 11 for instance would be <code>1.1 * 10 ^ 1</code> ) while numbers less than 1 will have a negative Exponent ( 0.25 would be <code>2.5 * 10 ^ -1</code> ).  Numbers in the range <code>[1, 10[</code> will have an Exponent of 0.  The benefits of this format as it pertains to base 10 numbers are beyond the scope of this article.</p>

<blockquote>
  <p>You'll often see the * 10 ^ E portion replaced with just the character 'e' or 'E' followed by the number.  So <code>3.14 * 10 ^ 2</code> becomes <code>3.14e2</code>.</p>
</blockquote>

<h5 id="binaryformat">Binary format</h5>

<p>Let's transition this over to binary.  If we take a look at our equation from before again with a small modification:</p>

<pre><code>( -1 ) ^ ( S ) * M * 10 ^ E
</code></pre>

<blockquote>
  <p><code>( -1 ) ^ ( S )</code> is a convenient way of expressing "negative when set".  If S is 0 then <code>-1 ^ 0 == 1</code>, if S is 1 then <code>-1 ^ 1 == -1</code>.</p>
</blockquote>

<p>We see that there are in fact three parts.  The first part, which was largely ignored in the Scientific Notation section, is the sign, which will be represented by a single bit.  The second part is the exponent (<code>E</code>), which will be represented by 8 bits.  That leaves us with 23 bits for the last part, the mantissa.</p>

<blockquote>
  <p>32 bits minus 1 for the sign and 8 for the exponent is 23</p>
</blockquote>

<p>This gives us the binary format of:</p>

<pre><code>| s | eeeeeeee | mmmmmmmmmmmmmmmmmmmmmmm |
</code></pre>

<p>The equation itself is also going to change.  We move from base 10 to base 2 and we swap the 10 for a 2.</p>

<pre><code>( -1 ) ^ ( S ) * M * 2 ^ E
</code></pre>

<p>Let's look at how the binary format translates to this equation.</p>

<h6 id="sign">Sign</h6>

<p>When the sign bit is set the number is negative.  This is the same as multiplying by -1.</p>

<h6 id="exponent">Exponent</h6>

<p>The exponent is determined by pulling out the 8 exponent bits, taking their unsigned value (from 0 to 255) and subtracting 127 from it, thus allowing the exponent to be negative.</p>

<blockquote>
  <p>Note: This is different from two's complement which is how signed integers are represented.</p>
</blockquote>

<h6 id="mantissa">Mantissa</h6>

<p>If you recall from above, the mantissa is any number <code>[1, 10[</code>.  Now that we're in base 2 that becomes <code>[1, 2[</code>, which means the first digit is always a 1.  We use that to our advantage by making the leading 1 implicit, giving us 24 bits of precision using only 23 stored bits.</p>

<h6 id="specialvalues">Special Values</h6>

<p>We're ignoring a few special values here.  Having an exponent value of 0xFF gives us either infinity (if the mantissa bits are all set to 0) or various values representing NaN (if any mantissa bits are set to 1).  Having an exponent of 0x00 gives us either 0 (if the mantissa bits are all set to 0) or what is called a denormalized float, which is beyond the scope of this article (math operations on denormalized floats are much slower, so denormalized floats are generally disabled in games).</p>
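<p>Those rules are simple enough to check directly against the bits ( these helper names are made up for this article ):</p>

<pre><code>#include &lt;cstdint&gt;

/* Exponent bits all set, mantissa zero -&gt; infinity */
bool IsInfinityBits( uint32_t bits )
{
    return ( ( bits &gt;&gt; 23 ) &amp; 0xFF ) == 0xFF &amp;&amp; ( bits &amp; 0x7FFFFF ) == 0;
}

/* Exponent bits all set, mantissa non-zero -&gt; NaN */
bool IsNaNBits( uint32_t bits )
{
    return ( ( bits &gt;&gt; 23 ) &amp; 0xFF ) == 0xFF &amp;&amp; ( bits &amp; 0x7FFFFF ) != 0;
}
</code></pre>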

<h5 id="bringingitalltogether">Bringing it all together</h5>

<p>Taking all the information from above (ignoring the special values) we get the formula for interpreting the IEEE 754 32 bit floating point format.  Where <code>S</code> is the sign bit, <code>E</code> is the unsigned byte value from the 8 exponent bits and <code>M</code> is the 23 mantissa bits.</p>

<pre><code>value = ( ( -1 )^( S ) ) *
        1.M *
        (2 ^ ( E - 127));
</code></pre>
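<p>We can verify this formula by decoding a float's bits by hand and checking we get the original value back.  This sketch ignores the special values; the function name is just for illustration:</p>

<pre><code>#include &lt;cstdint&gt;
#include &lt;cstring&gt;
#include &lt;cmath&gt;

/* Rebuild a normalized float's value from its raw S, E and M fields */
float DecodeFloat( float f )
{
    uint32_t bits;
    std::memcpy( &amp;bits, &amp;f, sizeof( bits ) ); /* well-defined bit view */

    uint32_t S = ( bits &gt;&gt; 31 ) &amp; 0x1;  /* 1 sign bit */
    uint32_t E = ( bits &gt;&gt; 23 ) &amp; 0xFF; /* 8 exponent bits */
    uint32_t M = bits &amp; 0x7FFFFF;       /* 23 mantissa bits */

    /* 1.M == ( 0x800000 + M ) / 2^23, exact in float */
    float mantissa = ( float )( 0x800000 + M ) / ( float )0x800000;
    /* ldexp multiplies by 2^( E - 127 ) without any rounding */
    return ( S ? -1.0f : 1.0f ) * std::ldexp( mantissa, ( int )E - 127 );
}
</code></pre>

<p>For any normalized input, <code>DecodeFloat( x ) == x</code> holds exactly.</p>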

<blockquote>
  <p>1.M represents that we have an implicit bit due to the <code>[1, 2[</code> interval of acceptable values for our mantissa.  Another way of representing this would be (0x800000 + M).</p>
</blockquote>]]></content:encoded></item><item><title><![CDATA[C++ Coding Style]]></title><description><![CDATA[<p>I often find that partway through a project I get the urge to rethink certain aspects of the style I'm using.  I think that's partly because I've never really sat down and decided on a style; usually I just get started with whatever style I most recently matched at work.</p>]]></description><link>https://mattdavidbell.com/c-coding-style/</link><guid isPermaLink="false">b4d8746a-4cc1-4060-909e-a83ba51364e2</guid><dc:creator><![CDATA[Matthew Bell]]></dc:creator><pubDate>Fri, 28 Aug 2015 07:06:12 GMT</pubDate><content:encoded><![CDATA[<p>I often find that partway through a project I get the urge to rethink certain aspects of the style I'm using.  I think that's partly because I've never really sat down and decided on a style; usually I just get started with whatever style I most recently matched at work.  My intention here is to have a solid coding style doc to reference for any personal projects.  It may be a living document that changes as I encounter new arguments for and against different styles but it should always reflect my current stance.</p>

<h3 id="rules">Rules</h3>

<h5 id="generalrules">General rules</h5>

<ul>
<li>All lines should be less than 120 characters long.</li>
<li>No function should be greater than 100 lines.</li>
<li>Prefer descriptive variable names over single character ones.  Abbreviations are acceptable.</li>
<li>Avoid nesting loops or <code>if</code>s more than 3 levels deep</li>
<li>Use tabs for indentation to block level and spaces for alignment.</li>
</ul>

<h5 id="avoidrawloops">Avoid Raw Loops</h5>

<p>This rule was taken from: <a href="https://channel9.msdn.com/Events/GoingNative/2013/Cpp-Seasoning">https://channel9.msdn.com/Events/GoingNative/2013/Cpp-Seasoning</a> <br>
Raw loops reduce readability of code and should be pulled out of larger functions.</p>

<p>The code:</p>

<pre><code>// Do stuff above
u32 max = 0;
for ( int idx = 0; idx &lt; collectionSize; ++idx )
{
    if ( collection[idx] &gt; max )
        max = collection[idx];
}
// Do more stuff below
</code></pre>

<p>Should instead be written as</p>

<pre><code>// Do stuff above
u32 max = MaxOf ( collection, collectionSize );
// Do more stuff below
</code></pre>

<p>This is a contrived example and also illustrates a case where using standard algorithms would be a better choice.</p>

<h5 id="namingconvention">Naming Convention</h5>

<p>Classes will begin with a capital letter and every new word will have its first letter capitalized</p>

<pre><code>class CamelCase;
</code></pre>

<p>Functions will follow the same naming convention as classes.</p>

<p>Plain Old Data (POD) types will be marked as structs and will be lower case with words separated by underscores</p>

<pre><code>struct plain_old_data;
</code></pre>

<p>Variables will follow the naming convention of <code>s_dName</code> where <code>s</code> is the scope and <code>d</code> is the data type which will only be present in the case of primitive types: <code>n</code> for <code>signed integer</code>, <code>u</code> for <code>unsigned integer</code>, <code>f</code> for <code>float</code>, <code>p</code> for pointer, <code>s</code> for <code>string</code> (std or c-style) and <code>b</code> for <code>bool</code>.  Scope may be omitted for local variables ( in which case the underscore is also omitted ); in other cases it will be <code>m</code> for member and <code>g</code> for global or static at the class level.</p>
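<p>A quick ( entirely hypothetical ) illustration of these rules together:</p>

<pre><code>typedef unsigned int u32;

u32 g_uFrameCount = 0; /* global unsigned integer */

class EnemySpawner /* class: CamelCase */
{
public:

    u32 GetSpawnCount() const { return m_uSpawnCount; }

private:

    u32   m_uSpawnCount = 0;    /* member unsigned integer */
    float m_fSpawnDelay = 1.5f; /* member float */
    bool  m_bActive = false;    /* member bool */
};

struct spawn_point /* POD: lower case with underscores */
{
    float m_fX;
    float m_fY;
};
</code></pre>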

<h4 id="const">Const</h4>

<p>The first <code>const</code> may go before or after the type.  While before is considered more conventional, having the <code>const</code> after the type makes it consistent with the other placements of <code>const</code> ( which also modify the thing that precedes them ).</p>
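<p>For example, both declarations below are allowed, with the <code>const</code>-after-the-type form reading consistently right to left:</p>

<pre><code>typedef unsigned int u32;

const u32 uCount = 10; /* const before the type: conventional */
u32 const uTotal = 20; /* const after the type: consistent */
u32 const * const puTotal = &amp;uTotal; /* each const modifies what precedes it */
</code></pre>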

<p>Const functions should be preferred to non const functions.  Pure functions should, in particular, be preferred as it increases code readability and reduces likelihood of unwanted state mutation.</p>

<p>If a reference is passed into a function it must be const.  If a function is meant to mutate the state of a passed in variable that variable must be passed in as a pointer.  This is to ensure all state mutation is explicit on the side of the function caller.</p>
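<p>For instance ( <code>Normalize</code> here is a made-up example function ):</p>

<pre><code>#include &lt;cmath&gt;

struct vec2 { float m_fX; float m_fY; };

/* Read-only input comes in by const reference;
   the mutated result goes out through a pointer. */
void Normalize( const vec2&amp; vIn, vec2* pvOut )
{
    float fLength = std::sqrt( vIn.m_fX * vIn.m_fX + vIn.m_fY * vIn.m_fY );
    pvOut-&gt;m_fX = vIn.m_fX / fLength;
    pvOut-&gt;m_fY = vIn.m_fY / fLength;
}
</code></pre>

<p>At the call site, <code>Normalize( vDir, &amp;vResult );</code> makes it obvious which argument is being mutated.</p>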

<h4 id="whitespace">Whitespace</h4>

<h5 id="parenthesisspacing">Parenthesis spacing</h5>

<p>There should be whitespace after every opening parenthesis and before every closing parenthesis.  Assignment and math operators should be surrounded by whitespace.  Commas should have whitespace after them.</p>

<pre><code>T variable = Function( arg0, arg1 );
T variable2 = variable + variable;
</code></pre>

<p>There should also be a whitespace after an <code>if</code> and before the parenthesis.  There can optionally be one after a function's name and before the parenthesis.</p>

<pre><code>if ( condition )
T0 Function ( T1 arg )
</code></pre>

<h5 id="logicalspacing">Logical spacing</h5>

<p>Whitespace can be used to break up long functions into logical pieces.  The better solution in many cases where this applies is to pull the functionality into its own function.  In the cases where doing so is not better, be sure to add a comment before the block describing the functionality.</p>

<p>There should also be spacing before and after every access modifier in a class definition.  The only exception to this is if a class definition starts with an access modifier, in which case it doesn't need the space before.  This also applies to struct definitions.</p>

<pre><code>class Class
{
public:

    T0 Function ();

private:

    u32 m_uVar;
};
</code></pre>

<h5 id="nestedblockspacing">Nested block spacing</h5>

<p>There will be a linebreak before and after every opening brace and closing one.  The only exception to this is when both opening and closing brace can fit on the same line.  This exception does not apply if there are multiple related blocks ( such as an <code>if</code> followed by one or more <code>else if</code>s and/or an <code>else</code> ).</p>

<pre><code>// Acceptable
T0 Function( T1 arg )
{
    // code
}

// Unacceptable
T0 Function( T1 arg ) {
    // code
}

// Also acceptable
if ( condition ) { /* code */ }

// Unacceptable
if ( condition ) { /* code */ }
else { /* code */ }

// Acceptable
if ( condition ) 
{
    // code
}
else
{
    // code
}
</code></pre>

<p>This applies also to <code>else</code> and <code>else if</code> meaning they must both be on their own line.</p>]]></content:encoded></item></channel></rss>