Radix Sort
Radix Sort has always been my favourite sort. I learned it while infatuated with bitwise operations and the arcane tricks I was picking up at the time. It also stands out as a sort that never compares elements against each other.
We'll explore using Radix Sort with 3 different types of data: the super simple unsigned 32 bit integer, the slightly less simple signed 32 bit integer and the fun 32 bit floating point. We'll discuss how Radix Sort works and use it as an example of how knowing the format of the data you're working with can come in useful.
The theory behind it
The Radix Sort implementation we'll be looking at is built on a simple idea, best described as sorting into buckets using one subset of each element at a time as the key. The easiest way to demonstrate this is sorting base 10 numbers one digit at a time. For base 10 digits this means we'll start with the rightmost digit first, as it represents the Least Significant value; we'll discuss why later.
The term significance is going to be used a lot. Significance describes how much of an effect a portion of an element has in relation to other portions. In base 10, the digit in the 100s position has more significance than the digit in the 10s position as it contributes more to the value of the number as a whole. Said another way, a 1 in the 100s position is worth more than a 1 in the 10s position. It works the same for bits and bytes: values further "left" have greater significance.
We start with the following values: 10, 45, 100, 9. We then go through each and sort based off the rightmost digit, keeping the order consistent when values share the same digit (as is the case for 10 and 100, which both have 0 in the rightmost position). We do this for each digit, treating any non-present digit as 0 (meaning, for instance, in 9 we treat the second digit from the right as 0):
010, 045, 100, 009
010, 100, 045, 009
100, 009, 010, 045
009, 010, 045, 100
We start with the Least Significant value because our algorithm relies on each pass of the sort being stable. What this means is that a pass will not change the relative order of elements that share the same digit. If we had 46 in the above example, it would have been sorted after 45 in the first pass, and the second and third passes would not have changed its position in relation to 45. 36, on the other hand, would start out sorted after 45 in the first pass, but its position would be corrected in the second pass. If you're unconvinced, I suggest taking the examples above and sorting by the leftmost digit first; you'll end up with 100, 010, 045, 009.
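If the stability argument feels abstract, here's a quick sketch you can run (not the article's implementation; it fakes each stable bucket pass with std::stable_sort, comparing one digit at a time):

#include <algorithm>
#include <cstdio>
#include <iterator>

int main()
{
    int values[] = { 10, 45, 100, 9 };
    // One stable pass per digit, least significant first
    for (int digit = 1; digit <= 100; digit *= 10)
    {
        std::stable_sort(std::begin(values), std::end(values),
            [digit](int a, int b) { return (a / digit) % 10 < (b / digit) % 10; });
    }
    for (int v : values)
        printf("%d ", v); // prints: 9 10 45 100
    return 0;
}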
Our algorithm iterates over each byte and, within that iteration, iterates over each element twice. I'll endeavour to refer to the internal element iterations as the first and second "pass" or "per-byte pass". What this breaks down to is a structure of
for(bytes) { for (elements) {} for (elements){} }
Now all we need to do is extend that to the specific data that we have.
Unsigned Int
Why unsigned first
A lot of people dissuade programmers from using unsigned ints. I don't agree with that advice; I'd say instead that you should never (unless you know exactly what you're doing) mix signed and unsigned ints. When kept separate, however, they each have their own advantages.
The plus side to unsigned integers is that they behave how you expect when doing bitwise operations and, of the data types we're considering, they are the least abstracted from their raw data layout. We'll discuss these differences more when we're dealing with signed integers. It's this low level of abstraction that makes them the best data type to implement this with first.
We know how Radix Sort is supposed to work on base 10 numbers: going digit by digit. In our implementation we're going to go byte by byte instead of digit by digit. There's no reason you couldn't divide up your elements differently; in the above base 10 sample you could, for instance, do 2 digits at a time. Bytes are nice in that it's easy for others to determine what we're doing (x & 0xffu is a very common operation) and they are large enough to speed up the algorithm significantly.
x & 0xffu is a bitwise and against the Least Significant Byte. What this means is that it will keep whatever value is in that byte and set all other bytes to 0. The u at the end of 0xffu designates the literal as unsigned.
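As a quick illustration of the shift-and-mask pattern we'll use throughout (the value here is just for demonstration):

#include <cstdio>

int main()
{
    unsigned int x = 0x12345678u;
    // Walk the bytes from Least to Most Significant
    for (unsigned int idx = 0; idx < 4; ++idx)
        printf("0x%02x\n", (x >> (idx * 8)) & 0xffu); // 0x78, 0x56, 0x34, 0x12
    return 0;
}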
There's also the matter of concrete implementation details. The way we're going to approach dividing things up is to do two sub-passes per byte. The first pass will go through and keep track of how many elements belong in each bucket. We will then take this counts array and turn it into an array of starting indices. With that in hand we will go through our elements again (pass two now) and place them at the correct index based on their bucket, incrementing that bucket's index each time something is placed in that bucket.
A good way to understand this two pass approach is to consider the case where we're sorting unsigned bytes instead of 32 bit unsigned integers. In that case you could just count the 0s, 1s, 2s...255s and then store however many 0s there are, then however many 1s, and so on and so forth in your array. That's essentially what we do on a per byte basis, but we also preserve extra information (the other 3 bytes of the 32 bit unsigned integer).
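Here's a minimal sketch of that single-byte case (my illustration, not the article's code): a plain counting sort, where no extra information needs preserving so the array can be rebuilt straight from the counts.

#include <cstring>

void CountingSort(unsigned char* arr, unsigned int num)
{
    unsigned int counts[256];
    memset(counts, 0, sizeof(counts));
    // Count how many times each byte value occurs
    for (unsigned int idx = 0; idx < num; ++idx)
        counts[arr[idx]]++;
    // Rewrite the array: all the 0s, then all the 1s, and so on
    unsigned int out = 0;
    for (unsigned int value = 0; value < 256; ++value)
        for (unsigned int c = 0; c < counts[value]; ++c)
            arr[out++] = (unsigned char)value;
}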
With all that in mind, here's the code:
#include <cstdint>

// Assumed typedef; the article uses u32 throughout without defining it
typedef uint32_t u32;

// Helpers used below; AccumulateCounts is shown in full later and
// ClearCounts is sketched for completeness
void ClearCounts(u32 counts[256]);
void AccumulateCounts(u32 counts[256]);

void RadixSort(u32* intArr, u32 numInts)
{
    // We will store the counts in here and then "accumulate" them in order to
    // turn them into iterators
    u32 iters[256];
    // Our implementation is not inplace so we need a swap buffer
    u32* swap = new u32[numInts];
    // 4 passes, one per byte
    for (u32 idx = 0; idx < 4; ++idx)
    {
        // Set all iterators/counts to 0
        ClearCounts(iters);
        // Increment count for corresponding bucket
        for (u32 jdx = 0; jdx < numInts; ++jdx)
        {
            u32 bucket = (intArr[jdx] >> (idx * 8)) & 0xffu;
            iters[bucket]++;
        }
        // Translate counts into iterator
        AccumulateCounts(iters);
        // Place values at their proper spot based off its bucket's iterator
        for (u32 jdx = 0; jdx < numInts; ++jdx)
        {
            u32 value = intArr[jdx];
            u32 bucket = (value >> (idx * 8)) & 0xffu;
            u32 loc = iters[bucket]++;
            swap[loc] = value;
        }
        // Swap arrays to get ready for next pass
        u32* temp = intArr;
        intArr = swap;
        swap = temp;
    }
    // Clean up allocated memory; after 4 swaps the sorted data is back in
    // the caller's array and swap points at our allocation
    delete[] swap;
}
Now, that's some pretty dense code so don't worry if you don't understand it right away. Let's step through it slowly.
External Setup
First we have some simple setup. Our algorithm keeps track of groupings of bytes; there are 256 possible values for a single byte, so we create an array of 256 unsigned integers to keep track of their counts. The algorithm also isn't in place, so we need to create a new buffer that we can write to while reading from our original array.
// We will store the counts in here and then "accumulate" them in order to
// turn them into iterators
u32 iters[256];
// Our implementation is not inplace so we need a swap buffer
u32* swap = new u32[numInts];
Next we are going to do the sub-algorithm for each byte in the elements. Our elements are unsigned 32 bit integers which have 4 bytes (a byte being 8 bits) each; therefore we need to iterate 4 times, once per byte.
// 4 passes, one per byte
for (u32 idx = 0; idx < 4; ++idx)
First Per-byte Pass
Our per-byte algorithm can be split into two iterations over the elements. Before our first pass we clear out our counts buffer; this simply iterates through and zeroes every element in it (the code is trivial, but a sketch follows below for completeness). Once done we iterate through every element, grab the appropriate byte and increment the count for that byte.
// Set all iterators/counts to 0
ClearCounts(iters);
// Increment count for corresponding bucket
for (u32 jdx = 0; jdx < numInts; ++jdx)
{
    u32 bucket = (intArr[jdx] >> (idx * 8)) & 0xffu;
    iters[bucket]++;
}
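For completeness, here's a plausible ClearCounts; the article omits it, so this is an assumption based on how it's used:

// Hypothetical ClearCounts: from its usage it just zeroes the counts buffer
void ClearCounts(u32 counts[256])
{
    for (u32 idx = 0; idx < 256; ++idx)
        counts[idx] = 0;
}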
Before we go any further, let's look at u32 bucket = (intArr[jdx] >> (idx * 8)) & 0xffu; in a bit more depth. This is how we're grabbing the appropriate byte. In order of operations we have idx * 8, which simply turns the next step from a bitwise operation into a bytewise one: a byte is simply 8 bits. Next is intArr[jdx] >> (idx * 8); here we are grabbing the element in question, intArr[jdx], and shifting it to the right by as many bytes as our outer loop index. This means that each outer loop iteration will have the appropriate byte shifted into the Least Significant Byte position. Which brings us to the last step: & 0xffu masks out all but the Least Significant Byte, setting all other bytes to 0.
To summarize: we calculate the right shift necessary to bring the appropriate byte into the Least Significant Byte position, we execute that right shift, and then we set all but the Least Significant Byte to 0.
Accumulation
Next we accumulate the counts in such a way that instead of having counts in the counts buffer we have iterators.
// Translate counts into iterator
AccumulateCounts(iters);
Unlike ClearCounts, the AccumulateCounts function is worth going over quickly.
void AccumulateCounts(u32 counts[256])
{
    u32 increment = counts[0];
    counts[0] = 0;
    for (u32 idx = 1; idx < 256; ++idx)
    {
        u32 nextIncrement = counts[idx];
        counts[idx] = counts[idx - 1] + increment;
        increment = nextIncrement;
    }
}
The purpose of this function is to convert counts into iterators. We do this by setting every element to the sum of all the previous elements. To achieve this we store the count at the index in question into a temporary variable, set the value at the index to the previous iteration's result (or 0 in the case of the first element) plus the increment variable from the previous iteration, and finally store our temporary variable into increment for the next iteration.
As an example let's look at our decimal numbers from before: 010, 045, 100, 009. After the first counting pass we would have a counts array of 2 (010 and 100), 0, 0, 0, 0, 1 (045), 0, 0, 0, 1 (009). After AccumulateCounts we would have a counts array of 0, 2, 2, 2, 2, 2, 3, 3, 3, 3. This would allow us to store 010 and 100 in the first 2 indices (0 and 1), 045 in the 3rd index (2) and 009 in the last index (3). If you don't immediately understand this, don't worry; the next bit of code we look at implements it.
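If you'd like to poke at the accumulation on the decimal example directly, here's a 10-bucket variant of the same logic (illustrative only; the real function works on 256 buckets):

#include <cstdio>

void AccumulateCounts10(unsigned int counts[10])
{
    unsigned int increment = counts[0];
    counts[0] = 0;
    for (unsigned int idx = 1; idx < 10; ++idx)
    {
        unsigned int nextIncrement = counts[idx];
        counts[idx] = counts[idx - 1] + increment;
        increment = nextIncrement;
    }
}

int main()
{
    // The counts from the decimal example: 010, 045, 100, 009
    unsigned int counts[10] = { 2, 0, 0, 0, 0, 1, 0, 0, 0, 1 };
    AccumulateCounts10(counts);
    for (unsigned int idx = 0; idx < 10; ++idx)
        printf("%u ", counts[idx]); // prints: 0 2 2 2 2 2 3 3 3 3
    return 0;
}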
Second Per-byte Pass
// Place values at their proper spot based off its bucket's iterator
for (u32 jdx = 0; jdx < numInts; ++jdx)
{
    u32 value = intArr[jdx];
    u32 bucket = (value >> (idx * 8)) & 0xffu;
    u32 loc = iters[bucket]++;
    swap[loc] = value;
}
Above is the second pass of our per-byte algorithm. For each element, we find which bucket it belongs to using the calculation you saw in the first pass, (value >> (idx * 8)) & 0xffu. We use that bucket to look up into the counts buffer, grab the current value and then increment it. This is what is meant when we say the counts buffer becomes a buffer of iterators. We use the value we just grabbed as an index into the swap buffer and set the value at that index to the element of the array we're processing.
Setup for next iteration and cleanup
The last part of the per-byte algorithm is simply to swap the buffers so that they are ready for the next iteration.
// Swap arrays to get ready for next pass
u32* temp = intArr;
intArr = swap;
swap = temp;
And once we have processed each byte we free the memory we previously allocated at the beginning of the function.
// Clean up allocated memory
delete[] swap;
Deeper dive into our decimal example
With all this in mind let's look at our decimal example again and go through the algorithm in a little more detail. We start with the same 4 numbers as before: 010, 045, 100, 009.
First, our counts buffer: in this case it's just an array of 10 entries, as there are only 10 different digit values (as opposed to 256 byte values). We go through and count the number of times each digit occurs, which gives us:
2 (010 and 100), 0, 0, 0, 0, 1 (045), 0, 0, 0, 1 (009).
Next we turn this into an accumulation buffer using the algorithm we went over above:
0, 2, 2, 2, 2, 2, 3, 3, 3, 3
We now go through our elements again.
With 010 we look at the 0th element of the counts array, use that as our index into the swap buffer and then increment it. Resulting in the following setup:
Swap = {010, xxx, xxx, xxx}
Counts = {1, 2, 2, 2, 2, 2, 3, 3, 3, 3}
With 045 we do the same with the 5th element, which is originally a 2:
Swap = {010, xxx, 045, xxx}
Counts = {1, 2, 2, 2, 2, 3, 3, 3, 3, 3}
100 is similar to 010 but the value at the 0th index is now 1 instead of 0:
Swap = {010, 100, 045, xxx}
Counts = {2, 2, 2, 2, 2, 3, 3, 3, 3, 3}
Finally with 009:
Swap = {010, 100, 045, 009}
Counts = {2, 2, 2, 2, 2, 3, 3, 3, 3, 4}
Then it's on to the next iteration, using the previous Swap buffer as our array and looking at the next digit:
Counts: 2 (100 and 009), 1 (010), 0, 0, 1 (045), 0, 0, 0, 0, 0
After Accumulation: 0, 2, 3, 3, 3, 4, 4, 4, 4, 4
010
Swap = {xxx, xxx, 010, xxx}
Counts = {0, 3, 3, 3, 3, 4, 4, 4, 4, 4}
100
Swap = {100, xxx, 010, xxx}
Counts = {1, 3, 3, 3, 3, 4, 4, 4, 4, 4}
045
Swap = {100, xxx, 010, 045}
Counts = {1, 3, 3, 3, 4, 4, 4, 4, 4, 4}
009
Swap = {100, 009, 010, 045}
Counts = {2, 3, 3, 3, 4, 4, 4, 4, 4, 4}
And the last digit:
Counts: 3 (009, 010 and 045), 1 (100), 0, 0, 0, 0, 0, 0, 0, 0
After Accumulation: 0, 3, 4, 4, 4, 4, 4, 4, 4, 4
100
Swap = {xxx, xxx, xxx, 100}
Counts = {0, 4, 4, 4, 4, 4, 4, 4, 4, 4}
009
Swap = {009, xxx, xxx, 100}
Counts = {1, 4, 4, 4, 4, 4, 4, 4, 4, 4}
010
Swap = {009, 010, xxx, 100}
Counts = {2, 4, 4, 4, 4, 4, 4, 4, 4, 4}
045
Swap = {009, 010, 045, 100}
Counts = {3, 4, 4, 4, 4, 4, 4, 4, 4, 4}
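And if you'd rather let the machine check the full 256-bucket version, a small harness like this works (a sketch, assuming the RadixSort, u32 typedef and helpers from above):

#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

int main()
{
    // Fill a buffer with random unsigned 32 bit integers
    std::vector<u32> data(100000);
    std::mt19937 rng(1234);
    for (u32& v : data)
        v = rng();
    // Sort in place (the function writes the result back into our buffer)
    RadixSort(data.data(), (u32)data.size());
    assert(std::is_sorted(data.begin(), data.end()));
    return 0;
}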
And there we go, a sorted array without a single comparison between the elements. Now, let's look at how we can modify it to work with signed integers and, finally, floats.
Signed ints
We mentioned above that we'd use unsigned integers first because they are less of an abstraction away from the actual data than signed integers. Specifically, an unsigned integer with a larger value in the Most Significant Byte is always larger than one with a smaller value there. Signed integers, however, have a slight quirk at the Most Significant bit; let's look at that.
Two's Complement
Signed integers come with the added complexity of needing to represent negative numbers. The abstraction ints use is called Two's Complement and all it does is negate the value of the most significant bit. So, if we had a three bit unsigned int we'd have bit values 4 2 1, but our three bit signed int would have bit values -4 2 1, giving us the values below:
bits = unsigned = signed
000 = 0 = 0
001 = 1 = 1
010 = 2 = 2
011 = 3 = 3
100 = 4 = -4
101 = 5 = -3
110 = 6 = -2
111 = 7 = -1
This may seem weird but there are some benefits. First and most obviously, positive numbers have the same representation as their unsigned variant (as you can see in the first four entries above). Secondly, besides the transition from 3 to -4, all values increase as the bits increase. That's all that really matters for the sake of this article.
The code
Knowing that only the Most Significant bit behaves differently for signed integers, we can safely run the same algorithm as before over the Least Significant 3 bytes. The difference is in how we treat that last byte.
// Assumed typedef, matching u32 above
typedef int32_t i32;

void RadixSort(i32* signedIntArr, u32 numInts)
{
    // We will store the counts in here and then "accumulate" them in order to
    // turn them into iterators
    u32 iters[256];
    // Our implementation is not inplace so we need a swap buffer
    u32* swap = new u32[numInts];
    // We're going to treat our signedIntArr as an unsigned one
    u32* intArr = reinterpret_cast<u32*>(signedIntArr);
    // 3 passes, one per byte. The last byte needs to be modified to account
    // for two's complement
    for (u32 idx = 0; idx < 3; ++idx)
    {
        // Set all iterators/counts to 0
        ClearCounts(iters);
        // Increment count for corresponding bucket
        for (u32 jdx = 0; jdx < numInts; ++jdx)
        {
            u32 bucket = (intArr[jdx] >> (idx * 8)) & 0xffu;
            iters[bucket]++;
        }
        // Translate counts into iterator
        AccumulateCounts(iters);
        // Place values at their proper spot based off its bucket's iterator
        for (u32 jdx = 0; jdx < numInts; ++jdx)
        {
            u32 value = intArr[jdx];
            u32 bucket = (value >> (idx * 8)) & 0xffu;
            u32 loc = iters[bucket]++;
            swap[loc] = value;
        }
        // Swap arrays to get ready for next pass
        u32* temp = intArr;
        intArr = swap;
        swap = temp;
    }
    // We modify our last pass in order to account for two's complement.
    // What this boils down to is flipping the Most Significant bit of
    // every element when doing bucket calculations
    ClearCounts(iters);
    for (u32 jdx = 0; jdx < numInts; ++jdx)
    {
        u32 bucket = (intArr[jdx] >> 24) & 0xffu;
        bucket ^= 0x80;
        iters[bucket]++;
    }
    AccumulateCounts(iters);
    for (u32 jdx = 0; jdx < numInts; ++jdx)
    {
        u32 value = intArr[jdx];
        u32 bucket = (value >> 24) & 0xffu;
        bucket ^= 0x80;
        u32 loc = iters[bucket]++;
        swap[loc] = value;
    }
    // One final swap so the sorted data ends up back in the caller's array
    u32* temp = intArr;
    intArr = swap;
    swap = temp;
    // Clean up allocated memory
    delete[] swap;
}
The code is near identical; there is in fact only one real difference, in how we treat that last byte. In each of the final element loops we modify the bucket with bucket ^= 0x80. All that does is flip the Most Significant bit of the byte which, in the case of the Most Significant Byte, is the Most Significant bit of the whole integer. Keep in mind this calculation only affects our lookup; it does not modify the value that we actually store.
There are a lot more lines of code for "the code is near identical"; this is because we pulled the last byte iteration out of the loop. If you look, the code outside the for loop looks a lot like the code within it.
Let's look at our 3 bit signed integer and see how this works. Let's take the numbers -3, 2 and 0, or in binary 101, 010 and 000. If we flip the most significant bit we get -3 = 001, 2 = 110 and 0 = 100, which matches the order we expect: -3 < 0 < 2 -> 001 < 100 < 110.
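The same trick at full width, as a runnable check (my example values, not the article's): XOR the sign bit and the keys come out in increasing unsigned order.

#include <cstdint>
#include <cstdio>

int main()
{
    int32_t values[] = { -3, 0, 2 };
    for (int32_t v : values)
    {
        // Flip the Most Significant bit, exactly as the bucket calculation does
        uint32_t key = (uint32_t)v ^ 0x80000000u;
        printf("%d -> 0x%08x\n", v, key);
    }
    // prints: -3 -> 0x7ffffffd, 0 -> 0x80000000, 2 -> 0x80000002
    return 0;
}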
Now, let's take it one step further.
Floating Points
I like to think I have a decent couple of articles going over floating point; I'd suggest reading the Floating Point Basics before this section. Alternatively, there are many (better) resources going over FP, Bruce Dawson's articles being my favourite.
What we care about in regards to Floating Point representation is that it doesn't quite work the same way two's complement does. Positive numbers can be sorted the same way (as signed or unsigned integers) because the exponent, which determines the size of a number more than the mantissa does, is in a more Significant position. The issue is, unsurprisingly, with negative numbers. Instead of subtracting the value of the Most Significant bit, floating point mirrors based on that bit.
1.0f is stored as 0x3f800000; to get -1.0f you just flip the most significant bit, which gives you 0xbf800000. 2.0f is stored as 0x40000000 and -2.0f is 0xc0000000. As you can see, while -2.0f is less than -1.0f, its value when interpreted as an unsigned integer is greater (0xc0000000 > 0xbf800000); this wasn't true of -2 and -1 as signed integers (0xfffffffe and 0xffffffff respectively).
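You can verify those bit patterns yourself; memcpy is the well-defined way to reinterpret a float's bits (std::bit_cast in C++20):

#include <cstdio>
#include <cstring>

int main()
{
    float values[] = { 1.0f, -1.0f, 2.0f, -2.0f };
    for (float f : values)
    {
        unsigned int bits;
        memcpy(&bits, &f, sizeof(bits));
        printf("%5.1f -> 0x%08x\n", f, bits);
    }
    // prints: 1.0 -> 0x3f800000, -1.0 -> 0xbf800000,
    //         2.0 -> 0x40000000, -2.0 -> 0xc0000000
    return 0;
}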
All this means, however, is that we use our signed integer approach and then do some quick post-processing on the negative numbers.
void RadixSort(float* floatArr, u32 numFloats)
{
    // We will store the counts in here and then "accumulate" them in order to
    // turn them into iterators
    u32 iters[256];
    // Our implementation is not inplace so we need a swap buffer
    u32* swap = new u32[numFloats];
    // We're going to treat our floatArr as an unsigned int array
    u32* intArr = reinterpret_cast<u32*>(floatArr);
    // 3 passes, one per byte. The last byte needs to be modified to account
    // for floating point representation
    for (u32 idx = 0; idx < 3; ++idx)
    {
        // Set all iterators/counts to 0
        ClearCounts(iters);
        // Increment count for corresponding bucket
        for (u32 jdx = 0; jdx < numFloats; ++jdx)
        {
            u32 bucket = (intArr[jdx] >> (idx * 8)) & 0xffu;
            iters[bucket]++;
        }
        // Translate counts into iterator
        AccumulateCounts(iters);
        // Place values at their proper spot based off its bucket's iterator
        for (u32 jdx = 0; jdx < numFloats; ++jdx)
        {
            u32 value = intArr[jdx];
            u32 bucket = (value >> (idx * 8)) & 0xffu;
            u32 loc = iters[bucket]++;
            swap[loc] = value;
        }
        // Swap arrays to get ready for next pass
        u32* temp = intArr;
        intArr = swap;
        swap = temp;
    }
    // We modify our last pass in order to account for fp representation.
    // What this boils down to is flipping the Most Significant bit of
    // every element when doing bucket calculations. This is the same
    // as two's complement; however, we will run more logic afterwards to
    // account for the differences between two's complement and fp representation
    ClearCounts(iters);
    for (u32 jdx = 0; jdx < numFloats; ++jdx)
    {
        u32 bucket = (intArr[jdx] >> 24) & 0xffu;
        bucket ^= 0x80;
        iters[bucket]++;
    }
    AccumulateCounts(iters);
    for (u32 jdx = 0; jdx < numFloats; ++jdx)
    {
        u32 value = intArr[jdx];
        u32 bucket = (value >> 24) & 0xffu;
        bucket ^= 0x80;
        u32 loc = iters[bucket]++;
        swap[loc] = value;
    }
    u32* temp = intArr;
    intArr = swap;
    swap = temp;
    // Clean up allocated memory
    delete[] swap;
    // In order to account for negative floating points decreasing in
    // value (-2.0f < -1.0f) while the bit representation increases
    // (0xc0000000 > 0xbf800000) we need to mirror our array over the mid
    // point
    u32 firstPositive = iters[0x7f];
    for (u32 idx = 0; idx < firstPositive / 2; ++idx)
    {
        u32 temp = intArr[idx];
        intArr[idx] = intArr[firstPositive - idx - 1];
        intArr[firstPositive - idx - 1] = temp;
    }
}
The important part is right at the end. First, we want to iterate only over the negative numbers, so we grab the iterator for the last negative bucket (0x7f; remember we flip the 0x80 bit in our algorithm, so negative numbers land in buckets without that bit set). A side effect of our algorithm is that the iterator of the last negative bucket ends up being the index of the first positive number.
We then iterate from the beginning to the halfway point, swapping the element at index idx with the element at index firstPositive - idx - 1; firstPositive - 1 being another way of saying "last negative". And we're done. Surprisingly, doing this postprocess is faster than calculating the correct index for negative numbers in the last pass; maybe we'll look into that in a future article.
Wrap up
What makes Radix Sort so interesting is its main pro and con: it's one of the best scaling sorts, but it's not general purpose. We were able to extend it to signed integers and floating point with relative ease, but arbitrary classes wouldn't be possible: you can't really provide a comparison algorithm for a sort that doesn't compare.
In regards to scaling there isn't much to say: Radix Sort has no pathological worst case and scales at O(nk), where k is the number of bytes. With fixed-size keys that puts it in direct competition with O(n * log(n)) sorts. In tests, Radix Sort ran about 4x as fast as Quicksort (which is O(n * log(n))) on randomly generated sets; simulating the worst case, however, has Quicksort (O(n^2)) scaling at a much worse rate, running 1000x as slow as Radix at 100000 elements and almost 2500x slower at 300000.
- Regular Case:
  - 100000
    - Radix: 1.96ms
    - Quick: 8.40ms
  - 200000
    - Radix: 3.5ms
    - Quick: 17.3ms
  - 300000
    - Radix: 5.82ms
    - Quick: 26.71ms
  - 1000000
    - Radix: 19.34ms
    - Quick: 78.64ms
  - 10000000
    - Radix: 198.01ms
    - Quick: 795.10ms
  - 100000000
    - Radix: 1837.22ms
    - Quick: 7380.67ms
- Worst Case:
  - 100000
    - Radix: 1.95ms
    - Quick: 1740.75ms
  - 200000
    - Radix: 3.7ms
    - Quick: 6535.66ms
  - 300000
    - Radix: 5.90ms
    - Quick: 14851.15ms
These numbers are somewhat useless on their own but at least show a rough comparison between a Radix Sort and a Quick Sort. Quick Sort has many strengths of its own and would be a great topic for another time; it was chosen here because its worst case was very easy to simulate (an already sorted list) and because of its general prevalence.
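For what it's worth, a harness along these lines can produce comparable figures (hypothetical, not the one used above; note std::sort is introsort, so it won't show quicksort's O(n^2) behaviour on sorted input):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    // Two identical random buffers, one per sort (u32/RadixSort from above)
    std::vector<u32> radixData(1000000);
    std::mt19937 rng(42);
    for (u32& v : radixData)
        v = rng();
    std::vector<u32> stdData = radixData;

    auto t0 = std::chrono::steady_clock::now();
    RadixSort(radixData.data(), (u32)radixData.size());
    auto t1 = std::chrono::steady_clock::now();
    std::sort(stdData.begin(), stdData.end());
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    printf("Radix: %.2fms, std::sort: %.2fms\n", ms(t0, t1), ms(t1, t2));
    return 0;
}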
I suppose an important question to answer is "When do I use Radix Sort?" and the answer is basically: whenever you can, for large enough lists. Radix Sort has some fixed costs (clearing and accumulating the iters buffer) that make it not ideal for very small numbers of elements, but that cost quickly becomes irrelevant as the element count grows. The main downside of Radix Sort is its inflexibility: you won't always be able to use it.
So there we are; hopefully at this point you have a pretty good idea of how Radix Sort (or at least this implementation of it) works. Also worth pointing out is that we were able to extend this algorithm to both signed integers and floats by understanding their bit layout. While you can get through programming just accepting that things work a specific way, there is always a benefit to spending time understanding the minutiae.