Parallel Sorting in Theory and in Practice I
We’re going to begin our discussion of parallel algorithms. We’ll do this by giving a parallel version of quick_sort_plus. In the next entry (Parallel Sorting in Theory and in Practice (II)), we’ll introduce another sorting algorithm, merge_sort, and show how it can be made to run in parallel in a couple of different ways. The first is simpler, but isn’t as fast as the second, more complicated implementation. The parallel function we provide in this entry probably shouldn’t be used in a real-world situation. It’s not actually that fast. What’s more, the one advantage that the serial version of quick_sort has over merge_sort, the fact that it doesn’t require much memory to work, doesn’t even hold here. That’s because it calls a function, parallel_pivot, which has to make a copy of the input array, and so it uses about as much extra memory as merge_sort does. We start with parallel_quick_sort_plus because it’s simpler to understand than parallel_merge_sort.
Here’s a parallel implementation of quick_sort_plus in C++.
Parallel quick_sort_plus()
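In outline, the function looks something like the sketch below. This is a minimal sketch rather than a drop-in listing: the 1024-element cutoff for falling back to std::sort, the helper signatures, and the pass that groups elements equal to the pivot are all assumptions.

```cpp
#include <omp.h>
#include <algorithm>
#include <utility>

// Helpers sketched later in this entry; these signatures are assumptions.
double parallel_kth_smallest(double *array, int length, int k);
int parallel_pivot(double *array, int length, double value);

void parallel_quick_sort_plus(double *array, int length)
{
    // Below a (made-up) cutoff, a serial sort is cheaper than spawning threads.
    if (length < 1024) {
        std::sort(array, array + length);
        return;
    }

    // The median (the length/2-th smallest element, zero-indexed), so the
    // pivot always splits the array evenly.
    double median = parallel_kth_smallest(array, length, length / 2);

    // Move everything strictly less than the median to the front; `less` is
    // how many such elements there are.
    int less = parallel_pivot(array, length, median);

    // Group the elements equal to the median right after the "less than"
    // block, so arrays full of duplicate keys cannot recurse forever.
    int mid = less;
    for (int i = less; i < length; i++)
        if (array[i] == median)
            std::swap(array[i], array[mid++]);

    // The "bizarre" loop: a parallel for over two iterations with an if/else
    // on the loop index, so the two recursive calls run concurrently (each
    // branch can itself spawn threads once nested parallelism is enabled).
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < 2; i++) {
        if (i == 0)
            parallel_quick_sort_plus(array, less);
        else
            parallel_quick_sort_plus(array + mid, length - mid);
    }
}
```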
We use omp.h to make it run in parallel. You probably noticed the bizarre use of for (i = 0; i < 2; i++) with an if/else statement. We didn’t want to write it that way, but we couldn’t get nested parallelism to work when we tried implementing the function with parallel sections or tasks. Without nested parallelism, the recursive calls wouldn’t run in parallel.
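Related to that, nested parallelism is usually disabled by default, so it has to be switched on before the first call. Here is a hypothetical driver; the array size, the seed and the level limit passed to omp_set_max_active_levels are arbitrary choices, not our test setup.

```cpp
#include <omp.h>
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

void parallel_quick_sort_plus(double *array, int length);  // sketched above

int main()
{
    // Without this, the inner "parallel for" in each recursive call would
    // collapse to a single thread.
    omp_set_max_active_levels(8);

    // Fill an array with random values, sort it, and check the result.
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> data(1 << 20);
    for (double &x : data)
        x = dist(gen);

    parallel_quick_sort_plus(data.data(), (int)data.size());

    std::printf("sorted: %s\n",
                std::is_sorted(data.begin(), data.end()) ? "yes" : "no");
    return 0;
}
```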
There’s nothing earth-shattering going on here. If you can follow the serial version of quick_sort_plus (given in Finding the Median and an Application to Sorting), then this should make sense.
You’ll notice we call parallel_kth_smallest; it’s a bit more complicated than the serial version. It provides the evenly-split pivot, which makes quick_sort_plus run in a guaranteed \(O(n\log_2(n))\) steps.
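To spell out where that bound comes from: finding the median and pivoting around it both take a number of steps proportional to \(n\), and each recursive call gets about half of the array, so the run time satisfies
\[ T(n) = cn + 2T\left(\left\lceil \frac{n}{2}\right\rceil\right), \]
which solves to \(O(n\log_2(n))\).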
Here’s our implementation of that function.
parallel_kth_smallest()
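Here is a minimal sketch of how such a function could look. The groups of five, the zero-indexed \(k\), and the pass that groups elements equal to the pivot are assumptions made for the sketch, not details fixed by this entry.

```cpp
#include <omp.h>
#include <algorithm>
#include <utility>
#include <vector>

int parallel_pivot(double *array, int length, double value);  // sketched below

// Returns the k-th smallest element (zero-indexed) of array[0..length),
// rearranging array as a side effect.
double parallel_kth_smallest(double *array, int length, int k)
{
    if (length <= 5) {
        std::sort(array, array + length);
        return array[k];
    }

    // Compute the median of every group of five elements in parallel.
    // Because the groups are processed concurrently, the medians cannot be
    // compacted into the front of array; they go into a separate buffer.
    int groups = (length + 4) / 5;
    std::vector<double> medians(groups);
    #pragma omp parallel for
    for (int g = 0; g < groups; g++) {
        int begin = 5 * g;
        int end = std::min(begin + 5, length);
        std::sort(array + begin, array + end);
        medians[g] = array[begin + (end - begin) / 2];
    }

    // Recursively find the median of the medians, then pivot around it.
    double mom = parallel_kth_smallest(medians.data(), groups, groups / 2);
    int less = parallel_pivot(array, length, mom);

    // Group elements equal to the pivot right after the "less than" block so
    // the recursion always shrinks, even with duplicate keys.
    int mid = less;
    for (int j = less; j < length; j++)
        if (array[j] == mom)
            std::swap(array[j], array[mid++]);

    if (k < less)
        return parallel_kth_smallest(array, less, k);
    if (k < mid)
        return mom;
    return parallel_kth_smallest(array + mid, length - mid, k - mid);
}
```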
The main difference between this and the serial version is that we compute the medians array in parallel. Because we’re computing in parallel, we cannot move the medians to the initial segment of array, so we have to allocate a new array for them. We’ll have to allocate another array for similar reasons in parallel_pivot.
Here’s our implementation of that function.
parallel_pivot()
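A minimal sketch along the lines described below. The single enclosing parallel region and the explicit schedule(static) clauses are there to guarantee that the counting pass and the writing pass hand the same elements to the same threads; the exact names and signature are assumptions.

```cpp
#include <omp.h>
#include <vector>

// Moves every element of array[0..length) that is strictly less than value in
// front of the rest, in parallel, and returns how many such elements there are.
int parallel_pivot(double *array, int length, double value)
{
    int threads = omp_get_max_threads();

    // One counter per thread in each array, plus a leading slot, so that after
    // the prefix-sum step smaller_count[t] is the write offset for thread t.
    std::vector<int> smaller_count(threads + 1, 0);
    std::vector<int> larger_count(threads + 1, 0);
    std::vector<double> array_copy(length);
    int total_smaller = 0;

    #pragma omp parallel num_threads(threads)
    {
        int t = omp_get_thread_num();

        // Pass 1: count how many smaller / other elements this thread sees.
        #pragma omp for schedule(static)
        for (int i = 0; i < length; i++) {
            if (array[i] < value)
                smaller_count[t + 1]++;
            else
                larger_count[t + 1]++;
        }

        // The "integral" (prefix sum) over the per-thread counts, done once.
        #pragma omp single
        {
            for (int j = 0; j < threads; j++) {
                smaller_count[j + 1] += smaller_count[j];
                larger_count[j + 1] += larger_count[j];
            }
            total_smaller = smaller_count[threads];
        }

        // Pass 2: each thread writes its smaller elements starting at
        // smaller_count[t], and its other elements starting right after all
        // of the smaller ones, at total_smaller + larger_count[t].  No two
        // threads ever write to the same slot of array_copy.
        #pragma omp for schedule(static)
        for (int i = 0; i < length; i++) {
            if (array[i] < value)
                array_copy[smaller_count[t]++] = array[i];
            else
                array_copy[total_smaller + larger_count[t]++] = array[i];
        }

        // Pass 3: copy the pivoted data back into the caller's array.
        #pragma omp for schedule(static)
        for (int i = 0; i < length; i++)
            array[i] = array_copy[i];
    }

    return total_smaller;
}
```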
We perform a “pivot” in parallel, i.e. move everything less than value to the left section of array and everything else to the right, by first creating two arrays of counters, with one counter in each array for every thread. We’re going to have each thread count how many smaller values it sees and how many larger values it sees. This obviously works in parallel, because each thread is updating different elements of the two arrays. Then, once we’re done with that, we perform what is essentially a prefix sum, a sort of discrete integral. We end up with
\[ \text{smaller\_count}[i] = \sum\limits_{j = 0}^{i-1} \text{smaller}_j\]
where \(\text{smaller}_j\) is the number of elements less than value seen by thread \(j\). It’s similar for \(\text{larger\_count}[i]\).
As long as the scheduling of the threads over the elements of array stays the same from one for loop to the next, this lets us move the elements from array to array_copy without having to worry about any two threads writing to the same place. Fortunately for us, the static schedule, which is the default in most OpenMP implementations, assigns the same iterations to the same threads in loops of the same length. If you’re using a different OpenMP implementation, or just want to make sure this works properly, you should set schedule(static) yourself.
This moving is what we do in the next for loop, and then there’s a final for loop which copies the values of array_copy back to array, so that it’s in the state it should be.
This function returns smaller_count[threads], which is equal to the total number of elements of array which are strictly less than value. With this, we have made quick_sort_plus parallel.
Unfortunately, there’s not much of a reason to use parallel_quick_sort_plus, because it’s based on quick_sort_plus, which is quite slow for an \(O(n\log_2(n))\) algorithm. We ran a test, sorting 200 million elements. Ordinary quick_sort finished in 25.4 seconds, while our parallel_quick_sort_plus took a whopping 87.7 seconds with 8 cores and 16 threads (Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz). That’s not a lot of threads, but even if you’re running on a computer with hundreds or perhaps thousands of threads, there’s still no point in using this function, because a parallel merge_sort would be faster.
We’ll implement merge_sort in the next entry. For the rest of this entry, we will revisit the “Almost the Master Theorem”. We originally proved it in More on Sorting and Analyzing Run Time Complexity.
Let’s consider what happens when \(T(n) = f(n) + aT(\left\lceil \frac{n}{b}\right\rceil)\) is a recurrence relation, \(f(n) = cn^{\alpha}\log_2(n)\) and \(a = b^{\alpha}\). In this case the “Almost the Master Theorem” tells us that if we instead consider \(f_{\epsilon}(n) = cn^{\alpha + \epsilon}\), for some tiny \(\epsilon > 0\), then \(T(n) \in O(f_{\epsilon}(n)) \). That isn’t quite satisfying enough.
Notice that when we consider \(T(b^k)\), as we did in More on Sorting and Analyzing Run Time Complexity (equation 5), we get
\[ T(b^k) = c\sum\limits_{i=0}^{k} a^{i}(b^{\alpha})^{k-i} \log_b(b^{k-i})\log_2(b) \tag{1}\label{1}\]
which is identical to equation (5) from that entry, except for the \(\log\) terms at the end. Also, notice that \(\log_2(b)\) can be absorbed into the constant \(c\), so the base of the logarithm can be changed to \(b\) without affecting much.
Since we assumed that \(a = b^{\alpha}\), we have
\[ a^{i}(b^{\alpha})^{k-i} = (b^{\alpha})^{i} (b^{\alpha})^{k-i} = (b^{\alpha})^k = (b^k)^{\alpha} \tag{2}\label{2}\]
If we let \(c' = c\log_2(b)\), then combining equations (1) and (2) gives us
\[ c'(b^k)^{\alpha} \sum\limits_{i=0}^{k} \log_b(b^{k-i}) \tag{3}\label{3} \]
which is easily seen to equal \[ c'(b^k)^{\alpha} \sum\limits_{i=0}^{k} (k-i) \tag{4}\label{4} \]
If we do the sum in the opposite order, we get
\[ c'(b^k)^{\alpha} \sum\limits_{i=0}^{k} i \tag{5}\label{5} \]
a nice arithmetic sum, which is equal to
\[ c'(b^k)^{\alpha} \frac{k(k+1)}{2} \tag{6}\label{6}\] Substituting \(n = b^k\) then gives us \[ T(n) = c'n^{\alpha} \frac{ \log_b(n)(\log_b(n) + 1) }{ 2 } \tag{7}\label{7}\]
So far, we’ve only shown that this holds when \(n\) is a power of \(b\), but in a very similar way to how we proved case 3 of the “Almost the Master Theorem” in More on Sorting and Analyzing Run Time Complexity, combining lemma (2) with lemma (1) from that entry, we have
\[ T(n) \leq c'b^{\alpha}n^{\alpha} \left[ \frac{(\log_b(n)+1)\log_b(n)}{2} + \log_b(n) + 1 \right] \tag{8}\label{8}\]
This can be shown to be in
\[O(\log_2(n)^{2} n^{\alpha}) \tag{9}\label{9}\]
So, for now, this will be case 4 of the “Almost the Master Theorem”.
To recap:
Case #4 (almost the master theorem)
Let \(T(n) = f(n) + aT(\left\lceil \frac{n}{b}\right\rceil) \), where \( f(n) = cn^{\alpha}\log_2(n)\) and \(a = b^{\alpha}\). Then
\(T(n) \in O(\log_2(n)^{2}n^{\alpha}) \)
Proof
Given above.
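As a quick illustration of how the new case gets used (the recurrence below is just an example, not one taken from this entry): take
\[ T(n) = cn\log_2(n) + 2T\left(\left\lceil \frac{n}{2}\right\rceil\right). \]
Here \(a = 2\), \(b = 2\) and \(\alpha = 1\), so \(a = b^{\alpha}\), and case 4 tells us that \(T(n) \in O(n\log_2(n)^{2})\). This is the kind of recurrence you get when the combining step of a divide-and-conquer algorithm takes \(cn\log_2(n)\) steps.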
That’s it for this entry. We continue this discussion in the next entry, Parallel Sorting in Theory and in Practice (II).