Parallel Sorting in Theory and in Practice III
As promised, this is the last of a three-part series of entries on sorting in parallel. If you haven’t yet, check out the first two here: Parallel Sorting in Theory and in Practice (I) and Parallel Sorting in Theory and in Practice (II). Here we present a parallel implementation of merge_sort which runs in \(O(n \log_2(n))\) time and achieves fairly good parallelism.
This implementation of parallel_merge_sort is identical to the one presented in part II, except for the parallel_merge function. In the original implementation, we had two parallel for loops: one which iterated over i and calculated j and k, and another which iterated over j and calculated i and k. In the updated version of merge_sort, which we’ll call “merge_sort2”, we select several almost evenly-spaced k’s, and then calculate both i and j for each of those k’s. In fact, no matter how large the input arrays are, we only calculate i and j for a fixed number of k’s. These are the endpoints of non-overlapping intervals. The threads loop over these intervals so that each handles a different subset of them. This way we split the total work involved in merging into several independent sub-arrays. Since we do it for only a fixed number of k’s, no matter how large the input arrays are, we add at most a constant number of these extra calculations. The awesome part, though, is that calculating each pair of i and j takes only \(O(\log_2(n))\) steps, so we aren’t adding much extra work at all.
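For instance (the numbers here are purely for illustration), if the merged output has length 103 and we decide to use 8 evenly-spaced interior endpoints, those would be \(k = 12, 24, 36, 48, 60, 72, 84, 96\); together with the trivial endpoints 0 and 103, they carve the output into eight intervals of 12 elements plus a final interval of 7.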
The real magic of merge_sort2 comes from the function which calculates the i’s and the j’s. We call it kth_element. The name might seem a bit odd at first, but it does make sense. What it does is, given two sorted arrays, array1 and array2, of lengths length1 and length2, respectively, and a k, compute the index of the \(k^{th}\) element in their combined, merged order. It will also return a value, array_number, which identifies which array contains the \(k^{th}\) element.
We’ve implemented it in C++ as a void function which uses the & operator to change the values of index1, index2, and array_number by reference, while making the call site look as though they’re passed as values. If that bothers you (and we can understand why it might), feel free to implement it as a function which returns a struct or class containing all those values. We certainly wouldn’t argue that we’ve implemented anything in the best possible style.
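For instance, a thin wrapper along these lines would do it (the signature is the same hypothetical one used in the sketch further down):

```cpp
// A hypothetical struct-returning interface for the same computation.
struct KthResult {
    long index1;        // index into array1
    long index2;        // index into array2
    int  array_number;  // 1 or 2: which array holds the k-th element
};

// The by-reference version discussed in the text (defined further down).
void kth_element(const int* array1, long length1,
                 const int* array2, long length2, long k,
                 long& index1, long& index2, int& array_number);

KthResult kth_element_struct(const int* array1, long length1,
                             const int* array2, long length2, long k) {
    KthResult result;
    kth_element(array1, length1, array2, length2, k,
                result.index1, result.index2, result.array_number);
    return result;
}
```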
Here’s our implementation of kth_element, in sketch form.
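The element type, the helper functions, and some details of the case handling below are illustrative rather than definitive; the labels in the comments, though, correspond to the cases referenced in the discussion that follows.

```cpp
// Number of elements of the sorted array arr that are strictly less than
// value: the largest index j such that every index below j holds an element
// strictly less than value (a plain binary search).
long least_upper_bound(const int* arr, int value, long length) {
    long lo = 0, hi = length;
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        if (arr[mid] < value) lo = mid + 1;
        else                  hi = mid;
    }
    return lo;
}

// Index of the first element of the sorted array arr that is strictly
// greater than value (length if there is none).
long min_greater_than(const int* arr, int value, long length) {
    long lo = 0, hi = length;
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        if (arr[mid] <= value) lo = mid + 1;
        else                   hi = mid;
    }
    return lo;
}

// Given two sorted arrays and a rank k (0-based, in the merged order where
// equal elements of array1 come before those of array2), find index1 and
// index2 with index1 + index2 == k such that the first k merged elements are
// exactly array1[0..index1) and array2[0..index2); array_number reports
// whether the k-th merged element itself is array1[index1] or array2[index2].
void kth_element(const int* array1, long length1,
                 const int* array2, long length2, long k,
                 long& index1, long& index2, int& array_number) {
    // Base case (a): array1 is empty, so the k-th element is array2[k].
    if (length1 == 0) { index1 = 0; index2 = k; array_number = 2; return; }
    // Base case (b): array2 is empty, so the k-th element is array1[k].
    if (length2 == 0) { index1 = k; index2 = 0; array_number = 1; return; }
    // Base case (c): k == 0, just compare the two smallest elements.
    if (k == 0) {
        index1 = index2 = 0;
        array_number = (array1[0] <= array2[0]) ? 1 : 2;
        return;
    }
    // Base case (d): both arrays have a single element (so k must be 1 here).
    if (length1 == 1 && length2 == 1) {
        if (array1[0] <= array2[0]) { index1 = 1; index2 = 0; array_number = 2; }
        else                        { index1 = 0; index2 = 1; array_number = 1; }
        return;
    }
    // Pick a pivot from the middle of the longer array and locate it in the
    // other array, respecting the array1-before-array2 tie rule.
    bool pivot_in_1 = (length1 >= length2);
    long n1, n2;
    if (pivot_in_1) {
        // Search case (s-1): the pivot is array1[n1].
        n1 = length1 / 2;
        n2 = least_upper_bound(array2, array1[n1], length2);
    } else {
        // Search case (s-2): the pivot is array2[n2].
        n2 = length2 / 2;
        n1 = min_greater_than(array1, array2[n2], length1);
    }
    long rank = n1 + n2;                 // rank of the pivot in the merged order
    if (k == rank) {                     // the pivot is exactly the k-th element
        index1 = n1; index2 = n2;
        array_number = pivot_in_1 ? 1 : 2;
        return;
    }
    if (k < rank) {
        // Recursive case (r-1): the k-th element precedes the pivot, so the
        // pivot and everything after it (in either array) can be discarded.
        kth_element(array1, n1, array2, n2, k, index1, index2, array_number);
        return;
    }
    // Recursive case (r-2): the k-th element follows the pivot; discard the
    // pivot and everything before it, then shift the answer back by the
    // amount discarded from each array.
    long skip1 = pivot_in_1 ? n1 + 1 : n1;
    long skip2 = pivot_in_1 ? n2 : n2 + 1;
    kth_element(array1 + skip1, length1 - skip1, array2 + skip2, length2 - skip2,
                k - skip1 - skip2, index1, index2, array_number);
    index1 += skip1;
    index2 += skip2;
}
```

The important invariant is that the search cases always split the longer of the two arrays in half, so the combined input shrinks geometrically with each recursive call.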
The function is a bit complicated, so we’ve labeled subsections with comments in the code. The first four cases are the base cases. Hopefully they all make sense. The first two just handle what happens if one of the arrays has no elements in it. That’s incredibly easy, because the other array is already ordered: the \(k^{th}\) element is just the element in the \(k^{th}\) position of that array. Easy. The third case is where \(k = 0\), so you just have to compare the two smallest elements of the two arrays. Easy. The fourth one got us in trouble when we forgot to include it. It’s actually possible for both arrays to have length \(1\), and if \(k\) is also equal to \(1\), then the recursive cases might not get any smaller. That’s because you’re then guaranteed to have n1 and n2 both equal to 0, so you end up in case (r-2). This gets you into an infinite loop, because if you look at the values of the inputs to the recursive call to kth_element, they’re all the same as the parent function’s inputs. Thus you need to include base case (d).
Now we’re into the search cases. There are two of them. They’re designed to guarantee that we eliminate at least \(\left\lfloor \frac{\max(\text{length1}, \text{length2})}{2} \right\rfloor\) elements before the next recursive step. We’ll come back to that in a little bit. For now, you might be wondering why we have to use two different functions to calculate n1 and n2. It’s for the same reason we had to use two different functions in the original parallel_merge given in the previous entry (Parallel Sorting in Theory and in Practice (II)): in imposing an ordering on the combined arrays, we have to have a consistent way of handling ties between the two arrays. For example, if array1[i] is equal to array2[j], we have to be consistent in deciding which comes before the other. We have arbitrarily chosen elements of array1 to come before elements of array2. Therefore, an element of array1 comes after exactly those elements of array2 which are strictly less than it. Here’s an example we hope is helpful.
Suppose array1 = [1,1,2,3,5,8,13] and array2 = [1,2,3,4,5,6,7]. Then length1 = length2, both equal to 7, so we’re in search case (s-1). This means that n1 = 3, and that means that array1[n1] is equal to 3. There’s also a 3 in array2! Since we’ve decided that elements of array1 come before elements of array2, that 3 (in array1) comes after the first two elements of array2, the sub-array [1,2] (but before the 3 in array2). This sub-array has length 2, which is exactly what is returned by least_upper_bound(array2, array1[n1], length2). That’s because least_upper_bound(array2, n, length2) is designed to return the largest index j such that every element of array2 at a smaller index is strictly less than n; equivalently, it returns the number of elements of array2 which are strictly less than n. In this way, n is a least (strict) upper bound of the sub-array of array2 with indices less than j.
On the other hand, suppose that array2 = [0,1,2,3,4,5,6,7], but that array1 is exactly as it was in the previous example. Then we’re no longer in search case (s-1); we’re now in search case (s-2). Now n2 is equal to 4, so we’re going to have to find all of the elements of array1 which come before array2[n2], which is equal to 3. Once again, 3 is an element in both arrays. We want to find all of the elements of array1 which come before this 3 in array2. We’ve decided that any 3 in array1 should come before this 3, so we want to find the index of the first element of array1 which is strictly greater than 3. That is exactly what min_greater_than(array1, array2[n2], length1) returns. In this case that will be 4, which is the index of the 5 in array1. It is also the length of the initial sub-array [1,1,2,3] of array1.
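Incidentally, these two helpers have exactly the contracts of the standard library’s binary searches, so if you’d rather lean on <algorithm> than hand-roll them, something like this works (the int element type and the argument order simply follow the calls shown above):

```cpp
#include <algorithm>

// Number of elements of the sorted array arr that are strictly less than value.
long least_upper_bound(const int* arr, int value, long length) {
    return std::lower_bound(arr, arr + length, value) - arr;
}

// Index of the first element of the sorted array arr strictly greater than value.
long min_greater_than(const int* arr, int value, long length) {
    return std::upper_bound(arr, arr + length, value) - arr;
}
```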
Now we return to considering how many elements we’ve eliminated going into the recursive step. Notice that either n1 = length1 / 2 or n2 = length2 / 2, depending on which array is larger. If that length is even, then half of it is exact. If that length is odd, then half of it in C++ (integer division) is equal to \( \left\lfloor\frac{\text{length}}{2}\right\rfloor \). This means that if we end up in recursive case (r-2), then the sub-array fed into the recursive call to kth_element will have length \(\left\lceil \frac{\text{length}}{2} \right\rceil\). The other array can have any length \(\leq \) its original length as fed into the function at the top.
For example, if we call kth_element with two arrays of lengths length1 and length2, where length1 \(\geq\) length2, then in the recursive call to kth_element the new length of array1 is \(\leq \left\lceil \frac{ \text{length1} }{2} \right\rceil \). What this means is that if we let \(n = \text{length1} + \text{length2} \), then the combined size of the inputs to the recursive call to kth_element will be \( \leq \left\lceil \frac{3n}{4} \right\rceil + 1\).
Proof
We can see that the combined size of the input arrays to the recursive calls is
\[ \leq n - \left\lfloor { \frac{ { \left\lceil \frac{n}{2} \right\rceil } }{2} } \right\rfloor \tag{1}\label{1} \]
If we let \(n = 4q + r\), where \(0 \leq r \leq 3\), then \( \left\lceil \frac{n}{2}\right\rceil \) is equal to \(2q + \left\lceil \frac{r}{2} \right\rceil \). Then, dividing by \(2\) and applying the floor function gets us
\[q + \left\lfloor { \frac{ { \left\lceil \frac{r}{2} \right\rceil } }{2} } \right\rfloor \]
This means that the right side of equation (1) is equal to
\[ 3q + r - \left\lfloor { \frac{ { \left\lceil \frac{r}{2} \right\rceil } }{2} } \right\rfloor \tag{2}\label{2} \]
Whereas \( \left\lceil \frac{3n}{4}\right\rceil + 1 \) is equal to
\[ 3q + \left\lceil \frac{3r}{4} \right\rceil + 1 \tag{3}\label{3} \]
Checking the cases where \(r = 0, 1, 2\), or \(3\), we see that \( n - \left\lfloor { \frac{ { \left\lceil \frac{n}{2} \right\rceil } }{2} } \right\rfloor \) is less than or equal to \( \left\lceil \frac{3n}{4} \right\rceil + 1\).
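Spelling out the four cases:
\[ \begin{array}{c|c|c} r & 3q + r - \left\lfloor { \frac{ { \left\lceil \frac{r}{2} \right\rceil } }{2} } \right\rfloor & 3q + \left\lceil \frac{3r}{4} \right\rceil + 1 \\ \hline 0 & 3q & 3q + 1 \\ 1 & 3q + 1 & 3q + 2 \\ 2 & 3q + 2 & 3q + 3 \\ 3 & 3q + 2 & 3q + 4 \end{array} \]
In every row the middle column is at most the right-hand column.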
QED
Using this result we can show that for large enough \(n\), we have \( n - \left\lfloor { \frac{ { \left\lceil \frac{n}{2} \right\rceil } }{2} } \right\rfloor \) is less than or equal to \( \left\lceil \frac{n}{1.3} \right\rceil \). Indeed, \( \left\lceil \frac{3n}{4} \right\rceil + 1 \leq \frac{3n}{4} + 2 \leq \frac{n}{1.3} \) as soon as \( n \geq 104 \).
If we let \(T(n)\) represent, as usual, the number of steps it takes kth_element to finish on an input whose combined length is \(n\), then we have
\[T(n) \leq cn^0 + T(\left\lceil \frac{n}{1.3} \right\rceil) \tag{4}\label{4}\]
If we apply equation (4) above with the “Almost the Master” Theorem from More on Sorting and Analyzing Run Time Complexity, then we see that \(T(n) \in O( \log_2(n) ) \).
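If you’d rather not dig up that theorem, unrolling the recurrence directly (and glossing over the ceilings, which don’t change the order of growth) gives the same bound:
\[ T(n) \leq c + T\left( \frac{n}{1.3} \right) \leq 2c + T\left( \frac{n}{1.3^2} \right) \leq \cdots \leq c \log_{1.3}(n) + T(1), \]
and since \( \log_{1.3}(n) = \log_2(n) / \log_2(1.3) \), this is \( O(\log_2(n)) \).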
Now we’re ready to see how we use kth_element to make parallel_merge2 work. Here’s our implementation of parallel_merge2, again in sketch form.
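As with kth_element, the element type, the value of MERGE_BUCKET_MULTIPLIER, and the thread bookkeeping below are illustrative assumptions; the structure, though, is the one described above: a fixed number of split points found with kth_element, and independent serial merges between consecutive split points.

```cpp
#include <thread>
#include <vector>

// Assumed value; the post only tells us such a tunable constant exists.
const int MERGE_BUCKET_MULTIPLIER = 8;

// Merge two sorted arrays into output (pre-allocated with length1 + length2
// slots) using n_threads threads. Relies on the kth_element sketch above to
// find, for a handful of evenly spaced ranks k, the split points (i, j) with
// i + j == k; each interval between consecutive split points is then merged
// serially and independently.
void parallel_merge2(const int* array1, long length1,
                     const int* array2, long length2,
                     int* output, int n_threads) {
    long length = length1 + length2;
    long n_full = (long)n_threads * MERGE_BUCKET_MULTIPLIER; // full-size intervals
    long interval = length / n_full;                         // elements per interval
    long n_buckets = n_full + 1;    // plus a final remainder interval (maybe empty)

    // Split points: start1[b] + start2[b] == b * interval for b = 0..n_full,
    // and the very last boundary is the end of both arrays.
    std::vector<long> start1(n_buckets + 1), start2(n_buckets + 1);
    start1[0] = 0;               start2[0] = 0;
    start1[n_buckets] = length1; start2[n_buckets] = length2;
    for (long b = 1; b <= n_full; ++b) {
        long k = b * interval;
        if (k >= length) {                  // happens when the remainder is 0
            start1[b] = length1; start2[b] = length2;
            continue;
        }
        long i, j; int which;
        kth_element(array1, length1, array2, length2, k, i, j, which);
        start1[b] = i; start2[b] = j;
    }

    // Each thread serially merges every n_threads-th interval; the intervals
    // are disjoint in both inputs and the output, so no locking is needed.
    auto worker = [&](int t) {
        for (long b = t; b < n_buckets; b += n_threads) {
            long i = start1[b], i_end = start1[b + 1];
            long j = start2[b], j_end = start2[b + 1];
            long out = i + j;               // this interval fills output[out..)
            while (i < i_end && j < j_end)
                output[out++] = (array1[i] <= array2[j]) ? array1[i++] : array2[j++];
            while (i < i_end) output[out++] = array1[i++];
            while (j < j_end) output[out++] = array2[j++];
        }
    };
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) threads.emplace_back(worker, t);
    for (std::thread& th : threads) th.join();
}
```

Because consecutive split points carve both inputs and the output into disjoint ranges, each interval can be merged without any locking.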
As mentioned in the comments, all of the non-overlapping intervals have exactly the same length, with the possible exception of the last interval. That one has length equal to length - (length / (n_threads*MERGE_BUCKET_MULTIPLIER)) * (n_threads*MERGE_BUCKET_MULTIPLIER). Let
$$ \begin{align} l &= \text{length} \\ t &= \text{n_threads} \\ B &= \text{ MERGE_BUCKET_MULTIPLIER } \\ \end{align} $$
then the length of the last interval is equal to
\[ l - tB \left\lfloor \frac{l}{tB} \right\rfloor \tag{5}\label{5} \].
Now let \(l = tB \cdot q + r \), where \( 0 \leq r < tB \), then equation (5) becomes
\[ (tB)q + r - tB( q + \left\lfloor \frac{r}{tB} \right\rfloor ) \]
which is exactly equal to \(r\), because \(\left\lfloor \frac{r}{tB} \right\rfloor = 0 \).
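For example (again with made-up numbers), take \(l = 103\), \(t = 4\), and \(B = 2\), so that \(tB = 8\). Then
\[ l - tB \left\lfloor \frac{l}{tB} \right\rfloor = 103 - 8 \left\lfloor \frac{103}{8} \right\rfloor = 103 - 8 \cdot 12 = 7, \]
so the eight full intervals each cover 12 output positions and the last interval covers the remaining 7.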
This means that the loads for the non-overlapping intervals are evenly balanced, except for the last one, which will have to merge fewer elements. Since each iteration of each for loop handles a different interval, we don’t have to worry about concurrency issues, and there is no need for semaphores! This allows this sorting function to achieve a high degree of parallelism. Indeed, in the real-world test case of sorting an array of 200 million elements, it completed this task in 4.2 seconds. This was on the exact same input with which the other algorithms were tested. Again, we used 8 cores and 16 threads with an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz processor. Nothing exceptionally fancy.
In all the tests, we only counted the number of milliseconds it took the sorting function itself to complete. We did not include the time it took to load the data into memory. Therefore, the comparisons between the functions were reasonably fair. To recap, we have
| Algorithm | Time | Speed Up |
|---|---|---|
| Parallel merge sort with “parallel merge 2” | 4.2 seconds | 4.8x |
| Parallel merge sort with “parallel merge” | 6.5 seconds | 3.1x |
| Serial merge sort | 20.0 seconds | |
| Serial quick sort | 25.4 seconds | |
| Parallel quick sort “plus” | 87.7 seconds | 9.2x |
| Serial quick sort “plus” | 811.4 seconds | |
You can see that parallel_quick_sort_plus actually achieved good parallelism, with a 9.2x speedup on 16 threads. Its poor overall performance is simply a result of the underlying serial function’s poor performance. Don’t use it in real-world applications!
That’s it for parallel sorting, at least for now. If you haven’t read them yet, check out the first two entries on this topic: Parallel Sorting in Theory and in Practice (I) and Parallel Sorting in Theory and in Practice (II).