Parallel Sorting in Theory and in Practice III

As promised, here’s the last of a three-part series of entries on sorting in parallel. If you haven’t yet, check out the first two here: Parallel Sorting in Theory and in Practice (I) and Parallel Sorting in Theory and in Practice (II). Here we present a parallel implementation of merge_sort which runs in O((n log n)/p) time, up to polylogarithmic additive terms, on p threads.
This implementation of parallel_merge_sort is identical to the one presented in part II, except for the parallel_merge function. In the original implementation, we had two parallel for loops: one iterated over i and calculated j and k, and the other iterated over j and calculated i and k. In the updated version of merge_sort, which we’ll call “merge_sort2”, we instead select several almost evenly spaced k’s, and then calculate both i and j for each of those k’s. In fact, no matter how large the input array, we only calculate i and j for a fixed number of k’s. These k’s are the endpoints of non-overlapping intervals, and the threads loop over the intervals so that each thread handles a different subset of them. This way we split the total work involved in merging into several independent sub-arrays. Since we do it for only a fixed number of k’s, no matter how large the input arrays are, we add at most a constant amount of extra work. The awesome part, though, is that calculating the i’s and the j’s takes only O(log²(n)) time per k, as we’ll prove below.
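To make the bucketing concrete, here is a small sketch of how those k’s can be chosen. The names n_threads and MERGE_BUCKET_MULTIPLIER are the ones used in parallel_merge2 later in this post; the helper choose_ks itself is purely illustrative.

```cpp
#include <vector>

// Sketch: cut the merged output of total size `length` into
// B = n_threads * MERGE_BUCKET_MULTIPLIER intervals of equal size s,
// plus one shorter remainder interval at the end.  The returned k's are
// the interval endpoints 0, s, 2s, ..., B*s, length.
std::vector<long> choose_ks(long length, long n_threads, long merge_bucket_multiplier) {
    const long B = n_threads * merge_bucket_multiplier;
    const long s = length / B;  // integer division: size of each full interval
    std::vector<long> ks;
    for (long m = 0; m <= B; ++m) ks.push_back(m * s);
    if (ks.back() != length) ks.push_back(length);  // remainder interval, if any
    return ks;
}
```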
The real magic of merge_sort2 comes from the function which calculates the i’s and the j’s. We call it kth_element. The name might seem a bit odd at first, but it does make sense. What it does is, given two sorted arrays, array1 and array2, of lengths length1 and length2, respectively, and a k, compute index1 and index2 (the number of elements of each array that come before the k-th element of the merged order), along with array_number, which identifies which array contains the k-th element itself. We’ve implemented it in C++ as a void function which uses the & operator to declare index1, index2, and array_number as reference parameters, so it changes their values by reference while the call looks like they’re passed as values. If that bothers you (and we can understand why it might), feel free to implement it as a function which returns a struct or class containing all those values. We certainly wouldn’t argue that we’ve implemented anything in the best possible style.
Here’s our implementation in sketch form. (Treat the listing below as illustrative of the approach rather than as the exact code we benchmarked; the labels (a)–(d) for the base cases, (s-1)/(s-2) for the search cases, and (r-1)/(r-2) for the recursive cases are the ones referred to in the discussion that follows.)
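```cpp
#include <algorithm>

// Smallest index j of arr such that every element before index j is strictly
// less than value -- i.e. the index of the first element >= value.
long least_upper_bound(const int* arr, int value, long length) {
    return std::lower_bound(arr, arr + length, value) - arr;
}

// Index of the first element of arr which is strictly greater than value.
long min_greater_than(const int* arr, int value, long length) {
    return std::upper_bound(arr, arr + length, value) - arr;
}

// Given two sorted arrays and a k, computes index1 and index2, with
// index1 + index2 == k, such that array1[0..index1) and array2[0..index2)
// are exactly the first k elements of the merged order (ties go to array1),
// and sets array_number to 1 or 2 according to which array contains the
// k-th element itself.
void kth_element(const int* array1, long length1,
                 const int* array2, long length2,
                 long k,
                 long& index1, long& index2, int& array_number) {
    // (a), (b): one array is empty, so the other already is the merge.
    if (length1 == 0) { index1 = 0; index2 = k; array_number = 2; return; }
    if (length2 == 0) { index1 = k; index2 = 0; array_number = 1; return; }
    // (c): k == 0 asks for the very first element; ties go to array1.
    if (k == 0) {
        index1 = 0; index2 = 0;
        array_number = (array1[0] <= array2[0]) ? 1 : 2;
        return;
    }
    // (d): both arrays have one element (so n1 == n2 == 0) and k == 1.
    if (length1 == 1 && length2 == 1) {
        if (array1[0] <= array2[0]) { index1 = 1; index2 = 0; array_number = 2; }
        else                        { index1 = 0; index2 = 1; array_number = 1; }
        return;
    }
    if (length1 >= length2) {
        // (s-1): probe the middle of the larger array, array1.
        long n1 = length1 / 2;
        long j = least_upper_bound(array2, array1[n1], length2);
        long pos = n1 + j;  // position of array1[n1] in the merged order
        if (pos == k) { index1 = n1; index2 = j; array_number = 1; return; }
        if (pos < k) {
            // (r-1): array1[0..n1] and array2[0..j) all precede the k-th
            // element; discard them and search what remains.
            kth_element(array1 + n1 + 1, length1 - n1 - 1,
                        array2 + j, length2 - j,
                        k - (n1 + 1) - j, index1, index2, array_number);
            index1 += n1 + 1;
            index2 += j;
        } else {
            // (r-2): everything from array1[n1] and array2[j] onward comes
            // after the k-th element; discard the tails.
            kth_element(array1, n1, array2, j, k,
                        index1, index2, array_number);
        }
    } else {
        // (s-2): symmetric, probing the middle of array2.  Because array1
        // wins ties, here we need min_greater_than instead.
        long n2 = length2 / 2;
        long i = min_greater_than(array1, array2[n2], length1);
        long pos = n2 + i;  // position of array2[n2] in the merged order
        if (pos == k) { index1 = i; index2 = n2; array_number = 2; return; }
        if (pos < k) {
            kth_element(array1 + i, length1 - i,
                        array2 + n2 + 1, length2 - n2 - 1,
                        k - i - (n2 + 1), index1, index2, array_number);
            index1 += i;
            index2 += n2 + 1;
        } else {
            kth_element(array1, i, array2, n2, k,
                        index1, index2, array_number);
        }
    }
}
```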
The function is a bit complicated, so we’ve labeled its subsections with comments in the code. The first four cases are the base cases, and hopefully they all make sense. The first two just handle what happens if one of the arrays has no elements in it. That’s incredibly easy, because the other array is already sorted. Base case (d) is subtler: if both arrays have exactly one element, then n1 and n2 are both equal to 0, and, depending on how the recursive cases are written, you can end up in case (r-2) with every input to the recursive call to kth_element the same as the parent function’s inputs. That’s an infinite loop. Thus you need to include base case (d).
Now we’re into the search cases. There are two of them, and they’re designed to guarantee that we eliminate at least n1 or at least n2 elements from consideration on every call. We need two different searches for the same reason we had to use two different functions in the original parallel_merge given in the previous entry (Parallel Sorting in Theory and in Practice (II)): in imposing an ordering on the two arrays, we have to have a consistent way to handle ties between them. For example, if array1[i] is equal to array2[j], we have to be consistent in deciding which comes before the other. We have arbitrarily chosen elements of array1 to come before elements of array2. Therefore, an element of array1 comes after every element of array2 which is strictly less than it. Here’s an example we hope is helpful.
Suppose array1 = [1,1,2,3,5,8,13] and array2 = [1,2,3,4,5,6,7]. Then length1 = length2 and they’re both equal to 7, so we’re in search case (s-1). This means that n1 = 3, and therefore array1[n1] is equal to 3. There’s also a 3 in array2! Since we’ve decided that elements of array1 come before elements of array2, that 3 (in array1) comes after the first two elements of array2, the sub-array [1,2], but before the 3 in array2. This sub-array has length 2, which is exactly what least_upper_bound(array2, array1[n1], length2) returns. This is because least_upper_bound(array2, n, length2) is designed to return the smallest index j of array2 such that every smaller index holds an element which is strictly less than n. In this way, n is a least (strictly) upper bound of the sub-array with indices less than j.
On the other hand, suppose that array2 = [0,0,1,2,3,4,5,6], but that array1 is exactly as it was in the previous example. Then we’re no longer in search case (s-1); we’re now in search case (s-2). Now n2 is equal to 4, so we’re going to have to find all of the elements of array1 which come before array2[n2], which is equal to 3. Once again, 3 is an element of both arrays. We want to find all of the elements of array1 which come before this 3 in array2. We’ve decided that any 3 in array1 should come before this 3, so we want the index of the first element of array1 which is strictly greater than 3. That is exactly what min_greater_than(array1, array2[n2], length1) returns. In this case that will be 4, which is the index of the 5 in array1. It is also the length of the initial sub-array [1,1,2,3] of array1.
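For the skeptical, here’s a quick, self-contained check of both examples. The helper implementations are the std::lower_bound/std::upper_bound one-liners from the sketch above, and the expected values are the ones worked out in the text.

```cpp
#include <algorithm>
#include <cassert>

long least_upper_bound(const int* arr, int value, long length) {
    return std::lower_bound(arr, arr + length, value) - arr;
}
long min_greater_than(const int* arr, int value, long length) {
    return std::upper_bound(arr, arr + length, value) - arr;
}

int main() {
    int array1[] = {1, 1, 2, 3, 5, 8, 13};
    // First example: equal lengths, search case (s-1), pivot array1[3] == 3.
    int array2a[] = {1, 2, 3, 4, 5, 6, 7};
    assert(least_upper_bound(array2a, array1[3], 7) == 2);  // [1,2] precede the pivot
    // Second example: array2 is longer, search case (s-2), pivot array2[4] == 3.
    int array2b[] = {0, 0, 1, 2, 3, 4, 5, 6};
    assert(min_greater_than(array1, array2b[4], 7) == 4);   // [1,1,2,3] precede it
    return 0;
}
```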
Now we return to considering how many elements we’ve eliminated going into the recursive step. Notice that either n1 = length1 / 2 or n2 = length2 / 2, depending on which length is larger. If that length is an even number, then half of it is exactly half. Otherwise, if that length is odd, then half of it in C++ (integer division) is equal to (length - 1) / 2. In case (r-2), the sub-array fed into the recursive call to kth_element will have length n1 (or n2), i.e. at most half of the larger array; in case (r-1) it will have length length1 - n1 - 1 (or length2 - n2 - 1), which is no bigger.
For example, if we call kth_element with two arrays of lengths length1 and length2, where length1 ≥ length2, then the recursive call to kth_element will receive two arrays in which the length of array1 is at most length1 / 2, while array2 can only shrink. The claim that makes everything work is this: the combined size of the arrays passed to the recursive call made by kth_element will be at most three quarters of the combined size of its own inputs.
Proof
Suppose we’re in case (s-1), so length1 ≥ length2 and n1 = length1 / 2. Whichever recursive case fires, the first array shrinks to at most n1 ≤ length1 / 2 elements, and the second array does not grow. We can see that the combined size of the input arrays to the recursive call is therefore at most length1 / 2 + length2. If we let n = length1 + length2, this bound equals n / 2 + length2 / 2. Whereas length1 ≥ length2 gives length2 ≤ n / 2, so the whole thing is at most n / 2 + n / 4 = 3n / 4. Checking the cases where length1 is even and where it is odd shows that integer division can only make n1 smaller, so the bound survives rounding; case (s-2) is the mirror image with the roles of the two arrays swapped.
QED
Using this result we can show that, for large enough inputs, the run time of kth_element obeys the recurrence T(n) ≤ T(3n/4) + c·log(n): each call does a constant amount of work plus a single binary search (least_upper_bound or min_greater_than) before recursing on inputs of combined size at most 3n/4. If we let f(n) = c·log(n) and apply this recurrence to the “Almost the Master” Theorem from More on Sorting and Analyzing Run Time Complexity, then we see that the run time of kth_element is O(log²(n)).
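Unrolling the recurrence makes the bound transparent. Here is a quick sketch, with c the constant hidden in the binary-search cost:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Each level of the recursion does one binary search over arrays of combined
size at most $(3/4)^{i} n$, and there are at most $\log_{4/3} n$ levels, so
\begin{align*}
T(n) \;\le\; \sum_{i=0}^{\lceil \log_{4/3} n \rceil}
        c \log\!\Big( \big(\tfrac{3}{4}\big)^{i} n \Big)
     \;\le\; \big(\log_{4/3} n + 2\big)\, c \log n
     \;=\; O\!\left(\log^{2} n\right).
\end{align*}
\end{document}
```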
Now we’re ready to see how we use kth_element to make parallel_merge2 work. Here’s our implementation in sketch form. (Again, treat the listing as illustrative rather than as the exact code we benchmarked: the interval layout and the n_threads*MERGE_BUCKET_MULTIPLIER bucket count are as described, while details such as the actual value of MERGE_BUCKET_MULTIPLIER are stand-ins.)
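```cpp
#include <algorithm>
#include <thread>
#include <vector>

// kth_element exactly as sketched earlier in this post.
void kth_element(const int* array1, long length1,
                 const int* array2, long length2,
                 long k, long& index1, long& index2, int& array_number);

const long MERGE_BUCKET_MULTIPLIER = 4;  // illustrative value only

// Merges sorted ranges array1[0..length1) and array2[0..length2) into output.
// The merged output is cut into n_threads*MERGE_BUCKET_MULTIPLIER equal
// intervals plus a shorter remainder interval; thread t handles intervals
// t, t + n_threads, t + 2*n_threads, ...  The intervals are disjoint, so the
// threads share nothing and no semaphores are needed.
void parallel_merge2(const int* array1, long length1,
                     const int* array2, long length2,
                     int* output, long n_threads) {
    const long length = length1 + length2;
    const long n_buckets = n_threads * MERGE_BUCKET_MULTIPLIER;
    const long s = length / n_buckets;  // size of each full interval

    // Split points in array1 and array2 for the k's 0, s, 2s, ..., B*s,
    // length; each is found by kth_element in polylogarithmic time.
    std::vector<long> i_at(n_buckets + 2), j_at(n_buckets + 2);
    int which;  // we only need the split points here
    for (long m = 0; m <= n_buckets; ++m)
        kth_element(array1, length1, array2, length2, m * s,
                    i_at[m], j_at[m], which);
    i_at[n_buckets + 1] = length1;  // final endpoint, k == length
    j_at[n_buckets + 1] = length2;

    std::vector<std::thread> threads;
    for (long t = 0; t < n_threads; ++t) {
        threads.emplace_back([&, t] {
            // Serially merge this thread's share of the intervals; std::merge
            // is stable and takes from array1 on ties, matching our convention.
            for (long m = t; m <= n_buckets; m += n_threads)
                std::merge(array1 + i_at[m], array1 + i_at[m + 1],
                           array2 + j_at[m], array2 + j_at[m + 1],
                           output + i_at[m] + j_at[m]);
        });
    }
    for (std::thread& th : threads) th.join();
}
```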
As mentioned in the comments, all of the non-overlapping intervals have exactly the same length, with the possible exception of the last interval. That one has length equal to length - (length / (n_threads*MERGE_BUCKET_MULTIPLIER)) * (n_threads*MERGE_BUCKET_MULTIPLIER), where the division is C++ integer division. Let B = n_threads*MERGE_BUCKET_MULTIPLIER and s = length / B; then the length of the last interval is equal to length - s*B. Now let length = q*B + r with 0 ≤ r < B; then s = q, so length - s*B = r, which is exactly equal to length mod B, and in particular is less than B.
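In symbols, here is a small sanity check. (The B = 64 below is an arbitrary illustrative value, since the actual bucket count depends on the value chosen for MERGE_BUCKET_MULTIPLIER.)

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
With $B = \texttt{n\_threads} \cdot \texttt{MERGE\_BUCKET\_MULTIPLIER}$ and
$s = \lfloor \texttt{length} / B \rfloor$, the last interval has length
\begin{align*}
\texttt{length} - sB \;=\; \texttt{length} \bmod B \;<\; B .
\end{align*}
For example, with $\texttt{length} = 2 \times 10^{8}$ and $B = 64$, every full
interval holds $s = 3{,}125{,}000$ elements, while the last interval holds
$\texttt{length} \bmod B = 0$ of them.
\end{document}
```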
This means that the loads for the non-overlapping intervals are evenly balanced, except for the last one, which has to merge fewer elements. Since each iteration of each for loop handles a different interval, we don’t have to worry about concurrency issues, and there is no need for semaphores! This lets the sorting function achieve a high degree of parallelism. Indeed, in the real-world test of sorting an array of 200 million elements, it completed the task in 4.2 seconds, on the exact same input with which the other algorithms were tested. Again, we used 8 cores and 16 threads on an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz. Nothing exceptionally fancy.
In all the tests, we only counted the number of milliseconds it took to complete the sorting function; we did not include the time it took to load the data into memory. The comparisons between the functions were therefore reasonably fair. To recap, we have
| Algorithm | Time | Speed Up |
|---|---|---|
| Parallel merge sort with “parallel merge 2” | 4.2 seconds | 4.8x |
| Parallel merge sort with “parallel merge” | 6.5 seconds | 3.1x |
| Serial merge sort | 20.0 seconds | |
| Serial quick sort | 25.4 seconds | |
| Parallel quick sort “plus” | 87.7 seconds | 9.2x |
| Serial quick sort “plus” | 811.4 seconds | |
You can see that parallel_quick_sort_plus actually achieved good parallelism, with a 9.2x speedup on 16 threads. Its poor overall performance is simply a result of the underlying serial function’s poor performance. Don’t use it in real-world applications!
That’s it for parallel sorting, at least for now. If you haven’t read them yet, check out the first two entries on this topic: Parallel Sorting in Theory and in Practice (I) and Parallel Sorting in Theory and in Practice (II).