Parallel Sorting in Theory and in Practice II
As promised, this is the second entry on parallel sorting. In this entry, we’ll implement merge_sort, and then give two different ways to make it run in parallel. The first one will be a bit simpler than the second one. Without further ado, here’s our implementation of merge_sort in C++.
merge_sort()
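The original listing isn’t reproduced here, so what follows is a minimal sketch reconstructed from the walkthrough in this post. The names array, array_copy, h, and length come from the post itself; the exact split point h = length / 2 and the function signatures are our assumptions. The serial_merge helper is included so the sketch compiles; it’s discussed in detail further down.

```cpp
#include <cassert>
#include <vector>

// Merge the two sorted runs l_array_copy and r_array_copy into array.
// Ties go to the left run, which keeps the sort stable.
void serial_merge(int *array, const int *l_array_copy, long l_length,
                  const int *r_array_copy, long r_length) {
    long i = 0, j = 0, k = 0;
    while (i < l_length && j < r_length) {
        if (l_array_copy[i] <= r_array_copy[j])
            array[k++] = l_array_copy[i++];
        else
            array[k++] = r_array_copy[j++];
    }
    while (i < l_length) array[k++] = l_array_copy[i++];  // left leftovers
    while (j < r_length) array[k++] = r_array_copy[j++];  // right leftovers
}

void merge_sort(int *array, long length) {
    if (length < 2) return;                           // already sorted
    long h = length / 2;                              // split point (assumed)
    std::vector<int> array_copy(array, array + length);
    merge_sort(array_copy.data(), h);                 // sort the left half
    merge_sort(array_copy.data() + h, length - h);    // sort the right half
    serial_merge(array, array_copy.data(), h,         // merge back into array
                 array_copy.data() + h, length - h);
}
```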
This is a great introduction to recursion. It starts by copying the original array to array_copy, then recursively sends the left half of it to merge_sort, as well as the right half. So when those two calls finish, the left half of array_copy – everything from 0 to h-1 – is sorted among themselves, and likewise, everything from h to length-1 is sorted among themselves. For example, if our input was array = {2,0,1,4,3,4,3,2,1,0}, then after the two recursive calls to merge_sort we would have array_copy = {0,1,2,3,4,0,1,2,3,4}, because the two halves have been sorted.
The function is finished when serial_merge has finished. Here’s our implementation of that function.
serial_merge()
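Again the original listing isn’t included here; this is a sketch consistent with the walkthrough below, with the names l_array_copy, r_array_copy, i, j, and k taken from the post and the signature assumed.

```cpp
#include <cassert>
#include <vector>

// Merge two sorted runs into array. Ties go to l_array_copy.
void serial_merge(int *array, const int *l_array_copy, long l_length,
                  const int *r_array_copy, long r_length) {
    long i = 0, j = 0, k = 0;
    // First loop: pick the smaller front element of the two runs.
    while (i < l_length && j < r_length) {
        if (l_array_copy[i] <= r_array_copy[j])
            array[k++] = l_array_copy[i++];
        else
            array[k++] = r_array_copy[j++];
    }
    // Second loop: copy whatever remains of the left run.
    while (i < l_length) array[k++] = l_array_copy[i++];
    // Final loop: copy whatever remains of the right run.
    while (j < r_length) array[k++] = r_array_copy[j++];
}
```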
You’ll notice three while loops. In the first one we step through l_array_copy with index i and r_array_copy with index j. The first while loop exits when either i or j gets to the end of its respective array. What’s going on is that we’re comparing what’s at the \(i^{th}\) position of l_array_copy with what’s at the \(j^{th}\) position of r_array_copy, and whichever is smaller – with ties going to what’s in l_array_copy – gets copied over to array. That’s the heart of the entire algorithm. The second while loop just takes care of copying whatever is left over in l_array_copy if j got to the end of r_array_copy before i got to the end of l_array_copy. Similarly, the final while loop takes care of copying over the remainder of r_array_copy if i got to the end of l_array_copy first.
Notice that at every step of every while loop, either i or j (and definitely k) is being increased. Thus all three while loops together take at most l_length + r_length steps to complete. We see from merge_sort that l_length = h and r_length = length - h, so their total is exactly equal to length.
Therefore, if \(T(n)\) represents the number of steps it takes merge_sort to finish on an input array of length \(n\), then we see that
\[ T(n) = cn + 2T(\left\lceil \frac{n}{2} \right\rceil) \tag{1}\label{1} \]
By our “Almost the Master” Theorem (from More on Sorting and Analyzing Run Time Complexity), we see that \(T(n) \in O(n\log_2(n)) \).
merge_sort is an excellent algorithm. You should use it whenever you have the available memory. That’s the main drawback with using it: it requires us to make a copy of our original input. If your memory budget is tight, this might be a deal breaker. It has been for me at times.
Now we’re going to dive into the much deeper waters of trying to make this thing run efficiently in parallel – something which at first probably seems impossible. It certainly did to us. Before we get going, though, we’ll need some useful functions. The first two are least_upper_bound and min_greater_than. Here’s our implementation of them in C++.
least_upper_bound() and min_greater_than()
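Since the original listing didn’t survive here, this is a sketch of the pair as the surrounding text describes them: binary searches over a sorted array, identical except that one compares with \( \leq \) and the other with \( < \). The signatures are assumed; they behave like std::lower_bound and std::upper_bound, but return indices.

```cpp
#include <cassert>

// Smallest index idx with value <= array[idx], i.e. the position of the
// earliest element of the sorted array that is >= value; length if none.
long least_upper_bound(const int *array, long length, int value) {
    long low = 0, high = length;          // the answer lies in [low, high]
    while (low < high) {
        long mid = low + (high - low) / 2;
        if (value <= array[mid]) high = mid;  // mid could be the answer
        else low = mid + 1;                   // answer is to the right of mid
    }
    return low;
}

// Identical except for < : the earliest element strictly greater than value.
long min_greater_than(const int *array, long length, int value) {
    long low = 0, high = length;
    while (low < high) {
        long mid = low + (high - low) / 2;
        if (value < array[mid]) high = mid;
        else low = mid + 1;
    }
    return low;
}
```

Each iteration halves the remaining range, which is where the \(O(\log_2(n))\) bound below comes from.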
The comments in the code explain exactly what these two functions do. They’re very similar; the only difference is that the first involves \( \leq\), while the second involves \( < \). If \(T(n) \) represents the number of steps it takes each one to finish with an input array of length \(n\), then it’s easily seen that \(T(n) = cn^{0} + T(\left\lceil \frac{n}{2}\right\rceil) \), which means that \(T(n) \in O( \log_2(n)) \) (again, using the “Almost the Master” Theorem).
In order to allow us to merge two arrays in parallel, we’re going to handle the indices i, j, and k from serial_merge differently. We’re going to step through the i index and the j index in completely separate for loops, using the value of one to compute the value of the other. What we mean is that we will have a for (i = 0; i < h; i++) loop, and in that loop we’ll calculate j and k as a function of i. Then we’ll have a completely separate for (j = h; j < length; j++) loop, and in that loop we’ll calculate i and k as functions of j. Calculating j as a function of i will take \(O(\log_2(\text{length} - \text{h})) \) steps, since it searches the right half, and calculating i as a function of j will take \(O(\log_2(\text{h}))\) steps, since it searches the left half.
Here is our implementation.
parallel_merge()
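With the original listing missing, here is a sketch of the merge as the post describes it: two independent for loops, each computing every output position directly. The OpenMP pragmas are one hypothetical way to run the loops in parallel (the post’s actual threading machinery isn’t shown here); a compiler without OpenMP simply ignores them, so the code also runs serially. The two binary searches from above are repeated so the sketch is self-contained.

```cpp
#include <cassert>
#include <vector>

long least_upper_bound(const int *array, long length, int value) {
    long low = 0, high = length;
    while (low < high) {
        long mid = low + (high - low) / 2;
        if (value <= array[mid]) high = mid; else low = mid + 1;
    }
    return low;
}

long min_greater_than(const int *array, long length, int value) {
    long low = 0, high = length;
    while (low < high) {
        long mid = low + (high - low) / 2;
        if (value < array[mid]) high = mid; else low = mid + 1;
    }
    return low;
}

// Merge the sorted halves array_copy[0..h) and array_copy[h..length) into
// array. Every iteration writes to a distinct position of array, so all
// iterations of both loops are independent of one another.
void parallel_merge(int *array, const int *array_copy, long h, long length) {
    // Left half: element i lands after the i left elements before it and
    // after every right element strictly less than it (ties go left).
    #pragma omp parallel for
    for (long i = 0; i < h; i++) {
        long j = least_upper_bound(array_copy + h, length - h, array_copy[i]);
        array[i + j] = array_copy[i];
    }
    // Right half: element j lands after the (j - h) right elements before it
    // and after every left element that is <= it.
    #pragma omp parallel for
    for (long j = h; j < length; j++) {
        long i = min_greater_than(array_copy, h, array_copy[j]);
        array[(j - h) + i] = array_copy[j];
    }
}
```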
Since \(\text{length} - \text{h} \) and \(\text{h}\) are both \( \leq \left\lceil \frac{\text{length}}{2}\right\rceil\), we can see that \(T(n)\), the number of steps it takes parallel_merge to finish on an input of size \(n = \text{length}\), is in \(O(n \log_2(n))\). We have to use the two different functions, least_upper_bound and min_greater_than, because we have to let ties in our merging function consistently go to one half of array_copy. We have arbitrarily chosen the left half of the array to win on ties. Therefore, for a given i the j index should point to the earliest element of the right half of array_copy which is greater than or equal to array_copy[i]. Whereas, when we’re calculating i for a given j, we have to point to the earliest element which is strictly greater than array_copy[j].
Now we’re ready to implement parallel_merge_sort. Here’s our implementation.
parallel_merge_sort()
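The original listing is absent, so here is one way the recursion could look: the two halves of array_copy are sorted on separate threads, then merged with parallel_merge. The levels parameter, which caps how deep we keep spawning threads, is our own illustrative choice and not from the post, and parallel_merge is shown serially here for brevity; its loops can themselves be parallelized as described above.

```cpp
#include <cassert>
#include <thread>
#include <vector>

long least_upper_bound(const int *array, long length, int value) {
    long low = 0, high = length;
    while (low < high) {
        long mid = low + (high - low) / 2;
        if (value <= array[mid]) high = mid; else low = mid + 1;
    }
    return low;
}

long min_greater_than(const int *array, long length, int value) {
    long low = 0, high = length;
    while (low < high) {
        long mid = low + (high - low) / 2;
        if (value < array[mid]) high = mid; else low = mid + 1;
    }
    return low;
}

// Merge sorted halves array_copy[0..h) and array_copy[h..length) into array.
void parallel_merge(int *array, const int *array_copy, long h, long length) {
    for (long i = 0; i < h; i++) {
        long j = least_upper_bound(array_copy + h, length - h, array_copy[i]);
        array[i + j] = array_copy[i];
    }
    for (long j = h; j < length; j++) {
        long i = min_greater_than(array_copy, h, array_copy[j]);
        array[(j - h) + i] = array_copy[j];
    }
}

// levels caps thread creation: 2^levels threads at most (illustrative choice).
void parallel_merge_sort(int *array, long length, int levels) {
    if (length < 2) return;
    long h = length / 2;
    std::vector<int> array_copy(array, array + length);
    if (levels > 0) {
        // Sort the two halves of array_copy concurrently; they occupy
        // disjoint memory, so no synchronization is needed until join().
        std::thread left(parallel_merge_sort, array_copy.data(), h, levels - 1);
        parallel_merge_sort(array_copy.data() + h, length - h, levels - 1);
        left.join();
    } else {
        parallel_merge_sort(array_copy.data(), h, 0);
        parallel_merge_sort(array_copy.data() + h, length - h, 0);
    }
    parallel_merge(array, array_copy.data(), h, length);
}
```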
Because the parallel_merge function we’ve provided above takes \(O(n\log(n))\) steps to complete, \(T(n)\) for the entire function is equal to \( cn\log_2(n) + 2 T( \left\lceil \frac{n}{2}\right\rceil) \), which means that by case #4 of the “Almost the Master” Theorem – provided at the very end of Parallel Sorting in Theory and in Practice (I) – \(T(n) \in O(n \log_2(n)^{2}) \). This means that asymptotically this will perform worse than even parallel_quick_sort_plus, even though it might require a very large input before it’s actually slower. For example, on an input array of 200 million elements, it runs in 6.5 seconds with 8 cores and 16 threads. This is the same Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz processor, and in fact, the same test input, so it’s fair to compare the times. That’s 13.5 times faster than parallel_quick_sort_plus!
Unfortunately, we won’t get to the even faster version of parallel_merge_sort in this entry. It will have to wait for the next one.
That’s all for this entry!