Parallel Sorting in Theory and in Practice III

As promised, here is the last of a three-part series of entries on sorting in parallel. If you haven’t yet, check out the first two: Parallel Sorting in Theory and in Practice (I) and Parallel Sorting in Theory and in Practice (II). Here we present a parallel implementation of merge_sort which runs in \(O(n \log_2(n))\) time and achieves fairly good parallelism.

This implementation of parallel_merge_sort is identical to the one presented in part II, except for the parallel_merge function. In the original implementation, we had two parallel for loops: one which iterated over i and calculated j and k, and another which iterated over j and calculated i and k. In the updated version of merge_sort, which we’ll call “merge_sort2”, we select several almost evenly-spaced k’s and then calculate both i and j for each of those k’s. In fact, no matter how large the input array, we only calculate i and j for a fixed number of k’s. These are the endpoints of non-overlapping intervals, and the threads loop over those intervals so that each thread handles a different subset of them. This way we split the total work involved in merging into several independent sub-arrays. Since we do it for only a fixed number of k’s, no matter how large the input arrays are, we add at most a constant number of these extra calculations. The awesome part, though, is that calculating the i’s and the j’s takes only \(O(\log_2(n))\) steps, so we aren’t even adding that much extra work.
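To keep the overall shape in mind before we dive into the merge, here’s a minimal sketch of the recursive driver as described in part II. The task-splitting details and the serial_merge_sort fall-back are assumptions made for illustration (this is not the benchmarked code), and parallel_merge2 is the new merge function presented below.

merge_sort2 (sketch)

#include <cstdint>

// assumed available: the serial sort from part I and the new merge below
void serial_merge_sort(int64_t* array, int64_t* array_copy, int64_t length);
void parallel_merge2(int64_t* array, int64_t* array_copy,
  int64_t length, int16_t n_threads);

// A minimal sketch of the recursive driver, following the structure
// described in part II. Assumes array_copy holds a copy of array's
// contents on entry; the task-splitting details are illustrative.
void merge_sort2(int64_t* array, int64_t* array_copy,
  int64_t length, int16_t n_threads)
{
  if (length < 2)
  {
    return;
  }
  if (n_threads < 2)
  {
    // fall back to the serial merge sort from part I (assumed available)
    serial_merge_sort(array, array_copy, length);
    return;
  }
  int64_t h = length / 2;
  // sort the two halves of array_copy in parallel,
  // using the corresponding halves of array as scratch space
  #pragma omp parallel sections
  {
    #pragma omp section
    merge_sort2(array_copy, array, h, n_threads / 2);
    #pragma omp section
    merge_sort2(&array_copy[h], &array[h], length - h,
      n_threads - n_threads / 2);
  }
  // then merge the two sorted halves of array_copy back into array
  parallel_merge2(array, array_copy, length, n_threads);
}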

The real magic of merge_sort2 comes from the function which calculates the i’s and the j’s. We call it kth_element. The name might seem a bit odd at first, but it does make sense. Given two sorted arrays, array1 and array2 of lengths length1 and length2, respectively, and an index k, it computes the position of the \(k^{th}\) element (counting from zero) of their merged ordering. It returns a value, array_number, which identifies which array contains the \(k^{th}\) element, and it sets index1 or index2 to that element’s position within array1 or array2, respectively.

We’ve implemented it in C++ as a void function which uses the & operator to take index1, index2, and array_number by reference, so that at the call site they look like they’re passed as values. If that bothers you (and we can understand why it might), feel free to implement it as a function which returns a struct or class containing all those values. We certainly wouldn’t argue that we’ve implemented anything in the best possible style.
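If you do prefer that style, a struct-returning variant (purely hypothetical; not the code we benchmarked) could look like this:

#include <cstdint>

// Hypothetical alternative interface: bundle the three outputs
// in a struct instead of using reference parameters.
struct kth_result
{
  int64_t index1;
  int64_t index2;
  int16_t array_number;
};

kth_result kth_element_struct(int64_t* array1, int64_t* array2, int64_t k,
  int64_t length1, int64_t length2);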

Here’s our implementation.

kth_element()

#include <cstdint>

// least_upper_bound and min_greater_than are the binary searches from
// part II (illustrative sketches of both appear later in this post)
void kth_element(int64_t* array1, int64_t* array2, int64_t k,
  int64_t length1, int64_t length2,
  int64_t& index1, int64_t& index2, int16_t& array_number)
{
  // base case (a)
  if (length2 < 1)
  {
    index2 = 0;
    index1 = k;
    array_number = 0;
    return;
  }
  // base case (b)
  if (length1 < 1)
  {
    index1 = 0;
    index2 = k;
    array_number = 1;
    return;
  }
  // base case (c)
  if (k == 0)
  {
    index1 = 0;
    index2 = 0;
    if (array1[0] <= array2[0])
    {
      array_number = 0;
    }
    else
    {
      array_number = 1;
    }
    return;
  }
  // base case (d)
  // we know k >= 1
  if ((length1 == 1) && (length2 == 1))
  {
    // k == 1 so find the larger element //
    index1 = 0;
    index2 = 0;
    if (array1[0] <= array2[0])
    {
      array_number = 1;
    }
    else
    {
      array_number = 0;
    }
    return;
  }
  // non-trivial search cases
  int64_t n1, n2;

  // case (s-1)
  if (length1 >= length2)
  {
    n1 = length1 / 2;
    n2 = least_upper_bound(array2, array1[n1], length2);
  }
  // case (s-2)
  else
  {
    n2 = length2 / 2;
    n1 = min_greater_than(array1, array2[n2], length1);
  }
  // recursive cases
  // case (r-1)
  if (k < n1 + n2)
  {
    kth_element(array1, array2, k, n1, n2, index1, index2, array_number);
    return;
  }
  // case (r-2)
  kth_element(&array1[n1], &array2[n2], k - (n1+n2),
    length1 - n1, length2 - n2, index1, index2, array_number);
  index1 += n1;
  index2 += n2;
}
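As a quick illustration, here’s a hypothetical test harness (not part of the sorting code) that calls kth_element on the example arrays we use below. With these inputs, the element at position \(k = 7\) of the merged ordering is the 4 in array2, at index 3.

#include <cstdint>
#include <cstdio>

// declaration of the function defined above; least_upper_bound and
// min_greater_than from part II must be linked in as well
void kth_element(int64_t* array1, int64_t* array2, int64_t k,
  int64_t length1, int64_t length2,
  int64_t& index1, int64_t& index2, int16_t& array_number);

int main()
{
  int64_t array1[] = {1, 1, 2, 3, 5, 8, 13};
  int64_t array2[] = {1, 2, 3, 4, 5, 6, 7};
  int64_t index1 = 0, index2 = 0;
  int16_t array_number = 0;

  // position of the element at k = 7 (counting from zero) in the merged order
  kth_element(array1, array2, 7, 7, 7, index1, index2, array_number);

  if (array_number == 0)
    printf("k = 7 is array1[%lld]\n", (long long)index1);
  else
    printf("k = 7 is array2[%lld]\n", (long long)index2);

  return 0;
}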

The function is a bit complicated, so we’ve labeled its subsections with comments in the code. The first four cases are the base cases. Hopefully they all make sense. The first two just handle what happens if one of the arrays has no elements in it. That’s incredibly easy, because the other array is already ordered. The \(k^{th}\) element is just the element in the \(k^{th}\) position of that array. Easy. The third case is where \(k = 0\), so you just have to compare the two smallest elements of the two arrays. Easy. The fourth case, (d), got us in trouble when we forgot to include it. It’s possible for both arrays to have length \(1\), and if \(k\) is also equal to \(1\), then the recursive cases might not get any smaller. That’s because if array1[0] <= array2[0], you end up with n1 and n2 both equal to 0, which puts you in case (r-2). This gets you into infinite recursion, because if you look at what happens to all of the values of the inputs to the recursive call to kth_element, they’re all the same as the parent function’s inputs. For example, with array1 = [3], array2 = [5], and k = 1, every level of the recursion would see exactly those same inputs. Thus you need to include base case (d).

Now we’re into the search cases. There are two of them, designed to guarantee that we eliminate at least \(\left\lfloor \frac{\max(\text{length1}, \text{length2})}{2} \right\rfloor\) elements at each recursive step. We’ll come back to that in a little bit. For now, you might be wondering why we have to use two different functions to calculate n1 and n2. It’s the same reason we had to use different functions in the original parallel_merge given in the previous entry (Parallel Sorting in Theory and in Practice (II)): to impose a total ordering on the elements of the two arrays, we have to have a way to handle ties between them. For example, if array1[i] is equal to array2[j], we have to be consistent in deciding which comes before the other. We have arbitrarily chosen elements of array1 to come before equal elements of array2. Therefore, an element of array1 comes after exactly those elements of array2 which are strictly less than it. Here’s an example we hope is helpful.

Suppose array1 = [1,1,2,3,5,8,13] and array2 = [1,2,3,4,5,6,7]. Then length1 = length2 = 7, so we’re in search case (s-1). This means that n1 = 3, so array1[n1] is equal to 3. There’s also a 3 in array2! Since we’ve decided that elements of array1 come before equal elements of array2, that 3 (in array1) comes after the first two elements of array2, the sub-array [1,2], but before the 3 in array2. That sub-array has length 2, which is exactly what least_upper_bound(array2, array1[n1], length2) returns. This is because least_upper_bound(array2, n, length2) is designed to return the index j of the first element of array2 which is greater than or equal to n — equivalently, the number of elements of array2 which are strictly less than n. In this way, n is a (strict) upper bound of the sub-array with indices less than j, and j is the largest index for which that holds.

On the other hand, suppose that array2 = [0,1,2,3,4,5,6,7], but that array1 is exactly as it was in the previous example. Then we’re no longer in search case (s-1); we’re now in search case (s-2). Now n2 is equal to 4, so we’re going to have to find all of the elements of array1 which come before array2[n2], which is equal to 4. We’ve decided that ties go to array1, so any 4 in array1 would come before this 4; therefore we want the index of the first element of array1 which is strictly greater than 4. That is exactly what min_greater_than(array1, array2[n2], length1) returns. In this case that will be 4, which is the index of the 5 in array1. It is also the length of the initial sub-array [1,1,2,3] of array1.
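Since least_upper_bound and min_greater_than were defined back in part II, here are minimal binary-search sketches consistent with the semantics just described. Treat them as illustrative reconstructions, not the original code.

#include <cstdint>

// Sketch: index of the first element of array greater than or equal to
// value (equivalently, the number of elements strictly less than value)
int64_t least_upper_bound(int64_t* array, int64_t value, int64_t length)
{
  int64_t low = 0, high = length; // search window is [low, high)
  while (low < high)
  {
    int64_t mid = low + (high - low) / 2;
    if (array[mid] < value)
      low = mid + 1;
    else
      high = mid;
  }
  return low;
}

// Sketch: index of the first element of array strictly greater than value
int64_t min_greater_than(int64_t* array, int64_t value, int64_t length)
{
  int64_t low = 0, high = length;
  while (low < high)
  {
    int64_t mid = low + (high - low) / 2;
    if (array[mid] <= value)
      low = mid + 1;
    else
      high = mid;
  }
  return low;
}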

Now we return to considering how many elements we’ve eliminated heading into the recursive step. Notice that either n1 = length1 / 2 or n2 = length2 / 2, depending on which array is longer. If that length is even, then integer division gives exactly half. If that length is odd, then half of it in C++ is equal to \( \left\lfloor\frac{\text{length}}{2}\right\rfloor \). This means that if we end up in recursive case (r-2), then the longer array’s sub-array fed into the recursive call to kth_element will have length \(\left\lceil \frac{\text{length}}{2} \right\rceil\). The other array can have any length \(\leq \) its original length fed into the function at the top.

For example, if we call kth_element with two arrays of lengths length1 and length2, where length1 \(\geq\) length2, then in the recursive call to kth_element the new length of array1 will be \(\leq \left\lceil \frac{ \text{length1} }{2} \right\rceil \). What this means is that if we let \(n = \text{length1} + \text{length2} \), then the combined size of the inputs to the recursive call to kth_element will be \( \leq \left\lceil \frac{3n}{4} \right\rceil + 1\).

Proof

We can see that the combined size of the input arrays to the recursive call is at most \( n - \left\lfloor \frac{\text{length1}}{2} \right\rfloor \), and since \( \text{length1} \geq \left\lceil \frac{n}{2} \right\rceil \), this is

\[ \leq n - \left\lfloor { \frac{ { \left\lceil \frac{n}{2} \right\rceil } }{2} } \right\rfloor \tag{1}\label{1} \]

If we let \(n = 4q + r\), where \(0 \leq r \leq 3\), then \( \left\lceil \frac{n}{2}\right\rceil \) is equal to \(2q + \left\lceil \frac{r}{2} \right\rceil \). Then, dividing by \(2\) and applying the floor function gets us

\[q + \left\lfloor { \frac{ { \left\lceil \frac{r}{2} \right\rceil } }{2} } \right\rfloor \]

This means that the right side of equation (1) is equal to

\[ 3q + r - \left\lfloor { \frac{ { \left\lceil \frac{r}{2} \right\rceil } }{2} } \right\rfloor \tag{2}\label{2} \]

Whereas \( \left\lceil \frac{3n}{4}\right\rceil + 1 \) is equal to

\[ 3q + \left\lceil \frac{3r}{4} \right\rceil + 1 \tag{3}\label{3} \]

Checking the cases \(r = 0, 1, 2, 3\): the quantity (2) equals \(3q\), \(3q+1\), \(3q+2\), and \(3q+2\), while (3) equals \(3q+1\), \(3q+2\), \(3q+3\), and \(3q+4\), respectively. So \( n - \left\lfloor { \frac{ { \left\lceil \frac{n}{2} \right\rceil } }{2} } \right\rfloor \) is less than or equal to \( \left\lceil \frac{3n}{4} \right\rceil + 1\) in every case.

QED

Using this result we can show that, for large enough \(n\), \( n - \left\lfloor { \frac{ { \left\lceil \frac{n}{2} \right\rceil } }{2} } \right\rfloor \) is also less than or equal to \( \left\lceil \frac{n}{1.3} \right\rceil \) (intuitively, because \( \frac{3}{4} = 0.75 \) is less than \( \frac{1}{1.3} \approx 0.769 \)).
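If you’d rather check these inequalities numerically than on paper, here is a small throwaway loop (not part of the sorting code) that tests them with integer arithmetic:

#include <cstdint>
#include <cstdio>

int main()
{
  for (int64_t n = 1; n <= 1000000; n++)
  {
    int64_t lhs = n - ((n + 1) / 2) / 2;     // n - floor(ceil(n/2) / 2)
    int64_t bound_a = (3 * n + 3) / 4 + 1;   // ceil(3n/4) + 1
    int64_t bound_b = (10 * n + 12) / 13;    // ceil(n/1.3) = ceil(10n/13)
    if (lhs > bound_a)
      printf("ceil(3n/4) + 1 bound fails at n = %lld\n", (long long)n);
    if (lhs > bound_b)
      printf("ceil(n/1.3) bound fails at n = %lld\n", (long long)n);
  }
  printf("checked all n up to 1000000\n");
  return 0;
}

Neither bound fails anywhere in that range.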

If we let \(T(n)\) represent, as usual, the number of steps it takes kth_element to finish on an input whose combined length is \(n\), then we have

\[T(n) \leq cn^0 + T(\left\lceil \frac{n}{1.3} \right\rceil) \tag{4}\label{4}\]

If we apply inequality (4) above with the “Almost the Master” Theorem from More on Sorting and Analyzing Run Time Complexity, then we see that \(T(n) \in O( \log_2(n) ) \).
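If you’d like to see that bound without invoking the theorem, unrolling the recurrence (and ignoring the ceilings, which only affect constants) gives

\[ T(n) \;\leq\; c + T\left(\frac{n}{1.3}\right) \;\leq\; 2c + T\left(\frac{n}{1.3^2}\right) \;\leq\; \cdots \;\leq\; c\log_{1.3}(n) + T(1) = \frac{c}{\log_2(1.3)}\,\log_2(n) + T(1) \]

which is \(O(\log_2(n))\) as claimed.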

Now we’re ready to see how we use kth_element to make parallel_merge2 work. Here’s our implementation.

parallel_merge2

#include <cstdint>
#include <omp.h>

// MERGE_BUCKET_MULTIPLIER and serial_merge are defined as in part II;
// kth_element is the function given above
void parallel_merge2(int64_t* array, int64_t* array_copy,
  int64_t length, int16_t n_threads)
{
  if (n_threads < 2 || length <= n_threads*MERGE_BUCKET_MULTIPLIER*32)
  {
    int64_t h = length / 2;
    serial_merge(array_copy, &array_copy[h], array, h, length - h);
    return;
  }
  
  // note: variable-length arrays are a GCC/Clang extension, not standard C++
  int64_t l_indices[n_threads*MERGE_BUCKET_MULTIPLIER + 1];
  int64_t r_indices[n_threads*MERGE_BUCKET_MULTIPLIER + 1];

  int64_t h = length / 2;
  int64_t i, j, k, k_max, e1, e2, n;

  l_indices[0] = 0;
  r_indices[0] = 0;
  l_indices[n_threads*MERGE_BUCKET_MULTIPLIER] = h;
  r_indices[n_threads*MERGE_BUCKET_MULTIPLIER] = length - h;

  /*
    In this parallel for loop, we calculate the endpoints of
    n_threads*MERGE_BUCKET_MULTIPLIER non-overlapping
    intervals. The first n_threads*MERGE_BUCKET_MULTIPLIER - 1
    of them all have the same length,
    length / (n_threads*MERGE_BUCKET_MULTIPLIER),
    while the last one runs from the final computed endpoint
    to the end of the array.
  */
  omp_set_nested(1);
  #pragma omp parallel for private(i, j, k, e1, e2) \
    shared(array, array_copy, length, n_threads)
  for (i = 1; i < n_threads*MERGE_BUCKET_MULTIPLIER; i++)
  {
    k = i*(length / (n_threads*MERGE_BUCKET_MULTIPLIER));
    int16_t array_number = 0;
    kth_element(array_copy, &array_copy[h], k, h, length - h, e1, e2, array_number);

    if (array_number == 0)
    {
      l_indices[i] = e1;
      r_indices[i] = k - e1;
    }
    else
    {
      r_indices[i] = e2;
      l_indices[i] = k - e2;
    }
  }
  /*
    In this parallel for loop, we perform serial_merge on the
    n_threads*MERGE_BUCKET_MULTIPLIER non-overlapping
    intervals.
  */
  omp_set_nested(1);
  #pragma omp parallel for private(i, j, k, k_max, e1, e2, n) \
    shared(array, array_copy, length, n_threads, h, l_indices, r_indices)
  for (i = 0; i < n_threads*MERGE_BUCKET_MULTIPLIER; i++)
  {
    k  = l_indices[i] + r_indices[i];
    e1 = l_indices[i];
    e2 = r_indices[i];

    serial_merge(&array_copy[e1], &array_copy[e2 + h], &array[e1 + e2], 
      l_indices[i+1] - e1, r_indices[i+1] - e2);
  }
}

As mentioned in the comments, the first n_threads*MERGE_BUCKET_MULTIPLIER - 1 of the non-overlapping intervals all have exactly the same length. Only the last interval can differ, since it runs from the final computed endpoint to the end of the array. Let

$$ \begin{align} l &= \text{length} \\ t &= \text{n\_threads} \\ B &= \text{MERGE\_BUCKET\_MULTIPLIER} \\ \end{align} $$

then each of the first \(tB - 1\) intervals has length \( \left\lfloor \frac{l}{tB} \right\rfloor \), and the last interval has length

\[ l - (tB - 1) \left\lfloor \frac{l}{tB} \right\rfloor \tag{5}\label{5} \]

Now let \(l = tB \cdot q + r \), where \( 0 \leq r < tB \). Then \( \left\lfloor \frac{l}{tB} \right\rfloor = q \), so equation (5) becomes

\[ (tB)q + r - (tB - 1)q = q + r \]

In other words, the last interval is longer than the others by exactly \(r\) elements, and since \(r < tB\), that excess is bounded by a constant which does not depend on the length of the input array. For example, if \(l = 103\) and \(tB = 10\), the endpoints are \(0, 10, 20, \ldots, 90, 103\), and the last interval has length \(13 = q + r\).

This means that the loads for the non-overlapping intervals are evenly balanced, except for the last one, which has to merge at most \(tB - 1\) more elements than the others. Since each iteration of each for loop handles a different interval, we don’t have to worry about concurrency issues, and there is no need for semaphores! This lets the sorting function achieve a high degree of parallelism. Indeed, in the real-world test case of sorting an array of 200 million elements, it completed the task in 4.2 seconds. This was on the exact same input with which the other algorithms were tested. Again, we used 8 cores and 16 threads on an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz processor. Nothing exceptionally fancy.

In all the tests, we counted only the milliseconds it took the sorting function itself to complete; we did not include the time it took to load the data into memory. The comparisons between the functions were therefore reasonably fair. To recap, we have

Algorithm                                      Time            Speed-up
Parallel merge sort with “parallel merge 2”    4.2 seconds     4.8x
Parallel merge sort with “parallel merge”      6.5 seconds     3.1x
Serial merge sort                              20.0 seconds    --
Serial quick sort                              25.4 seconds    --
Parallel quick sort “plus”                     87.7 seconds    9.2x
Serial quick sort “plus”                       811.4 seconds   --

You can see that parallel_quick_sort_plus actually achieved good parallelism, with a 9.2x speedup on 16 threads. Its poor overall performance is simply a result of the underlying serial function’s poor performance. Don’t use it in real-world applications!

That’s it for parallel sorting, at least for now. If you haven’t read them yet, check out the first two entries on this topic: Parallel Sorting in Theory and in Practice (I) and Parallel Sorting in Theory and in Practice (II).