Parallel Sorting in Theory and in Practice II

As promised, this is the second entry on parallel sorting. In this entry, we’ll implement merge_sort, and then give two different ways to make it run in parallel. The first one will be a bit simpler than the second one.

Without further ado, here’s our implementation of merge_sort in C++:

merge_sort()

// Forward declarations for the functions defined further down //
void merge_sort(int64_t* array, int64_t* array_copy, int64_t length);
void serial_merge(int64_t* l_array_copy, int64_t* r_array_copy,
  int64_t* array, int64_t l_length, int64_t r_length);

// Polymorphic wrapper //
void merge_sort(int64_t* array, int64_t length)
{
  // Scratch buffer used by the recursion //
  int64_t* array_copy = new int64_t[length];

  merge_sort(array, array_copy, length);

  delete[] array_copy;
}

// The real function //
void merge_sort(int64_t* array, int64_t* array_copy, int64_t length)
{
  if (length < 2)
  {
    return;
  }
  int64_t h = length / 2;
  int64_t i;

  // Copy the input so the halves can be sorted out of place //
  for(i = 0; i < length; i++)
  {
    array_copy[i] = array[i];
  }

  // Sort each half of array_copy, using array as scratch space //
  merge_sort(array_copy, array, h);
  merge_sort(&array_copy[h], &array[h], length - h);

  // Merge the two sorted halves back into array //
  serial_merge(array_copy, &array_copy[h], array, h, length - h);
}

This is a great introduction to recursion. It starts by copying the original array to array_copy, then recursively sends the left half of array_copy to merge_sort, as well as the right half. So when those two calls finish, the left half of array_copy (everything from 0 to h-1) is sorted among itself, and likewise everything from h to length-1 is sorted among itself. For example, if our input was array = {2,0,1,4,3,4,3,2,1,0}, then after the two recursive calls to merge_sort we would have array_copy = {0,1,2,3,4,0,1,2,3,4}, because the two halves have been sorted.

The function is finished when serial_merge has finished. Here’s our implementation of that function.

serial_merge()

void serial_merge(int64_t* l_array_copy, int64_t* r_array_copy,
  int64_t* array, int64_t l_length, int64_t r_length)
{
  int64_t i, j, k;
  i = 0;
  j = 0;
  k = 0;
  // Take the smaller of the two front elements, with ties going to the left half //
  while((i < l_length) && (j < r_length))
  {
    if(l_array_copy[i] <= r_array_copy[j])
    {
      array[k] = l_array_copy[i];
      i++;
      k++;
    }
    else
    {
      array[k] = r_array_copy[j];
      j++;
      k++;
    }
  }
  // Copy over whatever remains in the half that was not exhausted //
  while(i < l_length)
  {
    array[k] = l_array_copy[i];
    i++;
    k++;
  }
  while(j < r_length)
  {
    array[k] = r_array_copy[j];
    j++;
    k++;
  }
}

You’ll notice three while loops. In the first one we step through l_array_copy with index i and r_array_copy with index j. That loop exits when either i or j gets to the end of its respective array. What’s going on is that we compare what’s at the \(i^{th}\) position of l_array_copy with what’s at the \(j^{th}\) position of r_array_copy, and whichever is smaller (with ties going to what’s in l_array_copy) gets copied over to array. That’s the heart of the entire algorithm. The second while loop just takes care of copying whatever is left over in l_array_copy if j got to the end of r_array_copy before i got to the end of l_array_copy. Similarly, the final while loop takes care of copying over the remainder of r_array_copy if i got to the end of l_array_copy first.

Notice that at every step of every while loop, either i or j, and always k, is increased. Thus all three while loops together take at most l_length + r_length steps to complete. We see from merge_sort that l_length = h and r_length = length - h, so their sum is exactly length.

Therefore, if \(T(n)\) represents the number of steps it takes merge_sort to finish on an input array of length \(n\), then we see that

\[ T(n) = cn + 2T(\left\lceil \frac{n}{2} \right\rceil) \tag{1}\label{1} \]

By our “Almost the Master” Theorem (from More on Sorting and Analyzing Run Time Complexity), we see that \(T(n) \in O(n\log_2(n)) \).
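If you want to see this without invoking the theorem, you can unroll the recurrence (ignoring the ceilings and assuming \(n\) is a power of two for simplicity):

\[ T(n) = cn + 2T\left(\frac{n}{2}\right) = cn + 2\left(c\frac{n}{2} + 2T\left(\frac{n}{4}\right)\right) = 2cn + 4T\left(\frac{n}{4}\right) = \cdots = cn\log_2(n) + nT(1), \]

which is indeed in \(O(n\log_2(n))\).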

merge_sort is an excellent algorithm. You should use it whenever you have the available memory. That’s the main drawback with using it: it requires us to make a copy of our original input. If your memory budget is tight, this might be a deal breaker. It has been for me at times.
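To make the discussion concrete, here is a small test driver you could use to exercise the two listings above on the example input from earlier. The main function and the include and declaration lines are ours, added purely for illustration; everything else assumes the merge_sort and serial_merge listings are compiled in the same file.

#include <cstdint>
#include <iostream>

// Declared and defined in the listings above //
void merge_sort(int64_t* array, int64_t length);

int main()
{
  int64_t array[] = {2,0,1,4,3,4,3,2,1,0};
  int64_t length = sizeof(array) / sizeof(array[0]);

  merge_sort(array, length);

  for (int64_t i = 0; i < length; i++)
  {
    std::cout << array[i] << " ";
  }
  std::cout << std::endl;   // prints: 0 0 1 1 2 2 3 3 4 4
  return 0;
}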

Now we’re going to dive into the much deeper waters of trying to make this thing run efficiently in parallel, something which at first probably seems impossible. It certainly did to us. Before we get going, though, we’ll need some useful functions. The first two are least_upper_bound and min_greater_than; both are plain binary searches over a sorted range.

least_upper_bound() and min_greater_than()

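Here is a minimal sketch of the two functions, written to match how they’re called from parallel_merge below; the exact signatures and variable names here are assumptions inferred from those call sites.

// Returns the index of the first element of arr[0..length) that is
// greater than or equal to value, or length if there is no such element. //
int64_t least_upper_bound(int64_t* arr, int64_t value, int64_t length)
{
  int64_t low = 0;
  int64_t high = length;
  while (low < high)
  {
    int64_t mid = low + (high - low) / 2;
    if (value <= arr[mid])
    {
      high = mid;        // arr[mid] qualifies; look further left //
    }
    else
    {
      low = mid + 1;     // arr[mid] is too small; look right //
    }
  }
  return low;
}

// Returns the index of the first element of arr[0..length) that is
// strictly greater than value, or length if there is no such element. //
int64_t min_greater_than(int64_t* arr, int64_t value, int64_t length)
{
  int64_t low = 0;
  int64_t high = length;
  while (low < high)
  {
    int64_t mid = low + (high - low) / 2;
    if (value < arr[mid])
    {
      high = mid;        // arr[mid] qualifies; look further left //
    }
    else
    {
      low = mid + 1;     // arr[mid] is too small (or tied); look right //
    }
  }
  return low;
}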

The comments in the code explain exactly what these two functions do. They’re very similar; the only difference is that the first involves \( \leq\), while the second involves \( < \). If \(T(n) \) represents the number of steps it takes each one to finish with an input array of length \(n\), then it’s easily seen that \(T(n) = cn^{0} + T(\left\lceil \frac{n}{2}\right\rceil) \), which means that \(T(n) \in O( \log_2(n)) \) (again, using the “Almost the Master” Theorem).

In order to allow us to merge two arrays in parallel, we’re going to handle the indices i, j, and k from serial_merge differently. We’re going to step through the i index and the j index in completely separate for loops, using the value of one to compute the value of the other. What we mean is that we will have a for (i = 0; i < h; i++) loop, and in that loop we’ll calculate j and k as functions of i. Then we’ll have a completely separate for (j = h; j < length; j++) loop, and in that loop we’ll calculate i and k as functions of j. Calculating j as a function of i will take \(O(\log_2(\text{length} - \text{h})) \) steps, and calculating i as a function of j will take \(O(\log_2(\text{h}))\) steps. For example, with array_copy = {0,1,2,3,4,0,1,2,3,4} and h = 5, take i = 2 (the left-half value 2): the first right-half element that is \( \geq 2\) sits at offset j = 2, so that 2 lands at k = i + j = 4 in the merged array {0,0,1,1,2,2,3,3,4,4}. Likewise, for j = 7 (the right-half value 2), the first left-half element strictly greater than 2 is at i = 3, so it lands at k = i + j - h = 5, just after its left-half twin.

Here is our implementation.

parallel_merge()

void parallel_merge(int64_t* array, int64_t* array_copy, int64_t length, int16_t n_threads)
{
  int64_t h = length / 2;

  // For small inputs (or a single thread) the serial merge is faster //
  if (n_threads < 2 || length <= n_threads*1024)
  {
    serial_merge(array_copy, &array_copy[h], array, h, length - h);
    return;
  }

  int64_t i,j,k;

  // Place every element of the left half directly into its final position //
  omp_set_nested(1);
  #pragma omp parallel for private(i, j, k) shared(h, length, array, array_copy)
  for (i = 0; i < h; i++)
  {
    j = least_upper_bound(&array_copy[h], array_copy[i], length - h);
    k = i + j;
    array[k] = array_copy[i];
  }

  // Do the same for every element of the right half //
  omp_set_nested(1);
  #pragma omp parallel for private(i, j, k) shared(h, length, array, array_copy)
  for (j = h; j < length; j++)
  {
    i = min_greater_than(array_copy, array_copy[j], h);
    k = (i + j - h);
    array[k] = array_copy[j];
  }
}

Since \(\text{length} - \text{h} \) and \(\text{h}\) are both \( \leq \left\lceil \frac{\text{length}}{2}\right\rceil\), and each of the \(n\) loop iterations performs one binary search over at most that many elements, we can see that \(T(n)\), the number of steps it takes parallel_merge to finish on an input of size \(n = \text{length}\), is in \(O(n \log_2(n))\). We have to use two different functions, least_upper_bound and min_greater_than, because we have to let ties in our merging consistently go to one half of array_copy. We have arbitrarily chosen the left half of the array to win on ties. Therefore, for a given i the j index should point to the earliest element of the right half of array_copy which is greater than or equal to array_copy[i]. When we’re calculating i for a given j, on the other hand, we have to point to the earliest element of the left half which is strictly greater than array_copy[j].

Now we’re ready to implement parallel_merge_sort. Here’s our implementation:

parallel_merge_sort()

// Forward declaration of the two-buffer version defined below //
void parallel_merge_sort(int64_t* array, int64_t* array_copy,
  int64_t length, int16_t n_threads);

// Polymorphic wrapper //
void parallel_merge_sort(int64_t* array, int64_t length, int16_t n_threads)
{
  if (!array) return;
  if (length < 2) return;
  int16_t max_threads = omp_get_max_threads();
  int16_t threads = (n_threads < max_threads) ? n_threads : max_threads;

  int64_t* array_copy = new int64_t[length];

  parallel_merge_sort(array, array_copy, length, threads);

  // clean up!!
  delete[] array_copy;
}

// The real function //
void parallel_merge_sort(int64_t* array, int64_t* array_copy,
  int64_t length, int16_t n_threads)
{
  if (length < 2)
  {
    return;
  }

  int64_t h = length / 2;
  int64_t i;

  omp_set_num_threads(n_threads);

  // Copy the input in parallel //
  #pragma omp parallel for private(i) shared(array_copy, array)
  for (i = 0; i < length; i++)
  {
    array_copy[i] = array[i];
  }

  if (n_threads > 1)
  {
    // Sort the two halves of array_copy in parallel, splitting the threads between them //
    omp_set_nested(1);
    #pragma omp parallel for private(i) shared(array, array_copy, h, n_threads)
    for (i = 0; i < 2; i++)
    {
      if (i == 0)
      {
        parallel_merge_sort(array_copy, array, h, n_threads / 2);
      }
      else
      {
        parallel_merge_sort(&array_copy[h], &array[h],
          length - h, n_threads - (n_threads / 2));
      }
    }
  }
  else
  {
    // Down to one thread: fall back to the serial version //
    merge_sort(array_copy, array, h);
    merge_sort(&array_copy[h], &array[h], length - h);
  }

  parallel_merge(array, array_copy, length, n_threads);
}

Because the parallel_merge function we’ve provided above takes \(O(n\log_2(n))\) steps to complete, \(T(n)\) for the entire function satisfies \( T(n) = cn\log_2(n) + 2 T( \left\lceil \frac{n}{2}\right\rceil) \), which means that, by case 4 of the “Almost the Master” Theorem (provided at the very end of Parallel Sorting in Theory and in Practice (I)), \(T(n) \in O(n \log_2^{2}(n)) \). This means that asymptotically this will perform worse than even parallel_quick_sort_plus, even though it might require a very large input before it’s actually slower. For example, on an input array of 200 million elements, it runs in 6.5 seconds with 8 cores and 16 threads. This is the same Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz processor, and in fact the same test input as before, so it’s fair to compare the times. That’s 13.5 times faster than parallel_quick_sort_plus!
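If you want to reproduce a timing like that on your own machine, a sketch along these lines will do. The random input, the timer, and main are ours, added purely for illustration; the listings above also need #include <omp.h> and #include <cstdint> at the top of the file, and the whole thing compiles with -fopenmp.

#include <cstdint>
#include <chrono>
#include <iostream>
#include <random>

// Declared and defined in the listings above //
void parallel_merge_sort(int64_t* array, int64_t length, int16_t n_threads);

int main()
{
  const int64_t length = 200000000;   // 200 million elements //
  int64_t* array = new int64_t[length];

  // Fill the array with pseudo-random values //
  std::mt19937_64 gen(42);
  for (int64_t i = 0; i < length; i++)
  {
    array[i] = static_cast<int64_t>(gen());
  }

  auto start = std::chrono::steady_clock::now();
  parallel_merge_sort(array, length, 16);
  auto stop = std::chrono::steady_clock::now();

  std::chrono::duration<double> elapsed = stop - start;
  std::cout << "sorted in " << elapsed.count() << " seconds" << std::endl;

  delete[] array;
  return 0;
}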

Unfortunately, we won’t get to the even faster version of parallel_merge_sort in this entry. It will have to wait for the next one.

That’s all for this entry!