Parallel Sorting in Theory and in Practice I


We’re going to begin our discussion of parallel algorithms by giving a parallel version of quick_sort_plus. In the next entry (Parallel Sorting in Theory and in Practice (II)), we’ll introduce another sorting algorithm, merge_sort, and show how it can be made to run in parallel in a couple of different ways. The first is simpler, but isn’t as fast as the second, more complicated implementation. The parallel function we provide in this entry probably shouldn’t be used in a real-world situation. It’s not actually that fast. What’s more, the one advantage the serial version of quick_sort has over merge_sort (the fact that it doesn’t require much extra memory) doesn’t even hold here. That’s because it calls a function, parallel_pivot, which has to make a copy of the input array, and so it uses about as much extra memory as merge_sort does. We start with parallel_quick_sort_plus because it’s simpler to understand than parallel_merge_sort.

Here’s a parallel implementation of quick_sort_plus in C++.

Parallel quick_sort_plus()

#include <omp.h>     // OpenMP runtime functions (omp_get_max_threads, etc.)
#include <cstdint>   // int64_t, int16_t

// quick_sort (from the earlier entries) and parallel_kth_smallest (below)
// are assumed to be declared before this point.
void parallel_quick_sort_plus(int64_t* array, int64_t length, int16_t n_threads)
{
  if (!array) return;
  if (length < 2) return;

  int16_t max_threads = omp_get_max_threads();
  int16_t threads = (max_threads < n_threads) ? max_threads : n_threads;
  int16_t l_threads, r_threads;

  omp_set_num_threads(threads);

  int64_t h = length / 2;

  parallel_kth_smallest(array, h, length, threads);

  if (threads > 1)
  {
    l_threads = threads / 2;
    r_threads = threads - l_threads;

    int i;
    omp_set_nested(1);
    #pragma omp parallel for private(i) shared(l_threads, r_threads, array, length)
    for (i = 0; i < 2; i++)
    {
      if (i == 0)
      {
        omp_set_num_threads(l_threads);
        parallel_quick_sort_plus(array, h, l_threads);
      }
      else
      {
        omp_set_num_threads(r_threads);
        parallel_quick_sort_plus(&array[h], length - h, r_threads);
      }
    }
  }
  else
  {
    // do it serially //
    quick_sort(array, h);
    quick_sort(&array[h], length - h);
  }
}

We use omp.h to make it run in parallel. You probably noticed the bizarre use of for (i = 0; i < 2; i++) with an if-else statement. We didn’t want to write it that way, but we couldn’t get nested parallelism to work when we tried implementing the function with parallel sections or tasks, and without nested parallelism the recursive calls wouldn’t run in parallel.
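For completeness, here’s a minimal sketch of how you might call it. The test harness below (main, the random fill, the choice of 8 threads) is purely illustrative.

Example: calling parallel_quick_sort_plus()

#include <cstdint>
#include <cstdlib>
#include <vector>

// Declared earlier in this entry; in a real project it would live in a header.
void parallel_quick_sort_plus(int64_t* array, int64_t length, int16_t n_threads);

int main()
{
  const int64_t n = 1 << 20;           // about a million elements
  std::vector<int64_t> data(n);
  for (int64_t i = 0; i < n; i++)
    data[i] = rand();                  // any test data will do

  // Ask for up to 8 threads; the function clamps this to omp_get_max_threads().
  parallel_quick_sort_plus(data.data(), n, 8);
  return 0;
}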

There’s nothing earth-shattering going on in parallel_quick_sort_plus. If you can follow the serial version of quick_sort_plus (given in Finding the Median and an Application to Sorting), then this should make sense.

You’ll notice we call parallel_kth_smallest; it’s a bit more complicated than the serial version. It provides the evenly-split pivot, which makes quick_sort_plus run in a guaranteed \(O(n\log_2(n))\) number of steps.
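As a quick reminder of why an evenly-split pivot gives that guarantee: finding the median and pivoting on it each take a number of steps proportional to \(n\), and the pivot splits the array exactly in half, so the running time satisfies

\[ T(n) = cn + 2T\left(\left\lceil \frac{n}{2}\right\rceil\right) \in O(n\log_2(n)) \]

by the “Almost the Master Theorem”. The linear-time median step is exactly what parallel_kth_smallest provides.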

Here’s our implementation of that function.

parallel_kth_smallest()

void parallel_kth_smallest(int64_t* array, int64_t k,
  int64_t length, int16_t n_threads)
{
  // CHUNK_SIZE is usually 5
  if (length <= CHUNK_SIZE)
  {
    bubble_sort(array, length);
    return;
  }
  int16_t max_threads = omp_get_max_threads();
  int16_t threads = (max_threads < n_threads) ? max_threads : n_threads;

  omp_set_num_threads(threads);

  int64_t m_length = (length + CHUNK_SIZE - 1) / CHUNK_SIZE; // ceiling of length / CHUNK_SIZE
  int64_t* medians = new int64_t[m_length];
  int64_t i, j, tmp_length, m_index;

  omp_set_nested(1);
  #pragma omp parallel for private(i, j, tmp_length, m_index) \
    shared(array, medians, m_length, length)
  for (j = 0; j < m_length; j++)
  {
    i = CHUNK_SIZE*j;
    tmp_length = ((i + CHUNK_SIZE) <= length) ? CHUNK_SIZE : length - i;
    m_index = i + (tmp_length / 2);
    bubble_sort(&array[i], tmp_length);
    medians[j] = array[m_index];
  }

  // Now for the recursive step
  parallel_kth_smallest(medians, m_length / 2, m_length, threads);

  // Now for parallel pivot!

  m_index = parallel_pivot(array, medians[m_length / 2], length, threads);

  // clean up medians!
  delete[] medians;

  if (m_index == k) return;

  // k is to the right side
  if (m_index < k)
  {
    parallel_kth_smallest(&array[m_index+1], k - (m_index + 1),
      length - (m_index + 1), threads);
    return;
  }

  // k is to the left side
  parallel_kth_smallest(array, k, m_index, threads);
}

The main difference between this and the serial version is that we compute the medians array in parallel. Because we’re computing in parallel, we can’t move the medians into the initial segment of array (different threads would end up writing into chunks that other threads are still working on), so we have to allocate a new array for them. We’ll have to allocate another array for similar reasons in parallel_pivot.

Here’s our implementation of that function.

parallel_pivot()

int64_t parallel_pivot(int64_t* array, int64_t value,
  int64_t length, int16_t n_threads)
{
  if (length < 2) return 0;
  int16_t max_threads = omp_get_max_threads();
  int16_t threads = (max_threads < n_threads) ? max_threads : n_threads;

  //omp_set_num_threads(threads);

  int64_t i, tid, offset;
  int64_t smaller_count[threads+1];
  int64_t larger_count[threads+1];

  // Doing this in parallel isn't worth it.
  for (i = 0; i < threads+1; i++)
  {
    smaller_count[i] = 0;
    larger_count[i] = 0;
  }

  #pragma omp parallel for private(i, tid) \
    shared(smaller_count, larger_count, array, length)
  for (i = 0; i < length; i++)
  {
    tid = omp_get_thread_num();
    if (array[i] < value)
    {
      smaller_count[tid+1]++;
    }
    else
    {
      larger_count[tid+1]++;
    }
  }

  // Doing this in parallel isn't worth it.
  for (i = 1; i < threads+1; i++)
  {
    smaller_count[i] += smaller_count[i-1];
    larger_count[i] += larger_count[i-1];
  }
  /*
  smaller_count[threads] = the total number of elements less than value
  this is the value we will return.
  */

  int64_t* array_copy = new int64_t[length];

  #pragma omp parallel for private(i, tid, offset) \
    shared(smaller_count, larger_count, array, array_copy, length, threads)
  for (i = 0; i < length; i++)
  {
    tid = omp_get_thread_num();
    if (array[i] < value)
    {
      // offset equals the old number, then we increase it by 1
      offset = smaller_count[tid]++;
    }
    else
    {
      // offset equals the old number, then we increase it by 1
      offset = smaller_count[threads] + (larger_count[tid]++);
    }
    array_copy[offset] = array[i];
    
  }

  #pragma omp parallel for private(i) shared(array, array_copy)
  for (i = 0; i < length; i++)
  {
    array[i] = array_copy[i];
  }

  // clean up!!
  delete[] array_copy;

  return smaller_count[threads];
}

We perform a “pivot” in parallel, i.e. we move everything less than value to the left section of array and everything else to the right. We do this by first creating two arrays of counters, with one slot per thread in each. Each thread counts how many smaller values it sees and how many larger values it sees. This works in parallel because each thread is updating different elements of the two arrays. Then, once we’re done with that, we perform what is essentially a prefix sum (sort of like an integral). We end up with

\[ \text{smaller\_count}[i] = \sum\limits_{j = 0}^{i-1} \text{smaller}_j\]

where \(\text{smaller}_j\) is the number of elements less than value seen by thread \(j\). It’s similar for \(\text{larger\_count}[i]\).

As long as the scheduling of the threads over the elements of array stays the same from one for loop to the next, we can move the elements from array to array_copy without having to worry about two threads writing to the same place. Fortunately for us, the default omp schedule does stay the same for for loops of the same length. If you’re using something other than omp, or just want to make sure this works properly, you’ll have to set the schedule yourself (e.g. schedule(static)).

That move is what we do in the next for loop; then a final for loop copies the values of array_copy back into array, leaving it in the state it should be in.

This function returns smaller_count[threads], which is equal to the total number of elements of array which are strictly less than value. With this, we have made quick_sort_plus parallel.
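To make the offset bookkeeping concrete, here’s a small serial sketch of the same two-pass idea (count, prefix sum, scatter). The “threads” are simulated by contiguous blocks of the array, and two_pass_pivot_sketch is only meant to illustrate the arithmetic, not to replace the OpenMP version above.

two_pass_pivot_sketch() (illustrative)

#include <algorithm>
#include <cstdint>
#include <vector>

// Serial sketch of the bookkeeping in parallel_pivot: count, prefix-sum, scatter.
// The "threads" are simulated by contiguous blocks of the array, which play the
// role of the fixed loop schedule in the OpenMP version.
int64_t two_pass_pivot_sketch(int64_t* array, int64_t value,
  int64_t length, int16_t threads)
{
  if (threads < 1) threads = 1;
  std::vector<int64_t> smaller_count(threads + 1, 0);
  std::vector<int64_t> larger_count(threads + 1, 0);
  int64_t block = (length + threads - 1) / threads;

  // Pass 1: "thread" t counts the smaller and larger elements in its block.
  for (int16_t t = 0; t < threads; t++)
    for (int64_t i = t * block; i < std::min(length, (t + 1) * block); i++)
      (array[i] < value ? smaller_count[t + 1] : larger_count[t + 1])++;

  // Prefix sums: smaller_count[t] becomes the first output slot for "thread" t,
  // and smaller_count[threads] is the total number of smaller elements.
  for (int16_t t = 1; t <= threads; t++)
  {
    smaller_count[t] += smaller_count[t - 1];
    larger_count[t] += larger_count[t - 1];
  }

  // Pass 2: each "thread" scatters its block into a copy using its own offsets,
  // so no two of them ever write to the same position.
  std::vector<int64_t> copy(length);
  for (int16_t t = 0; t < threads; t++)
    for (int64_t i = t * block; i < std::min(length, (t + 1) * block); i++)
    {
      int64_t offset = (array[i] < value)
        ? smaller_count[t]++
        : smaller_count[threads] + (larger_count[t]++);
      copy[offset] = array[i];
    }

  for (int64_t i = 0; i < length; i++) array[i] = copy[i];

  return smaller_count[threads]; // elements strictly less than value
}

In the OpenMP version, each thread’s “block” is whatever set of iterations the schedule hands it, which is why the schedule has to assign iterations the same way in both parallel for loops.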

Unfortunately, there’s not much of a reason to use parallel_quick_sort_plus, because it’s based on quick_sort_plus, which is quite slow for an \(O(n\log_2(n))\) algorithm. We ran a test, sorting 200 million elements. Ordinary quick_sort finished in 25.4 seconds, while our parallel_quick_sort_plus took a whopping 87.7 seconds with 8 cores and 16 threads (Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz). That’s not a lot of threads, but even if you’re running on a computer with hundreds or perhaps thousands of threads, there’s still no point in using this function, because parallel merge_sort would be faster.

We’ll implement merge_sort in the next entry. For the rest of this entry, we will revisit the “Almost the Master Theorem”. We originally proved it in More on Sorting and Analyzing Run Time Complexity.

Let’s consider what happens when \(T(n) = f(n) + aT(\left\lceil \frac{n}{b}\right\rceil)\) is a recurrence relation, \(f(n) = cn^{\alpha}\log_2(n)\) and \(a = b^{\alpha}\). In this case the “Almost the Master Theorem” tells us that if we instead consider \(f_{\epsilon}(n) = cn^{\alpha + \epsilon}\), for some tiny \(\epsilon > 0\), then \(T(n) \in O(f_{\epsilon}(n)) \). That isn’t quite satisfying.

Notice that when we consider \(T(b^k)\), as we did in More on Sorting and Analyzing Run Time Complexity (equation 5), we get

\[ T(b^k) = c\sum\limits_{i=0}^{k} a^{i}(b^{\alpha})^{k-i} \log_b(b^{k-i})\log_2(b) \tag{1}\label{1}\]

which is identical to equation (5) from that entry, except for the \(\log\) terms at the end. Also, notice that \(\log_2(b)\) can be absorbed into the constant \(c\), so the base of the logarithm can be changed to \(b\) without affecting much.

Since we assumed that \(a = b^{\alpha}\), we have

\[ a^{i}(b^{\alpha})^{k-i} = (b^{\alpha})^{i} (b^{\alpha})^{k-i} = (b^{\alpha})^k = (b^k)^{\alpha} \tag{2}\label{2}\]

If we let \(c' = c\log_2(b)\), then combining equations (1) and (2) gives us

\[ c'(b^k)^{\alpha} \sum\limits_{i=0}^{k} \log_b(b^{k-i}) \tag{3}\label{3} \]

This is easily seen to give us \[ c'(b^k)^{\alpha} \sum\limits_{i=0}^{k} (k-i) \tag{4}\label{4} \]

If we do the sum in the opposite order, we get

\[ c'(b^k)^{\alpha} \sum\limits_{i=0}^{k} i \tag{5}\label{5} \]

This is a nice arithmetic sum, which is equal to

\[ c'(b^k)^{\alpha} \frac{k(k+1)}{2} \tag{6}\label{6}\] Substituting \(n = b^k\) then gives us \[ T(n) = c'n^{\alpha} \frac{ \log_b(n)(\log_b(n) + 1) }{ 2 } \tag{7}\label{7}\]

So far, we’ve only shown that this holds when \(n\) is a power of \(b\), but in a very similar way to how we proved case 3 of the “Almost the Master Theorem” in More on Sorting and Analyzing Run Time Complexity, combining lemma (2) with lemma (1) from that entry, we have

\[ T(n) \leq c'b^{\alpha}n^{\alpha} \left[ \frac{(\log_b(n)+1)\log_b(n)}{2} + \log_b(n) + 1 \right] \tag{8}\label{8}\]

Since the bracketed expression is a quadratic polynomial in \(\log_b(n)\), and \(\log_b(n) = \log_2(n)/\log_2(b)\), this can be shown to be in

\[O(\log_2(n)^{2} n^{\alpha}) \tag{9}\label{9}\]

So, for now, this will be case 4 of the “Almost the Master Theorem”.

To recap:

Case #4 (almost the master theorem)

Let \(T(n) = f(n) + aT(\left\lceil \frac{n}{b}\right\rceil) \), where \( f(n) = cn^{\alpha}\log_2(n)\) and \(a = b^{\alpha}\), then

\(T(n) \in O(\log_2(n)^{2}n^{\alpha}) \)

Proof

Given above.
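For a concrete example, take \(a = b = 2\) and \(f(n) = cn\log_2(n)\), so that \(\alpha = 1\) and \(a = b^{\alpha}\). Case 4 then gives

\[ T(n) = cn\log_2(n) + 2T\left(\left\lceil \frac{n}{2}\right\rceil\right) \in O(n\log_2(n)^{2}) \]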

That’s it for this entry. We continue this discussion in the next entry, Parallel Sorting in Theory and in Practice (II).