CMPS 144/240 Introduction to Parallel Computing

In parallel computing, we have the notion of many processors working together, simultaneously, to solve the same problem.

Example 1: Array Initialization

Consider the problem of initializing each element of an int array a[0..N) to its own index (i.e., setting a[i] = i for every i). The obvious solution in Java is this:

for (int i=0;  i != N;  i = i+1)
   { a[i] = i; }

This code segment runs in O(N) time.

Now suppose that we had N processors, rather than only one, and that the processors were identified by the integers 0 through N−1. Suppose, further, that each processor "knew" its own identity. We can imagine processor i being responsible for placing the value i into a[i], so that all array elements could be filled simultaneously! We might write the code (using a "for all ... in parallel" construct) as follows:

for all i in [0..N) in parallel 
   { a[i] = i; }

What the above expresses is the computation in which, simultaneously, each array element is filled with the value corresponding to its own location. Thus, its running time is O(1). We see, then, that, with N processors, an array of length N can be initialized in O(1) time.
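
For concreteness, here is a minimal Java sketch of the same idea, using the standard library's parallel streams to play the role of the "for all ... in parallel" construct (the runtime, not the programmer, decides how the logical iterations map onto physical processors):

import java.util.stream.IntStream;

// Sketch: logical iteration i plays the role of processor i; the
// iterations are independent, so they may execute simultaneously.
static void parallelInit(int[] a) {
   IntStream.range(0, a.length).parallel().forEach(i -> a[i] = i);
}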

Suppose that only N/2 processors were available. Then we could assign processor i (for each i in [0..N/2)) to fill elements 2i and 2i+1. Then the computation would be finished in two units of time rather than one.

More generally, if only N/c processors were available, for some constant c ≥ 1, we could assign processor i, for each i, to fill the c elements in the segment a[c·i .. c·(i+1)). The computation would take c units of time. We conclude that, with O(N) processors, an array of length N can be initialized in O(1) time.

Even more general than this, suppose that N/f(N) processors were available, where f(k) ≤ k for all k. For example, we might have f(k) = √k or f(k) = lg k. Whatever the case, we can imagine partitioning the array into N/f(N) segments, each of size f(N), and assigning processor i to the i-th such segment (i.e., a[f(N)·i .. f(N)·(i+1))). The running time would be O(f(N)).
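
As a concrete illustration, here is a hypothetical Java sketch of this segment-per-processor assignment, where P stands for the number of available processors (N/c, or N/f(N), in the discussion above) and, for simplicity, P is assumed to divide N evenly:

import java.util.stream.IntStream;

// Sketch: P logical processors, each initializing its own segment
// of length N/P. Assumes P divides N evenly.
static void blockInit(int[] a, int P) {
   final int len = a.length / P;   // segment length (c, or f(N), above)
   IntStream.range(0, P).parallel().forEach(i -> {
      for (int j = len * i;  j < len * (i + 1);  j++) {
         a[j] = j;
      }
   });
}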

In addition to running time and number of processors, another interesting measure of a parallel algorithm is its cost, which is the product of its number of processors and running time. The cost of a "sequential" (i.e., non-parallel) algorithm is simply its running time (which is, after all, the product of one —the # of processors— and its running time!).

Above we saw that the obvious parallel algorithms for the "array initialization" problem have cost O(N), corresponding to the cost (i.e., running time) of the best sequential algorithm. Very often, it is difficult or impossible to achieve this, because it requires that, on average, processors remain busy a fixed percentage of the time. In fact, for some problems it is not at all obvious how to achieve ANY significant parallelism. For example, consider the array transformation effected by this code segment:

for (int j=1;  j != N;  j = j+1)
   { a[j] = a[j/2] + a[j-1]; }

The value that ends up at a[100], for example, depends upon the values placed (during previous iterations) into a[50] and a[99], which themselves depend upon values placed (during even earlier iterations) into a[25] and a[49], and a[49] and a[98], respectively, etc., etc. There may be some way of describing, for all i, the final value of a[i] in terms of the original contents of the array a, but it is not at all clear how. Hence, there is no obvious way to fill in a[100] without having first calculated a[99], which itself requires that we first calculate a[98], etc.


Example 2: Sum of Array Elements

To calculate the sum of the elements of an array a[0..N) (using a single processor), we could use the following code segment:

int sum = 0;
for (int j=0;  j != N;  j = j+1)
   { sum = sum + a[j]; }

Its running time is O(N), of course. Suppose that we had N processors. Could we find the sum in constant time? It turns out that we can't! But we can achieve O(lg N) time using N/2 processors, as suggested by the following picture, which illustrates the case N = 8:

5    -3    7     9    2     8    1    -6         (original sequence)
 \   /      \   /      \   /      \   /
  \ /        \ /        \ /        \ /
   2         16         10         -5            (after 1 unit of time)
    \       /             \       /
     \     /               \     /
      \   /                 \   /
       \ /                   \ /
       18                     5                  (after 2 units of time)
         \                   /
           \               /
             \           /
               \       /
                 \   /
                  23                             (after 3 units of time)

The above is intended to suggest that, during the first unit of time, each of four processors adds a distinct pair of elements. This yields an array of length four. (Thus, in one unit of time, the size of the problem has been cut in half! This should remind you of binary search.) During the 2nd time unit, two processors (say, processors 0 and 1) add together the two pairs, yielding a single pair, which is added together (by processor 0, say) during the 3rd (and last) unit of time. Thus, an array of length eight is summed in three steps. Generalizing this to a length N array, during the first unit of time, N/2 pairs are added (in parallel, by, say, the first N/2 processors), yielding an array of length N/2. Then N/4 pairs are added (by the first N/4 processors), yielding an array of length N/4, etc., etc. After lg N steps, the resulting array has length one, and thus the sum of the original array has been computed.

Note: There are lots of details left unspecified here. For example, how does each processor "know" which pair of array elements (if any) to add together (and where to store the result) during a given unit of time? To keep our discussion simple, we ignore these details. End of note.
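
To make those details concrete, here is one hypothetical way of filling them in, sketched in Java: at the step in which the "stride" is s, logical processor p adds a[2sp + s] into a[2sp]; the guard inside the lambda merely handles array lengths that are not powers of two.

import java.util.stream.IntStream;

// Sketch: tree-style parallel sum; a[] is destructively reused to
// hold partial sums. Afterwards, a[0] holds the sum of the original
// elements.
static int parallelSum(int[] a) {
   final int N = a.length;
   for (int s = 1;  s < N;  s = 2*s) {
      final int stride = s;
      final int pairs = (N + 2*stride - 1) / (2*stride);
      // 'pairs' logical processors act at this step, each adding a
      // distinct pair of elements:
      IntStream.range(0, pairs).parallel().forEach(p -> {
         int left = 2 * stride * p;
         if (left + stride < N) { a[left] += a[left + stride]; }
      });
   }
   return a[0];
}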

We conclude that, using O(N) processors, the sum of an array of length N can be computed in O(lg N) time. Multiplying the # of processors by running time, we calculate the cost of such a computation to be O(N lg N), which is, by a factor of lg N, worse than the cost of the sequential algorithm, O(N).

Why is the cost of the parallel algorithm greater? Is it because a greater number of operations is being carried out? The sequential algorithm performs N additions and the same number of assignments. What about the parallel algorithm? During its first step, it performs N/2 additions (and as many assignments); during the second, N/4; during the third, N/8; etc. But, to a close approximation,

N = N/2 + N/4 + N/8 + ... + 1

Thus, the sequential and parallel versions do not differ in terms of how many additions/assignments they perform. So then what is the difference?

The answer has to do with processor utilization, which pertains to the degree to which processors are kept "busy", as opposed to being "idle", during execution. Let M = N/2 refer to the number of processors involved in the parallel algorithm. During its first time step, all M processors add a pair of numbers. But during the 2nd, only M/2 of them do, with the remaining M/2 being idle. During the 3rd, only M/4 of the processors are busy. And so on and so forth, with the number of processors doing anything useful being cut in half from each time step to the next. Indeed, on average, each processor is busy during only two of the lg N time steps, which accounts for the extra lg N factor in the cost of the parallel algorithm in comparison to the sequential algorithm.
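
For example, take N = 8, so that M = 4 processors run for lg 8 = 3 steps, a total of 4 × 3 = 12 processor-steps. Only 4 + 2 + 1 = 7 of those processor-steps perform an addition, so the average processor is busy during 7/4 ≈ 2 of the 3 steps, just as claimed. As N grows, the utilization (N−1)/((N/2)·lg N) ≈ 2/lg N tends to zero.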

Reducing the Cost to O(N):
However, by using a clever technique, we can reduce the cost of the parallel computation to O(N), matching that of the sequential algorithm. Rather than employing N/2 processors, employ only N/lg N of them. The computation now requires two phases. During the first, each of the N/lg N processors sequentially computes the sum of an array segment of length lg N, which takes O(lg N) time and yields an array of length M = N/lg N. During the second, we employ the algorithm described above (using M/2 processors) to find the sum of this shorter array in time O(lg M) = O(lg (N/lg N)), which, it turns out, is the same as O(lg N). (Note: lg (N/lg N) = lg N − lg (lg N), the dominant term of which is lg N.) Thereby, we get a parallel algorithm that uses O(N/lg N) processors and has running time O(lg N), which yields a cost measure of O(N).
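
A hypothetical Java rendering of the two-phase scheme, reusing parallelSum from the sketch above (the segment length is taken to be floor(lg N), with the last segment possibly shorter):

import java.util.stream.IntStream;

// Sketch: cost-optimal parallel sum with ~N/lg N logical processors.
static int costOptimalSum(int[] a) {
   final int N = a.length;
   final int L = Math.max(1, 31 - Integer.numberOfLeadingZeros(N));  // floor(lg N)
   final int P = (N + L - 1) / L;       // number of logical processors
   final int[] partial = new int[P];
   // Phase 1: processor i sums its own segment sequentially, in O(lg N) time.
   IntStream.range(0, P).parallel().forEach(i -> {
      int s = 0;
      for (int j = L * i;  j < Math.min(L * (i + 1), N);  j++) { s += a[j]; }
      partial[i] = s;
   });
   // Phase 2: tree-sum the P partial sums in O(lg P) = O(lg N) time.
   return parallelSum(partial);
}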


Example 3: RankSort

Given an array a[0..N) of, say, int's, define the rank of each element as the # of elements in a[] that are less than it. For example, take the array

    0   1   2   3   4
  +---+---+---+---+---+
a | 5 | 2 | 1 | 7 | 4 |
  +---+---+---+---+---+

Then the rank of a[0] (value 5) is three, because a[] contains three values that are less than 5. Suppose we compute the array rank[] so that, for each i, rank[i] contains the rank of a[i]. We get

       0   1   2   3   4
     +---+---+---+---+---+
rank | 3 | 1 | 0 | 4 | 2 |
     +---+---+---+---+---+

Note: In order to simplify the discussion (and the program), we will assume that no two array elements contain the same value. Otherwise, to make things work out smoothly we would define the rank of a[i] to be the number of elements in a[] that are less than a[i] plus the number of elements in a[0..i) that are equal to a[i]. End of note.

A straightforward sequential algorithm by which to do this computation is as follows:

Computing the rank[] array
// loop invariant: rank[0..i) is filled with the ranks of a[0..i)
for (int i=0;  i != N;  i = i+1)  {
   rank[i] = rankOf(a, i);
}

// Returns the rank of the element at the given location (k)
// of the given array.
int rankOf(int[] ary, int k) {
   int cntr = 0;
   // loop invariant: cntr = rank of ary[k] with respect to ary[0..j)
   for (int j=0;  j != ary.length;  j = j+1)  {
      if (ary[j] < ary[k])  {
         cntr = cntr + 1;
      }
   }
   return cntr;
}

Its running time is O(N²), due to the fact that the total # of iterations of the loop in the rankOf() method is precisely N². (During each of the N iterations of the main loop, rankOf() is called, resulting in its loop iterating N times.)

Having computed rank[], it is easy to copy the elements of a[] into another array b[] so that b[] is in ascending order: for each i, copy a[i] into location rank[i] of b[] (i.e., into b[ rank[i] ]). (This is based on the observation that, if we sort a collection of values, each one will end up at the location corresponding to its rank!) This is accomplished via the following code:

Computing array b[]
for (int i=0;  i != N;  i = i+1) 
   { b[ rank[i] ] = a[i]; }

Clearly, this takes O(N) time. Thus, in total, RankSort takes O(N²) time, as its running time is dominated by the time used in computing the rank[] array. (Note: As the reader may have recognized, it is not really necessary to construct the rank[] array. Rather, we could simply replace the statement rank[i] = rankOf(a, i) by b[rankOf(a, i)] = a[i]. This doesn't change the (asymptotic) running time, however.)

Now consider how we could perform RankSort in parallel. Suppose that we had N processors. In parallel, each processor could compute the rank of the corresponding element of a[]. That is, for each i, processor i could compute rank[i]. This would take linear (i.e., O(N)) time, after which, for each i, processor i could copy a[i] into the correct location of b[] (namely, location rank[i]) in constant time. In total, the program would take O(N) time. Multiplying that by the number of processors, we get that its cost is O(N²), corresponding to the cost of the sequential version of the algorithm. The code is as follows:

for all i in [0..N) in parallel  {
   rank[i] = rankOf(a, i);
   b[ rank[i] ] = a[i];
}
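
In Java, this might be approximated as follows, using the rankOf() method given earlier (the writes into b[] cannot collide, because all ranks are distinct):

import java.util.stream.IntStream;

// Sketch: N logical processors; processor i ranks a[i] and places
// it directly into its sorted position in b[]. Assumes a static
// version of rankOf() as defined above.
static int[] rankSortParallel(int[] a) {
   final int[] b = new int[a.length];
   IntStream.range(0, a.length).parallel()
            .forEach(i -> b[rankOf(a, i)] = a[i]);
   return b;
}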

Could we make the program any faster? As the underlying sequential algorithm's cost is O(N²), there is no way to improve upon O(N) time without using more than O(N) processors. Consider the use of N² processors! There are N² distinct ordered pairs (p,q), where 0≤p,q<N. We assign one processor to each such pair; its job is to compare a[p] against a[q] and to write either 0 or 1 into rank2[p][q], according to whether a[p] ≤ a[q] or a[p] > a[q], respectively. All processors would do this simultaneously. Thus, in constant time, N² processors would fill a two-dimensional array rank2[][] such that, for all i and j satisfying 0≤i,j<N,

rank2[i][j] = { 0  if a[i] ≤ a[j]
              { 1  if a[i] > a[j]

Note: Recall that, for the sake of simplicity, we are assuming that a[i] = a[j] only in the case i = j. If we allowed for the possibility of two locations containing the same value, then 1 should be placed in rank2[i][j] in the case that both a[i] = a[j] and j<i. End of note.

For example, if a[] = < 3, 8, 1, 6, 5 >, we would get

rank2[][] (example)
    0   1   2   3   4
  +---+---+---+---+---+
0 | 0 | 0 | 1 | 0 | 0 |
1 | 1 | 0 | 1 | 1 | 1 |
2 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 0 | 1 | 0 | 1 |
4 | 1 | 0 | 1 | 0 | 0 |
  +---+---+---+---+---+

Having filled rank2[][], to compute the rank of a[i] it suffices to calculate the sum of the elements in rank2[i][0..N) (i.e., row i of rank2[][]). This is because there will be a 1 in row i of rank2[][] for each array element smaller than a[i]. As there are N² processors, we can assign N of them to each of the N rows of rank2[][] to find the sums of those rows in O(lg N) time. (We established that in Example 2 above.) Having done that, we assign one processor to each element of a[] to copy it into its proper location in b[].

// Step 1: compute rank2[][] using N² processors in O(1) time
for all i in [0..N²) in parallel {
   p = i/N;  q = i % N;
   if (a[p] > a[q]) {
      rank2[p][q] = 1;
   } else {
      rank2[p][q] = 0;
   }
}

// Step 2: compute rank[] using N² processors in O(lg N) time
for all i in [0..N) in parallel {
   // Use processors with IDs [iN .. (i+1)N) (N of them) to sum the
   // elements in row i of rank2[][] in O(lg N) time:
   rank[i] = sum of elements in rank2[i][0..N);
}

// Step 3: copy elements of a[] into b[] using N processors in O(1) time
for all i in [0..N) in parallel {
   b[rank[i]] = a[i];
}
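
The three steps might be rendered in Java roughly as follows, reusing parallelSum from Example 2 for the row sums (the clone() keeps rank2[][] intact, since parallelSum overwrites its argument):

import java.util.stream.IntStream;

// Sketch: parallel RankSort; Step 1 performs all N² comparisons at once.
static int[] rankSort2(int[] a) {
   final int N = a.length;
   final int[][] rank2 = new int[N][N];
   final int[] rank = new int[N];
   final int[] b = new int[N];
   // Step 1: one logical processor per ordered pair (p,q).
   IntStream.range(0, N * N).parallel().forEach(i -> {
      int p = i / N,  q = i % N;
      rank2[p][q] = (a[p] > a[q]) ? 1 : 0;
   });
   // Step 2: rank[i] = sum of row i of rank2[][] (tree-sum per row).
   IntStream.range(0, N).parallel()
            .forEach(i -> rank[i] = parallelSum(rank2[i].clone()));
   // Step 3: place each element at the position given by its rank.
   IntStream.range(0, N).parallel().forEach(i -> b[rank[i]] = a[i]);
   return b;
}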

The running time is dominated by Step 2, which takes O(lg N) time. Hence, using N² processors, we can achieve O(lg N) running time, yielding a cost of O(N² · lg N). Ah, but this cost is greater, by the factor lg N, than the cost of the sequential algorithm. Can we reduce the number of processors by a factor of lg N —without increasing the (asymptotic) running time— as we did in Example 2 above, in order to make the parallel algorithm's cost match that of the sequential algorithm?

Yes! Here is how: In Step 1, have each of the N²/lg N processors fill lg N of the entries of rank2[][], rather than only one entry. This takes O(lg N) time. Then in Step 2, assign N/lg N processors to each of the N rows. As we saw at the end of the presentation of Example 2, each row can still be summed in O(lg N) time. As before, Step 3, in which the elements of a[] are copied into their proper locations in b[], takes constant time using only N of the processors.
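
As an illustration of the modified Step 1, here is a hypothetical Java sketch in which each of roughly N²/lg N logical processors fills a block of about lg N entries of rank2[][], taken in row-major order:

import java.util.stream.IntStream;

// Sketch: Step 1 with ~N²/lg N logical processors, each filling ~lg N
// entries of rank2[][], in O(lg N) time per processor.
static void fillRank2(int[] a, int[][] rank2) {
   final int N = a.length;
   final int L = Math.max(1, 31 - Integer.numberOfLeadingZeros(N));  // ~lg N
   final int P = (N * N + L - 1) / L;  // number of logical processors
   IntStream.range(0, P).parallel().forEach(proc -> {
      for (int t = proc * L;  t < Math.min((proc + 1) * L, N * N);  t++) {
         int p = t / N,  q = t % N;
         rank2[p][q] = (a[p] > a[q]) ? 1 : 0;
      }
   });
}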