next up previous contents index
Next: Optimization for ``general'' distribution Up: An extended example: prefix Previous: Optimization for block distribution   Contents   Index

Cyclic distribution formats

It seems difficult to give a truly efficient parallel prefix algorithm when the data has cyclic distribution format. It looks like an IO bound problem. It is certainly possible to find optimized schemes that are better than our naive data parallel version--for example reducing the number of communication operations.

In general, even if an operation like this cannot be implemented with really good arithmetic performance, it may still be worth improving it. The operation in question might be a necessary step in larger program. So it may be worth our while to optimize the procedure so that it does not form a bottleneck in the larger context, even if we can't make it really efficient as an arithmetic operation in its own right.

An improved scheme is illustrated in Figure 7.19. To make the example a bit more interesting, it covers the case of an array with a non-trivial (stride 2) alignment to a cyclic range.

Figure: Possible optimization of prefix for cyclic distributions: illustration for stride 2 subrange of cyclic range with extent 15. $\mbox{\tt P} = 5$.
\includegraphics[width=5in]{cyclicPrefix.eps}

The input array is copied to a temporary (dimension split) array ts, with zeros in positions for which there is no corresponding element of a. Then the global prefixes are formed across individual rows (as drawn here). With the the blocks (columns) treated as vectors, we would only need $\log_2(\mbox{\tt P})$ shift-type operations to do this, if we used a naive doubling algorithm for this stage.

An exclusive prefix of the final column--the row sums--is broadcast to all processes. This exclusive prefix is added to the incomplete prefixes across the rows, computed previously. The results are copied back to a.

It is straightforward enough to implement this scheme in HPJava using dimension splitting. But the improvement in efficiency is probably not dramatic, and we will save space here by omitting the specialized code. (To tell the truth the ``naive'' data parallel version may not be much worse in practice.)


next up previous contents index
Next: Optimization for ``general'' distribution Up: An extended example: prefix Previous: Optimization for block distribution   Contents   Index
Bryan Carpenter 2003-04-15