An extended example: prefix computation

In this section we will give a detailed example of how dimension
splitting can be combined with the new inquiries on the `Range`
class to write optimized parallel code that works for input arrays with
general distribution formats.

A prefix computation (also sometimes called a ``scan'' operation)
takes an array as input, and outputs an array containing the
set of partial sums of the elements of that array. If the input
array is `a` and the result array is `r`, we want the
outcome to be:

This is the

We will be interested in computing the inclusive form, but sometimes the exclusive form is needed in intermediate steps.

We can give a straightforward data-parallel algorithm for this
computation using a *doubling* technique. Possible code is given
in Figure 7.15. The algorithm is very simple.
In a given iteration, the current value in the result array is shifted
by an amount that doubles between iterations. The shifted array is
added into the current array. The algorithm takes
iterations to complete. Its operation is illustrated in Figure
7.16.

As often happens, the pure data parallel program is concise and quite readable. However it isn't very efficient. It requires a total of floating point additions, whereas the naive sequential algorithm only needs . So we can only expect useful parallel speedup if . The pure data parallel version needs optimization.

- Optimization for block distribution formats
- Cyclic distribution formats
- Optimization for ``general'' distribution formats