next up previous contents index
Next: Optimization for block distribution Up: Low level SPMD programming Previous: Local arrays   Contents   Index


An extended example: prefix computation

In this section we will give a detailed example of how dimension splitting can be combined with the new inquiries on the Range class to write optimized parallel code that works for input arrays with general distribution formats.

A prefix computation (also sometimes called a ``scan'' operation) takes an array as input and outputs an array containing the partial sums of the elements of the input. If the input array is a and the result array is r, we want the outcome to be:

\begin{eqnarray*}
\mbox{\tt r [} i \mbox{\tt ]} & = &
\mbox{\tt a [} 0 \mbox{\tt ]} + \ldots + \mbox{\tt a [} i \mbox{\tt ]}
; \quad 0 \le i < N
\end{eqnarray*}

This is the inclusive form of prefix computation. There is also an exclusive form, defined by

\begin{eqnarray*}
\mbox{\tt r [} 0 \mbox{\tt ]} & = & 0 \\
\mbox{\tt r [} i \mbox{\tt ]} & = &
\mbox{\tt a [} 0 \mbox{\tt ]} + \ldots + \mbox{\tt a [} i - 1 \mbox{\tt ]}
; \quad 1 \le i < N
\end{eqnarray*}

We will be interested in computing the inclusive form, but sometimes the exclusive form is needed in intermediate steps.
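To make the two forms concrete, here is a minimal sequential sketch in plain Java (not HPJava; the class and method names are our own, not from the text). Each routine uses only on the order of $N$ additions, which is the sequential cost the parallel algorithm will be compared against:

```java
// Sequential reference versions of the inclusive and exclusive prefix sums.
public class PrefixForms {
    // Inclusive: r[i] = a[0] + ... + a[i]
    static float[] inclusive(float[] a) {
        float[] r = new float[a.length];
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];      // running sum includes a[i] itself
            r[i] = sum;
        }
        return r;
    }

    // Exclusive: r[0] = 0, r[i] = a[0] + ... + a[i-1]
    static float[] exclusive(float[] a) {
        float[] r = new float[a.length];
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            r[i] = sum;       // store sum of elements strictly before i
            sum += a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};
        System.out.println(java.util.Arrays.toString(inclusive(a)));  // [1.0, 3.0, 6.0, 10.0]
        System.out.println(java.util.Arrays.toString(exclusive(a)));  // [0.0, 1.0, 3.0, 6.0]
    }
}
```

Note that the exclusive result is just the inclusive result shifted right by one position, with a zero shifted in; this relationship is what makes the exclusive form useful in intermediate steps.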

We can give a straightforward data-parallel algorithm for this computation using a doubling technique. Possible code is given in Figure 7.15. The algorithm is very simple. In each iteration the current result array is shifted by a distance that doubles from one iteration to the next, and the shifted array is added element-wise into the current array. The algorithm takes $\log_2(N)$ iterations to complete. Its operation is illustrated in Figure 7.16.

Figure 7.15: Simple doubling algorithm for parallel prefix

\begin{displaymath}
\begin{minipage}[t]{\linewidth}\small\begin{verbatim}static void prefix(float [[-]] a) {
  Range x = a.rng(0) ;
  int N = x.size() ;
  float [[-]] t = new float [[x]] ;
  for(int s = 1 ; s < N ; s *= 2) {
    Adlib.shift(t, a, s) ;
    overall(i = x for s : N - 1)
      a [i] += t [i] ;
  }
}\end{verbatim}\end{minipage}\end{displaymath}

Figure 7.16: Illustration of doubling algorithm for parallel prefix for N = 7.
\includegraphics[width=5in]{doublingPrefix.eps}

As often happens, the pure data-parallel program is concise and quite readable. However, it isn't very efficient: it performs a total of about $N \times \log_2(N)$ floating point additions, whereas the naive sequential algorithm needs only $N$. So we can only expect useful parallel speedup if the number of processors satisfies $P \gg \log_2(N)$. The pure data-parallel version needs optimization.
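As a cross-check on the operation count, the doubling scheme of Figure 7.15 can be emulated in plain sequential Java (the class and variable names here are ours; the temporary array t models the data brought in by the shift operation):

```java
// Sequential emulation of the doubling (shift-and-add) prefix algorithm.
// The shift distance s doubles each iteration, so log2(N) iterations
// suffice; each iteration does roughly N additions, giving the
// N * log2(N) total cost discussed in the text.
public class DoublingPrefix {
    static float[] prefix(float[] a) {
        int n = a.length;
        float[] r = a.clone();
        for (int s = 1; s < n; s *= 2) {
            float[] t = new float[n];
            // Shift r right by s: models the communication step.
            for (int i = s; i < n; i++)
                t[i] = r[i - s];
            // Add the shifted copy back in; this plays the role of
            //   overall(i = x for s : N - 1)  a [i] += t [i] ;
            for (int i = s; i < n; i++)
                r[i] += t[i];
        }
        return r;
    }

    public static void main(String[] args) {
        // N = 7 as in Figure 7.16: prefix of all-ones is 1, 2, ..., 7.
        float[] a = {1, 1, 1, 1, 1, 1, 1};
        System.out.println(java.util.Arrays.toString(prefix(a)));  // [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    }
}
```

Running this for $N = 7$ reproduces the behaviour pictured in Figure 7.16, and instrumenting the inner loops confirms the extra arithmetic relative to the $N$-addition sequential version.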



Bryan Carpenter 2003-04-15