In this section we will give a detailed example of how dimension splitting can be combined with the new inquiries on the Range class to write optimized parallel code that works for input arrays with general distribution formats.
A prefix computation (also sometimes called a ``scan'' operation) takes an array as input, and outputs an array containing the set of partial sums of the elements of that array. If the input array is a and the result array is r, we want the outcome to be:
We can give a straightforward data-parallel algorithm for this computation using a doubling technique. Possible code is given in Figure 7.15. The algorithm is very simple. In a given iteration, the current value in the result array is shifted by an amount that doubles between iterations. The shifted array is added into the current array. The algorithm takes iterations to complete. Its operation is illustrated in Figure 7.16.
As often happens, the pure data parallel program is concise and quite readable. However it isn't very efficient. It requires a total of floating point additions, whereas the naive sequential algorithm only needs . So we can only expect useful parallel speedup if . The pure data parallel version needs optimization.