It seems difficult to give a truly efficient parallel prefix algorithm when the data has cyclic distribution format. It looks like an IO bound problem. It is certainly possible to find optimized schemes that are better than our naive data parallel versionfor example reducing the number of communication operations.
In general, even if an operation like this cannot be implemented with really good arithmetic performance, it may still be worth improving it. The operation in question might be a necessary step in larger program. So it may be worth our while to optimize the procedure so that it does not form a bottleneck in the larger context, even if we can't make it really efficient as an arithmetic operation in its own right.
An improved scheme is illustrated in Figure 7.19. To make the example a bit more interesting, it covers the case of an array with a nontrivial (stride 2) alignment to a cyclic range.

The input array is copied to a temporary (dimension split) array ts, with zeros in positions for which there is no corresponding element of a. Then the global prefixes are formed across individual rows (as drawn here). With the the blocks (columns) treated as vectors, we would only need shifttype operations to do this, if we used a naive doubling algorithm for this stage.
An exclusive prefix of the final columnthe row sumsis broadcast to all processes. This exclusive prefix is added to the incomplete prefixes across the rows, computed previously. The results are copied back to a.
It is straightforward enough to implement this scheme in HPJava using dimension splitting. But the improvement in efficiency is probably not dramatic, and we will save space here by omitting the specialized code. (To tell the truth the ``naive'' data parallel version may not be much worse in practice.)