For the most straightforward, block-based distribution formats--these include arrays parameterized BlockRange, ExtBlockRange and IrregRange, there is a fairly straightforward optimization. We can do prefix computation within individual blocks, then do a global exclusive prefix combining the sums of the blocks. Finally we add the global prefix back to the incomplete prefixes within the blocks. This is illustrated in Figures 7.17, 7.18.
This version has a hope to be reasonably efficient if , so arithmetic costs have a chance to dominate communication cost. It still does about total additions operations instead the operations for the sequential code. But in principle if it should be possible to compensate for this factor, and gain some parallel speedup.
The code given here will work for the ranges mentioned above, and subranges of these^{7.5}. It will not work for cyclic distributions.