For the most straightforward, block-based distribution formats--these include arrays parameterized BlockRange, ExtBlockRange and IrregRange, there is a fairly straightforward optimization. We can do prefix computation within individual blocks, then do a global exclusive prefix combining the sums of the blocks. Finally we add the global prefix back to the incomplete prefixes within the blocks. This is illustrated in Figures 7.17, 7.18.
This version has a hope to be reasonably efficient if
, so
arithmetic costs have a chance to dominate communication cost. It still
does about
total additions operations instead the
operations for
the sequential code. But in principle if
it should be possible
to compensate for this factor, and gain some parallel speedup.
The code given here will work for the ranges mentioned above, and subranges of these7.5. It will not work for cyclic distributions.