One disadvantage of the program in Figure 3.18
of the last section
is that it allocates two very large temporary arrays, `ta` and `tb`. Because these are both replicated in one dimension, they can
easily consume more memory than the original arguments. This problem can be
solved by storing copies of elements only in restricted bands of the
original matrices at any one time. Figure 4.5
gives a modified algorithm where the maximum band width is `B`.

For simplicity we assumed here that
`B` is a compile-time constant. Alternatively we can compute
this value dynamically.
The `volume()` method on `Range` is used internally by array
constructors to control allocation of memory for array elements. It defines
the largest block of locations of the current range held by any processor.
Hence an upper bound on the number of elements held by any processor
for `ta` and `tb` combined is

If

With a few refinements like this, the algorithm of Figure 4.5 becomes a reasonable basis for a library matrix multiplication routine, applicable to generic distributed arrays.