next up previous contents
Next: Subranges and restricted groups Up: Array Sections Previous: Cholesky decomposition   Contents

Matrix multiplication with reduced memory

One disadvantage of the program in Figure 3.15 is that it allocates two very large temporary arrays, ta and tb. Because these are both replicated in one dimension, they can easily consume more memory than the original data. This problem can be solved by only storing copies of elements in restricted bands of the original matrices at any one time. Figure 4.5 gives a modified algorithm where the maximum band width is B.

Figure 4.5: Matrix multiplication with reduced memory requirement.
\begin{figure}\small\begin{verbatim}void matmul(float [[,]] c, float [[,]] a,...
...
c [i, j] += ta [i, k] * tb [k, j] ;
}
}\end{verbatim}\normalsize\end{figure}

For simplicity we assumed here that B is a compile-time constant. Alternatively we can compute this value dynamically. The volume() method on Range is used internally by array constructors to control allocation of memory for array elements. It defines the largest block of locations for the current range held by any processor. Hence an upper bound on the number of elements held by any processor for ta and tb combined is

    B * x.volume() + B * y.volume()
If MAX_TEMPORARY_SIZE is a constant defining a limit on the total volume of memory we ever wish to allocate for temporary arrays, a suitable formula for B might be
    B = MAX_TEMPORARY_SIZE / (x.volume() + y.volume())
With a few refinements like this, the algorithm of Figure 4.5 is a credible basis for a library matrix multiplication routine, applicable to generic distributed arrays.


next up previous contents
Next: Subranges and restricted groups Up: Array Sections Previous: Cholesky decomposition   Contents
Bryan Carpenter 2002-07-12