Stencil updates, in which each element of an array is updated in terms of some fixed footprint or ``stencil'' of neighbouring elements, have been an important target for parallel computing. They are quite common in practice, arising for example in the solution of partial differential equations, the simulation of cellular automata, and image processing applications. Such computations are certainly amenable to massive parallelism, but they do require some communication.
These problems can be reduced to the kinds of array assignment discussed in Section 1.1. The well-known example:
FORALL (I = 2:N-1, J = 2:N-1) &
    U(I,J) = 0.25 * (U(I,J-1) + U(I,J+1) + U(I-1,J) + U(I+1,J))

can be recast as:
U(2:N-1, 2:N-1) = 0.25 * (U(2:N-1, 1:N-2) + U(2:N-1, 3:N) + &
                          U(1:N-2, 2:N-1) + U(3:N, 2:N-1))

The compiler could translate the array assignment code by introducing a series of temporary arrays T1, ..., T4, all aligned with the updated section:
T1 = U(2:N-1, 1:N-2)
T2 = U(2:N-1, 3:N)
T3 = U(1:N-2, 2:N-1)
T4 = U(3:N, 2:N-1)
U(2:N-1, 2:N-1) = 0.25 * (T1 + T2 + T3 + T4)

Because the temporary arrays are aligned with the updated section, the last statement is purely local computation; the first four assignments absorb all the communication. If the arrays here have been distributed in the BLOCK style--usually optimal for this kind of problem--these communications follow a pattern similar to Figure 1.
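The data movement behind this translation can be sketched serially in C (an illustration only: the fixed size N and the function name are assumptions, and in the distributed version the four copy loops would carry the edge communication rather than plain local copies):

```c
#define N 8  /* illustrative array extent */

/* Shifted-copy translation of the array assignment: four temporaries,
   each holding one shifted section of u, followed by a purely local
   combine. Reading all four sections into temporaries before any
   element of u is overwritten preserves the FORALL semantics. */
void stencil_with_temps(double u[N][N]) {
    static double t1[N][N], t2[N][N], t3[N][N], t4[N][N];
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++) {
            t1[i][j] = u[i][j - 1];   /* U(2:N-1, 1:N-2) */
            t2[i][j] = u[i][j + 1];   /* U(2:N-1, 3:N)   */
            t3[i][j] = u[i - 1][j];   /* U(1:N-2, 2:N-1) */
            t4[i][j] = u[i + 1][j];   /* U(3:N, 2:N-1)   */
        }
    /* With the temporaries in place this combine is purely local. */
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            u[i][j] = 0.25 * (t1[i][j] + t2[i][j] + t3[i][j] + t4[i][j]);
}
```

Note that the temporaries also give the read-all-before-write semantics of the FORALL for free.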
In terms of the volume of interprocessor communication this is a reasonable approach, but it has two serious drawbacks. First, it uses a lot of temporary storage--four whole arrays. Second, although the first four assignments involve no computation and only a modest amount of communication at the edges, they introduce a lot of extra memory-to-memory copying.
A sequential version of the original loop has good locality properties in its use of memory and cache. Because the element on the left-hand side and the operands on the right-hand side of each assignment are typically clustered together in memory (spatial locality), and because elements are often reused in consecutive iterations, or in iterations not far apart in iteration space (temporal locality), a straightforward transcription of the algorithm is likely to make good use of the cache of a modern microprocessor. But if the program is split into consecutive array assignments as indicated above, much of this locality is lost. In general, whole arrays will not fit in cache memory, so cache lines will be loaded from main memory and flushed back for each of the five loops associated with the five array assignments.
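To make the locality argument concrete, here is a C sketch of the fused form (the function name and fixed size N are illustrative assumptions, not code from the source): a single loop nest makes one pass over u, so each load of a neighbouring element usually finds its cache line already resident, whereas the five-loop translation streams each whole array through the cache once per loop.

```c
#define N 8  /* illustrative array extent */

/* Fused sequential transcription of the stencil: one pass over u,
   writing the result into a separate array v to preserve the
   read-all-before-write semantics of the FORALL. In each iteration
   three of the four operands sit on cache lines touched by recent
   iterations (spatial and temporal locality). */
void stencil_fused(const double u[N][N], double v[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            v[i][j] = 0.25 * (u[i][j - 1] + u[i][j + 1]
                            + u[i - 1][j] + u[i + 1][j]);
}
```

The arithmetic is identical to the five-assignment translation; only the traversal order, and hence the cache behaviour, differs.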
In fact this is a generic problem for the ``array syntax'' style of programming encouraged by Fortran 90. As we have seen in preceding lectures, this style evolved in large part to support a generation of SIMD and vector processors. On contemporary microprocessors, memory access costs usually dominate the costs of arithmetic. Ironically, a compiler may end up working quite hard to fuse the array assignments of a Fortran 90 program back into sequential loops that have good memory locality.
A better approach than copying whole arrays of neighbours is to allocate the local segments of the distributed array U with a small amount of extra space around the edges--so-called ghost regions. The translation of the FORALL statement above could be something like:
REAL U(0 : BLK_SIZE1 + 1, 0 : BLK_SIZE2 + 1)
REAL T(BLK_SIZE1, BLK_SIZE2)

... Update ghost regions of U with values from neighbours

DO L1 = 1, BLK_COUNT1
  DO L2 = 1, BLK_COUNT2
    T(L1, L2) = 0.25 * (U(L1, L2 - 1) + U(L1, L2 + 1) &
                      + U(L1 - 1, L2) + U(L1 + 1, L2))
  ENDDO
ENDDO

... Copy T to interior part of U (local copying)

The communications required to update the ghost regions are illustrated in Figure 3. For the sake of definiteness we will refer to the interior part of the local segment--the non-ghost part--as the physical segment. The communication pattern illustrated here updates the cells in the ghost regions with the values from the corresponding cells in the physical segments.
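The same scheme can be sketched in C (a serial illustration under stated assumptions: B1 and B2 play the roles of BLK_SIZE1 and BLK_SIZE2, the names are hypothetical, and plain arrays stand in for the messages received from neighbouring processes):

```c
#define B1 4  /* plays the role of BLK_SIZE1; illustrative */
#define B2 4  /* plays the role of BLK_SIZE2; illustrative */

/* Local segment with one ghost cell on each side, mirroring
   REAL U(0 : BLK_SIZE1 + 1, 0 : BLK_SIZE2 + 1). */
typedef struct { double u[B1 + 2][B2 + 2]; } segment;

/* Fill the four ghost strips from buffers notionally received from
   the neighbouring processes (here plain arrays; in a real code
   these would arrive in messages). */
void fill_ghosts(segment *s, const double north[B2], const double south[B2],
                 const double west[B1], const double east[B1]) {
    for (int j = 1; j <= B2; j++) {
        s->u[0][j]      = north[j - 1];  /* ghost row above */
        s->u[B1 + 1][j] = south[j - 1];  /* ghost row below */
    }
    for (int i = 1; i <= B1; i++) {
        s->u[i][0]      = west[i - 1];   /* ghost column, left edge */
        s->u[i][B2 + 1] = east[i - 1];   /* ghost column, right edge */
    }
}

/* Once the ghosts are in place, the stencil update over the physical
   segment touches only local memory, like the T loop above. */
void local_update(const segment *s, double t[B1][B2]) {
    for (int i = 1; i <= B1; i++)
        for (int j = 1; j <= B2; j++)
            t[i - 1][j - 1] = 0.25 * (s->u[i][j - 1] + s->u[i][j + 1]
                                    + s->u[i - 1][j] + s->u[i + 1][j]);
}
```

The copy loops in fill_ghosts are the serial stand-in for the communication pattern of Figure 3; everything after them is local computation.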