Arrays are aligned to templates through the ALIGN directive. Before describing this directive, we introduce a motivating example. The core of an LU decomposition subroutine might look like this:
REAL A (N, N) INTEGER N, R, R1 REAL, DIMENSION (N) :: L_COL, U_ROW DO R = 1, N - 1 R1 = R + 1 L_COL (R : ) = A (R : , R) A (R , R1 : ) = A (R, R1 : ) / L_COL (R) U_ROW (R1 : ) = A (R, R1 : ) FORALL (I = R1 : N, J = R1 : N) & A (I, J) = A (I, J) - L_COL (I) * U_ROW (J) ENDDO
A cursory inspection of this algorithm should suggest that a possible choice of template for the problem is
!HPF$ TEMPLATE T(N, N)This template exactly matches the main data structure of the problem, namely, the array
Awhich holds the matrix. To align
Twe can add an
ALIGNattribute declaration for
!HPF$ ALIGN A(I, J) WITH T(I, J)This is almost identical to the ALIGN directive of CM Fortran.
Jact as alignment dummies. An alignment dummy is a named integer variable appearing as a subscript of the ``alignee'' (the array which is to be aligned) in an
ALIGNdirectives. The idea is that the variable ranges over all possible index values associated with the array dimension involved. The ``align target'' (in this case the template
T) can have expressions involving the alignment dummies amongst its subscripts. In this way, every element of the alignee is mapped to some element of the template.
How should the other arrays in the program,
U_ROW be mapped? Most of the work in a single iteration is in
A (I, J) = A (I, J) - L_COL (I) * U_ROW (J)We can minimise (indeed, eliminate) the need for communication in this assignment if
L_COL (I)is stored wherever
A (I, J), for any
Jis stored, and if
U_ROW (J)is stored wherever
A (I, J)for any
Jis stored. This can be specified by using a replicated alignment to the template
!HPF$ ALIGN L_COL (I) WITH T (I, *) !HPF$ ALIGN U_ROW (J) WITH T (*, J)where an asterisk as a template subscript means that data is replicated in the corresponding processor dimension. Figure 9 is an attempt to visualise the alignment of the three arrays and the template.
As claimed, the main assignment in the
construct will require no communications, because all operands
of each elemental assignment will be stored on the same processor.
What about the other statements?
A (R , R1 : ) = A (R, R1 : ) / L_COL (R)requires no communication. It is equivalent to
FORALL (J = R1 : N) A (R, J) = A (R, J) / L_COL (R)and we know that
L_COL (R)will be stored on any processor where
A (R, J)is stored.
The other two array assignments do require communication.
For example, the assignment to
L_COL is equivalent to
FORALL (I = R : N) L_COL (I) = A (I, R)Because
L_COL (I)is replicated in the J direction, whereas
A (I, R)is held only on the processor holding the template element, updating the
L_COLelement involves broadcasting the
Aelement to all interested parties. The compiler will duly insert these communications. Similarly, the
U_ROWassignment involves broadcasts, this time in the I direction.
Having chosen a plausible alignment of our arrays to a template,
we have to distribute that template.
isn't particularly attractive for this computation, because successive
iterations work on a shrinking area of the template. A block
distribution could leave whole processors idle in later iterations.
CYCLIC distribution will achieve better load balancing (Figure
Mprocessor arrangement, is displayed in Figure 11.