Arrays are aligned to templates through the ALIGN directive. Before describing this directive, we introduce a motivating example. The core of an LU decomposition subroutine might look like this:
REAL A (N, N)
INTEGER N, R, R1
REAL, DIMENSION (N) :: L_COL, U_ROW
DO R = 1, N - 1
R1 = R + 1
L_COL (R : ) = A (R : , R)
A (R , R1 : ) = A (R, R1 : ) / L_COL (R)
U_ROW (R1 : ) = A (R, R1 : )
FORALL (I = R1 : N, J = R1 : N)
& A (I, J) = A (I, J) - L_COL (I) * U_ROW (J)
ENDDO
A cursory inspection of this algorithm should suggest that a possible choice of template for the problem is
!HPF$ TEMPLATE T(N, N)This template exactly matches the main data structure of the problem, namely, the array
A which holds the matrix. To align A
to T we can add an ALIGN attribute declaration for
A as follows
!HPF$ ALIGN A(I, J) WITH T(I, J)This is almost identical to the ALIGN directive of CM Fortran.
I and J act as alignment dummies.
An alignment dummy is a named integer variable appearing as a subscript
of the ``alignee'' (the array which is to be aligned) in an
ALIGN directives. The idea is that the variable ranges over all
possible index values associated with the array dimension involved.
The ``align target'' (in this case the template T) can have
expressions involving the alignment dummies amongst its subscripts. In this
way, every element of the alignee is mapped to some element of the
template.
How should the other arrays in the program, L_COL and
U_ROW be mapped? Most of the work in a single iteration is in
the statement
A (I, J) = A (I, J) - L_COL (I) * U_ROW (J)
We can minimise (indeed, eliminate) the need for communication
in this assignment if L_COL (I) is stored wherever
A (I, J), for any J is stored, and if
U_ROW (J) is stored wherever A (I, J) for any J
is stored. This can be specified by using a replicated
alignment to the template T:
!HPF$ ALIGN L_COL (I) WITH T (I, *) !HPF$ ALIGN U_ROW (J) WITH T (*, J)where an asterisk as a template subscript means that data is replicated in the corresponding processor dimension. Figure 9 is an attempt to visualise the alignment of the three arrays and the template.
As claimed, the main assignment in the FORALL
construct will require no communications, because all operands
of each elemental assignment will be stored on the same processor.
What about the other statements?
A (R , R1 : ) = A (R, R1 : ) / L_COL (R)
requires no communication. It is equivalent to
FORALL (J = R1 : N) A (R, J) = A (R, J) / L_COL (R)
and we know that L_COL (R) will be stored on any processor
where A (R, J) is stored.
The other two array assignments do require communication.
For example, the assignment to L_COL is equivalent to
FORALL (I = R : N) L_COL (I) = A (I, R)
Because L_COL (I) is replicated in the J direction,
whereas A (I, R)
is held only on the processor holding the L_COL element involves broadcasting the
A element to all interested parties.
The compiler will duly insert these communications.
Similarly, the U_ROW assignment
involves broadcasts, this time in the I direction.
Having chosen a plausible alignment of our arrays to a template,
we have to distribute that template. BLOCK distribution
isn't particularly attractive for this computation, because successive
iterations work on a shrinking area of the template. A block
distribution could leave whole processors idle in later iterations.
A CYCLIC distribution will achieve better load balancing (Figure
10).
![]() |
M by M
processor arrangement, is displayed in Figure 11.