next up previous contents
Next: Other alignment options. Up: High Performance Fortran Previous: The Template and distribution   Contents


Data alignment

Arrays are aligned to templates through the ALIGN directive. Before describing this directive, we introduce a motivating example. The core of an LU decomposition subroutine might look like this:

      REAL A (N, N)
      INTEGER N, R, R1
      REAL, DIMENSION (N) :: L_COL, U_ROW

      DO R = 1, N - 1
        R1 = R + 1
        L_COL (R  : ) = A (R : , R)
        A (R , R1 : ) = A (R, R1 : ) / L_COL (R)
        U_ROW (R1 : ) = A (R, R1 : )
        FORALL (I = R1 : N, J = R1 : N)
     &    A (I, J) = A (I, J) - L_COL (I) * U_ROW (J)
      ENDDO

A cursory inspection of this algorithm should suggest that a possible choice of template for the problem is

!HPF$ TEMPLATE T(N, N)
This template exactly matches the main data structure of the problem, namely, the array A which holds the matrix. To align A to T we can add an ALIGN attribute declaration for A as follows
!HPF$ ALIGN A(I, J) WITH T(I, J)
This is almost identical to the ALIGN directive of CM Fortran. I and J act as alignment dummies. An alignment dummy is a named integer variable appearing as a subscript of the ``alignee'' (the array which is to be aligned) in an ALIGN directives. The idea is that the variable ranges over all possible index values associated with the array dimension involved. The ``align target'' (in this case the template T) can have expressions involving the alignment dummies amongst its subscripts. In this way, every element of the alignee is mapped to some element of the template.

How should the other arrays in the program, L_COL and U_ROW be mapped? Most of the work in a single iteration is in the statement

      A (I, J) = A (I, J) - L_COL (I) * U_ROW (J)
We can minimise (indeed, eliminate) the need for communication in this assignment if L_COL (I) is stored wherever A (I, J), for any J is stored, and if U_ROW (J) is stored wherever A (I, J) for any J is stored. This can be specified by using a replicated alignment to the template T:
!HPF$ ALIGN L_COL (I) WITH T (I, *)
!HPF$ ALIGN U_ROW (J) WITH T (*, J)
where an asterisk as a template subscript means that data is replicated in the corresponding processor dimension. Figure 9 is an attempt to visualise the alignment of the three arrays and the template.
Figure 9: Alignment of the three arrays in the LU decomposition example
\begin{figure}\centerline{\psfig{figure=replicated-example.eps,height=2.5in}}\end{figure}

As claimed, the main assignment in the FORALL construct will require no communications, because all operands of each elemental assignment will be stored on the same processor. What about the other statements?

      A (R , R1 : ) = A (R, R1 : ) / L_COL (R)
requires no communication. It is equivalent to
      FORALL (J = R1 : N)  A (R, J) = A (R, J) / L_COL (R)
and we know that L_COL (R) will be stored on any processor where A (R, J) is stored.

The other two array assignments do require communication. For example, the assignment to L_COL is equivalent to

      FORALL (I = R : N)  L_COL (I) = A (I, R)
Because L_COL (I) is replicated in the J direction, whereas A (I, R) is held only on the processor holding the $J = R$ template element, updating the L_COL element involves broadcasting the A element to all interested parties. The compiler will duly insert these communications. Similarly, the U_ROW assignment involves broadcasts, this time in the I direction.

Having chosen a plausible alignment of our arrays to a template, we have to distribute that template. BLOCK distribution isn't particularly attractive for this computation, because successive iterations work on a shrinking area of the template. A block distribution could leave whole processors idle in later iterations. A CYCLIC distribution will achieve better load balancing (Figure 10).

Figure 10: A elements involved (shaded) in iteration R=4, when N=6 and processor arrangement is 3 by 3. Block and cyclic distributions are illustrated.
\begin{figure}\centerline{\psfig{figure=block-example.eps,height=2.5in}}\vspace{0.4in}
\centerline{\psfig{figure=cyclic-example.eps,height=2.5in}}\end{figure}
Our final HPF program, distributed onto an M by M processor arrangement, is displayed in Figure 11.

Figure 11: Final HPF implementation of LU decomposition.
\begin{figure}\small\begin{verbatim}INTEGER M, N, R, R1!HPF$ PROCESSORS P (...
...) = A (I, J) - L_COL (I) * U_ROW (J)
ENDDO\end{verbatim}\normalsize\end{figure}


next up previous contents
Next: Other alignment options. Up: High Performance Fortran Previous: The Template and distribution   Contents
Bryan Carpenter 2002-07-12