Data Alignment

The `ALIGN` directive aligns arrays to templates.
We consider an example. The core code of an LU decomposition
subroutine looks as follows:

    01  REAL A (N, N)
    02  INTEGER N, R, R1
    03  REAL, DIMENSION (N) :: L_COL, U_ROW
    04
    05  DO R = 1, N - 1
    06    R1 = R + 1
    07    L_COL (R : ) = A (R : , R)
    08    A (R, R1 : ) = A (R, R1 : ) / L_COL (R)
    09    U_ROW (R1 : ) = A (R, R1 : )
    10    FORALL (I = R1 : N, J = R1 : N)
    11  &   A (I, J) = A (I, J) - L_COL (I) * U_ROW (J)
    12  ENDDO
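As a cross-check of what this loop computes, here is a minimal Python sketch of the same update steps. It is an illustration only and not part of the original program: the names `lu_in_place` and `multiply_factors` are made up here, indices are 0-based, and, like the Fortran code, it does no pivoting, so it assumes every pivot is non-zero.

```python
def lu_in_place(a):
    """Mirror lines 05-12 of the Fortran loop (0-based indices, no pivoting)."""
    n = len(a)
    a = [[float(x) for x in row] for row in a]    # work on a copy
    for r in range(n - 1):
        r1 = r + 1
        l_col = [a[i][r] for i in range(n)]       # line 07 (entries i >= r are used)
        for j in range(r1, n):                    # line 08: scale row r by the pivot
            a[r][j] /= l_col[r]
        u_row = list(a[r])                        # line 09 (entries j >= r1 are used)
        for i in range(r1, n):                    # lines 10-11: update trailing block
            for j in range(r1, n):
                a[i][j] -= l_col[i] * u_row[j]
    return a


def multiply_factors(packed):
    """Rebuild L * U from the packed result.

    L is the lower triangle including the diagonal; U is the strict upper
    triangle with an implicit unit diagonal.
    """
    n = len(packed)
    product = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j) + 1):
                u_kj = 1.0 if k == j else packed[k][j]
                product[i][j] += packed[i][k] * u_kj
    return product
```

Multiplying the two factors back together should reproduce the input matrix up to rounding, which is a quick way to confirm that the loop is a factorization with a unit-diagonal upper factor.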

We first declare a template:

    !HPF$ TEMPLATE T (N, N)

The array `A`, which holds the matrix, is identically matched with this
template. In order to align `A` to `T`, we need an `ALIGN` directive like

    !HPF$ ALIGN A (I, J) WITH T (I, J)

The bulk of the work in the `DO`-loop from our
example is in the following statement, which is line 11 of the program:

    A (I, J) = A (I, J) - L_COL (I) * U_ROW (J)

Communication for this statement is avoided if copies of `L_COL (I)` and
`U_ROW (J)` are allocated wherever `A (I, J)` is allocated. The following
directives can manage this by using a replicated alignment to the
template `T`:

    !HPF$ ALIGN L_COL (I) WITH T (I, *)
    !HPF$ ALIGN U_ROW (J) WITH T (*, J)

No communication is then required in the `FORALL`
construct, since all operands of each elemental assignment will be
allocated on the same processor. Do the other statements require
communication?
Line 8 is equivalent to

    FORALL (J = R1 : N) A (R, J) = A (R, J) / L_COL (R)

Since `L_COL (R)` is available on every processor where `A (R, J)`
is allocated, this statement requires no communication.
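The two locality claims above (for line 11 and for line 8) can be checked with a small ownership model. This is a sketch under assumptions that are not in the original text: a hypothetical 2 x 3 processor grid, a template laid out cyclically in both dimensions, and made-up helper names. Each function returns the set of processors holding a copy of an element under the `ALIGN` directives given earlier.

```python
N, P, Q = 8, 2, 3    # hypothetical template extent and processor grid shape


def template_owner(i, j):
    """Processor holding template element T(i, j) (assumed cyclic layout)."""
    return ((i - 1) % P, (j - 1) % Q)


def owners_a(i, j):
    """A(I, J) WITH T(I, J): exactly one owner."""
    return {template_owner(i, j)}


def owners_l_col(i):
    """L_COL(I) WITH T(I, *): replicated over the whole second dimension."""
    return {template_owner(i, j) for j in range(1, N + 1)}


def owners_u_row(j):
    """U_ROW(J) WITH T(*, J): replicated over the whole first dimension."""
    return {template_owner(i, j) for i in range(1, N + 1)}


def line11_is_local(r):
    """True if every elemental assignment of the FORALL finds its operands locally."""
    return all(
        owners_a(i, j) <= owners_l_col(i) and owners_a(i, j) <= owners_u_row(j)
        for i in range(r + 1, N + 1)
        for j in range(r + 1, N + 1)
    )


def line8_is_local(r):
    """True if L_COL(R) is present wherever A(R, J) lives, for J = R1 .. N."""
    return all(owners_a(r, j) <= owners_l_col(r) for j in range(r + 1, N + 1))
```

Both checks hold for every `R` regardless of how the template is distributed, because the replicated alignments place a copy of each vector element on every processor that owns a matching template row or column.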
But the other two array assignment statements do require communication.
For example, the assignment to `L_COL`, which is line 7 of the program,
is equivalent to

    FORALL (I = R : N) L_COL (I) = A (I, R)

`L_COL (I)` is replicated in the `J` direction,
while `A (I, R)` is allocated only on the processor that holds the
template element where `J = R`. Updating the `L_COL` element therefore
means broadcasting the `A` element to every processor that holds a copy.
These communications will be properly inserted by the compiler.
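To make the broadcast concrete, the sketch below counts which processors must receive each `A` element in lines 7 and 9. It rests on assumptions that are not in the original text: a hypothetical 2 x 3 processor grid, a cyclic layout of the template in both dimensions, and made-up function names.

```python
N, P, Q = 8, 2, 3    # hypothetical template extent and processor grid shape


def template_owner(i, j):
    """Processor holding template element T(i, j) (assumed cyclic layout)."""
    return ((i - 1) % P, (j - 1) % Q)


def owners_l_col(i):
    """Processors holding a copy of L_COL(I), aligned with T(I, *)."""
    return {template_owner(i, j) for j in range(1, N + 1)}


def owners_u_row(j):
    """Processors holding a copy of U_ROW(J), aligned with T(*, J)."""
    return {template_owner(i, j) for i in range(1, N + 1)}


def receivers_line7(i, r):
    """Processors that must be sent A(I, R) to update their copy of L_COL(I)."""
    return owners_l_col(i) - {template_owner(i, r)}


def receivers_line9(r, j):
    """Processors that must be sent A(R, J) to update their copy of U_ROW(J)."""
    return owners_u_row(j) - {template_owner(r, j)}
```

On this grid every `L_COL` element is sent to the two other processors in its processor row, and every `U_ROW` element to the one other processor in its processor column.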
The next step is to distribute the template (we have already aligned the
arrays to it). A `BLOCK` distribution is not a good choice
for this algorithm, since successive iterations work on a shrinking
area of the template. A block distribution will therefore leave some
processors idle in later iterations. A `CYCLIC` distribution will
accomplish better load balancing.
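The load-balancing argument can be illustrated by counting idle processors per iteration. The sketch below is a simplified model, not HPF itself: it assumes a square `p` x `p` processor grid with the same distribution format in both template dimensions, a block size of `ceil(n / p)`, and hypothetical function names.

```python
from math import ceil


def block_owner(i, n, p):
    """Processor coordinate of 1-based index i under a BLOCK distribution."""
    return (i - 1) // ceil(n / p)


def cyclic_owner(i, n, p):
    """Processor coordinate of 1-based index i under a CYCLIC distribution."""
    return (i - 1) % p


def idle_counts(owner, n, p):
    """For each iteration R, count grid processors owning no element of the
    active area A(R1:N, R1:N) updated by the FORALL."""
    counts = []
    for r in range(1, n):
        active = {owner(i, n, p) for i in range(r + 1, n + 1)}
        # The active area is square, so processor (a, b) has work
        # only if both coordinates a and b are active.
        counts.append(p * p - len(active) ** 2)
    return counts
```

For `n = 8` and `p = 2`, the block distribution leaves three of the four processors idle from iteration 4 onwards, while the cyclic distribution keeps all four busy until the final iteration.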
The above example illustrated simple alignment (an "identity
mapping" of array to template) and also replicated alignments. What
would more general alignments look like?
One example is that we can transpose an array onto a template.

    DIMENSION B(N, N)
    !HPF$ ALIGN B(I, J) WITH T(J, I)

This aligns the transpose of `B` to `T` (`B (1, 2)` is aligned
to `T (2, 1)`, and so on). More generally, a subscript of an
align target can be a linear expression in one of the alignment
indices, for example:

    DIMENSION C(N / 2, N / 2)
    !HPF$ ALIGN C(I, J) WITH T(N / 2 + I, 2 * J)

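An array may also have more dimensions than the template:

    DIMENSION D(N, N, N)
    !HPF$ ALIGN D(I, J, K) WITH T(I, J)

The dimension of `D` that does not appear in the subscripts of `T`
is `K`. For fixed `I` and `J`, every element `D (I, J, K)` of the
array is mapped to the same template element.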
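The three general alignments above can be read as index maps from array subscripts to template subscripts. The sketch below writes them out in Python for a hypothetical `N = 8` (assumed even, so that `N / 2` is exact); the function names are made up for illustration.

```python
N = 8    # hypothetical template extent, assumed even


def align_b(i, j):
    """B(I, J) WITH T(J, I): transposed alignment."""
    return (j, i)


def align_c(i, j):
    """C(I, J) WITH T(N / 2 + I, 2 * J): linear subscript expressions."""
    return (N // 2 + i, 2 * j)


def align_d(i, j, k):
    """D(I, J, K) WITH T(I, J): the K dimension does not affect the target."""
    return (i, j)


def in_template(cell):
    """True if a template subscript pair lies inside T(1:N, 1:N)."""
    return all(1 <= x <= N for x in cell)
```

Every element of `C`, whose subscripts run from 1 to `N / 2`, lands inside the template, and all `N` elements `D (I, J, 1:N)` share a single template element.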
In this section, we have covered HPF's processor arrangements,
distributed arrays, and data alignment, which we will largely adopt
in the HPspmd programming model presented in chapter
4.