Here is an exceedingly simple HPF program:
!HPF$ PROCESSORS P(4)
      REAL A(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P

      FORALL (I = 1:50) A(I) = 1.0 * I

We want to convert this data-parallel program into a program that can be executed directly on a typical MIMD computer or cluster of workstations. We assume that the target machine provides a Fortran compiler and an MPI library. A reasonable translation is given in Figure 1. It looks a little complicated, so we will break it down into its parts.
First we have a series of statements that simply set up the environment:
      INTEGER W_RANK, W_SIZE, ERRCODE
      ...
      CALL MPI_INIT(ERRCODE)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, W_SIZE, ERRCODE)
      ...
      CALL MPI_FINALIZE(ERRCODE)

So far as the present example is concerned, their only important role is to define the variables W_RANK and W_SIZE.
Secondly we have statements associated with the declaration of the distributed array:
      INTEGER BLK_SIZE
      PARAMETER (BLK_SIZE = (50 + 3) / 4)
      REAL A(BLK_SIZE)

These compute a bound on the number of elements of the distributed array held on any processor, and allocate a local array large enough to hold those elements. In the present case the size of the distributed array (50) and the number of processors over which it is distributed (4) are both known at compile-time. So the block size, BLK_SIZE, can be computed at compile-time, and the array segment can be declared statically.
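The block-size formula is just the integer-arithmetic form of the ceiling of N/P. The following Python fragment (purely illustrative, not part of the Fortran translation) sketches the computation:

```python
# Under a BLOCK distribution, each processor holds at most ceil(N/P)
# elements; in integer arithmetic this is (N + P - 1) / P, which for
# N = 50, P = 4 reproduces BLK_SIZE = (50 + 3) / 4 in the translation.

def block_size(n, p):
    """Upper bound on the number of elements of an n-element array
    held by any one of p processors under a BLOCK distribution."""
    return (n + p - 1) // p

print(block_size(50, 4))   # 13
```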
The remainder of the code is associated with translation of the FORALL statement assigning A. There is a test to see whether the local processor is a member of the processor arrangement P over which the elements of the assignment variable A are distributed:
      IF (W_RANK < 4) THEN
        ...
      ENDIF

We will always adopt the policy that an HPF processor arrangement may have any number of abstract processors up to the number of physical processors available at run-time (or, conversely, the number of physical processors provided at run-time must be at least the size of the largest processor arrangement declared in the program). This is not the only possible policy, but it is a reasonable one, and it is consistent with what the HPF standard says¹.
If the local processor is in P, we compute some parameters of the locally held subset of distributed array elements--the values BLK_START and BLK_COUNT. These characterize the position of the local segment in the global index space, and the number of elements in that segment:
      INTEGER BLK_START, BLK_COUNT
      ...
      BLK_START = W_RANK * BLK_SIZE
      IF (50 - BLK_START >= BLK_SIZE) THEN
        BLK_COUNT = BLK_SIZE
      ELSE IF (50 - BLK_START > 0) THEN
        BLK_COUNT = 50 - BLK_START
      ELSE
        BLK_COUNT = 0
      ENDIF

Finally we have the actual loop over the local elements:
      INTEGER L, I
      ...
      DO L = 1, BLK_COUNT
        I = BLK_START + L
        A(L) = 1.0 * I
      ENDDO

The global index I is computed as a linear function of the local loop index L, and the assignment to the array element is performed.
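To see that the segment parameters and the local loop together touch every global index exactly once, we can replay the same arithmetic for all four ranks. A Python sketch (again illustrative only, mirroring the Fortran above):

```python
# For each rank, compute BLK_START and BLK_COUNT exactly as in the
# translation, run the local loop, and record which global indices
# get assigned.

N, P = 50, 4
BLK_SIZE = (N + P - 1) // P              # 13, as in the PARAMETER statement

assigned = []
for w_rank in range(P):
    blk_start = w_rank * BLK_SIZE
    blk_count = max(0, min(BLK_SIZE, N - blk_start))
    for l in range(1, blk_count + 1):    # the local DO loop
        i = blk_start + l                # global index for local element l
        assigned.append(i)

# The four segments have sizes (13, 13, 13, 11) and together cover the
# global index range 1..50 exactly once.
print(assigned == list(range(1, N + 1)))   # True
```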
Of course there was quite a lot of book-keeping in the MPI translation, but nothing conceptually difficult.
Now consider this superficially similar fragment of HPF:
      SUBROUTINE INIT(A)
      REAL A(50)
!HPF$ INHERIT A
      FORALL (I = 1:50) A(I) = 1.0 * I
      END

The INHERIT directive specifies that the mapping of the dummy argument A should be the same as that of the actual argument, whatever that is. At compile-time we have no information about that mapping, other than that it is a legal HPF decomposition. If the main program were
!HPF$ PROCESSORS P(4)
      REAL B(50)
!HPF$ DISTRIBUTE B(BLOCK) ONTO P

      CALL INIT(B)

then the translated subroutine ought to behave exactly as in the previous example. But at the time INIT is translated this fact is generally not known. The subroutine may be part of a library compiled long before the calling program is written.
A few of the possibilities for mapping of the actual argument are illustrated in Figure 2. The first case has already been discussed: the array A will be divided into 4 contiguous blocks of sizes (13,13,13,11). In the second case the blocking is different--(13,13,12,12)--and the formula for computing the index value I inside the local loop will be quite different. The third case illustrates that A might actually be some section of an array. In this example the blocking is (13,12,13,12), and the translation must somehow take into account that the subscripting into the local segment of the array is strided, and that there is an offset which is sometimes zero and sometimes one. In the final case A is again a section, but now the parent array is two-dimensional, and we have to deal with the fact that A as a whole is mapped only to a portion of the processor arrangement. The physical processors associated with the bottom row of Q should do nothing.
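The first three blockings can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only: the plain BLOCK and evenly balanced formulas are standard, while for the strided-section case we assume, purely for illustration, a stride-2 section of a 99-element BLOCK-distributed parent, which happens to reproduce the (13,12,13,12) blocking (the actual layout is whatever Figure 2 shows).

```python
def block(n, p):
    """Plain BLOCK: ceil(n/p) elements per processor, remainder on the last."""
    b = (n + p - 1) // p
    return [max(0, min(b, n - r * b)) for r in range(p)]

def balanced(n, p):
    """Evenly balanced blocking: sizes differ by at most one."""
    return [n // p + (1 if r < n % p else 0) for r in range(p)]

def section_counts(n, p, lo, hi, step):
    """Elements of the section lo:hi:step (1-based, inclusive) held by
    each of p processors when the n-element parent is BLOCK distributed."""
    b = (n + p - 1) // p
    counts = [0] * p
    for i in range(lo, hi + 1, step):
        counts[(i - 1) // b] += 1    # map parent index to owning rank
    return counts

print(block(50, 4))                     # [13, 13, 13, 11]
print(balanced(50, 4))                  # [13, 13, 12, 12]
print(section_counts(99, 4, 1, 99, 2))  # [13, 12, 13, 12]
```

Notice that in the third case the counts are not even monotone in the rank: the formula for I inside the local loop cannot be the same simple linear function in all cases.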
Somehow the procedure INIT needs to be translated in such a way that it can deal with all these cases, and many others.
We have deliberately started with a very simple case. The parallel part of the program only deals with a single one-dimensional array, and it is clear that the translation requires absolutely no inter-processor communication. Nevertheless we have already stumbled into a whole wad of complexity associated with the translation of procedures.
This is not an artificial example. General-purpose libraries for dealing with distributed arrays are likely to encounter just this kind of problem. As a specific example, the array communication libraries discussed in the next lecture have exactly this requirement: they must deal with distributed arrays that have unrestricted HPF mappings, unknown at the time the library code is compiled².
This lecture mainly discusses a run-time parameterization of HPF arrays that is designed to deal with this kind of problem.