Having seen how to define one or more target processor arrangements, we need to introduce mechanisms for distributing data arrays over those arrangements. HPF allows arrays to be distributed over processors directly (see Section 4.5), but it is often more satisfactory to go through the intermediary of an explicit template. HPF templates are used in ways reminiscent of the implicit VP set of CM Fortran or the shape of C*. In the MIMD world anticipated by HPF, a template is distinct from a processor arrangement. The set of abstract processors in an HPF processor arrangement might not exactly match the set of physical processors, but there is a tacit assumption that abstract processors will be used at a similar level of granularity to the physical processors. Usually it would be inappropriate for the shape parameters of the abstract processor arrangement to correspond to those of the data arrays of the algorithm. Instead the fine-grained grid of the data arrays is captured in the template concept.
Figure 7 represents the HPF data mapping scheme: the rest of this section is concerned with the bottom half of the diagram. Mapping of arrays to templates will be discussed in the next section.
A template can be declared in much the same way as a processor arrangement. The directive

!HPF$ TEMPLATE T(50, 50, 50)

declares a 50 by 50 by 50 three-dimensional template called T. Having declared it, we usually want to establish a relation between the template and some processor arrangement: we want to say, in more or less detail, how the elements of the template are distributed amongst the elements of the processor arrangement. This is done using the DISTRIBUTE directive.
As a first example, suppose we have

!HPF$ PROCESSORS P1(4)
!HPF$ TEMPLATE T1(17)

There are various ways in which T1 may be distributed over P1. The four basic distribution formats are illustrated in figure 8.
Simple block distribution is specified by

!HPF$ DISTRIBUTE T1(BLOCK) ONTO P1

In this case each processor gets a contiguous block of template elements. All processors get the same-sized block unless the number of processors does not evenly divide the number of template elements; in that case the template elements are divided evenly over most of the processors, with the trailing processor(s) having fewer (or none).
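As an informal illustration (Python rather than HPF, and using 0-based indices purely for exposition), the element-to-processor map implied by the default BLOCK format can be sketched as:

```python
from math import ceil

def block_owner(i, n, p):
    """Owner of template element i (0-based) when n elements are
    BLOCK-distributed over p processors: contiguous blocks of size
    ceil(n/p), with any shortfall falling on the trailing processors."""
    return i // ceil(n / p)

# T1(17) onto P1(4): blocks of 5, 5, 5 and a short trailing block of 2
print([block_owner(i, 17, 4) for i in range(17)])
# → [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3]
```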
Simple cyclic distribution is specified by

!HPF$ DISTRIBUTE T1(CYCLIC) ONTO P1

The first processor gets the first template element, the second gets the second, and so on. When the set of processors is exhausted, allocation returns to the first processor and continues from there.
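In the same illustrative Python style (0-based indices, not HPF itself), the CYCLIC map deals elements out round-robin:

```python
def cyclic_owner(i, p):
    """Owner of template element i (0-based) when the elements are
    CYCLIC-distributed over p processors: dealt out round-robin."""
    return i % p

# T1(17) onto P1(4): 0, 1, 2, 3, 0, 1, 2, 3, ... wrapping round
print([cyclic_owner(i, 4) for i in range(17)])
# → [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0]
```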
In a variant of the block distribution, the number of template elements allocated to each processor can be specified explicitly, as in

!HPF$ DISTRIBUTE T1 (BLOCK (6)) ONTO P1

If this means that all template elements are allocated before the processors are exhausted, some processors are left empty. It is illegal to choose a block size that causes template elements to be left over after all processors have received their blocks. But in an analogous variant of the cyclic distribution (``block-cyclic distribution'')

!HPF$ DISTRIBUTE T1 (CYCLIC (3)) ONTO P1

the product of the number of processors and the block size can be smaller than the template size, and allocation wraps round after the first assignment of blocks to all the processors.
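Continuing the illustrative Python sketch (0-based indices), the block-cyclic map deals out whole blocks of size k round-robin:

```python
def block_cyclic_owner(i, k, p):
    """Owner of template element i (0-based) under a block-cyclic
    distribution with block size k over p processors: whole blocks
    are dealt round-robin, wrapping when processors are exhausted."""
    return (i // k) % p

# T1(17) onto P1(4) with block size 3: 4 * 3 = 12 < 17, so the
# remaining blocks wrap round to processors 0 and 1
print([block_cyclic_owner(i, 3, 4) for i in range(17)])
# → [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0, 0, 0, 1, 1]
```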
That covers the case where both template and processor arrangement are one-dimensional. When the template and the processor arrangement both have the same higher rank, each dimension can be distributed independently, mixing any of the four distribution formats. The correspondence between the template dimensions and the processor dimensions is the obvious one. In
!HPF$ PROCESSORS P2 (4, 3)
!HPF$ TEMPLATE T2 (17, 20)
!HPF$ DISTRIBUTE T2 (CYCLIC, BLOCK) ONTO P2

the first dimension of T2 is distributed cyclically over the first dimension of P2; the second dimension is distributed blockwise over the second dimension of P2.
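Because the two dimensions are mapped independently, the combined owner is just the pair of one-dimensional maps. A hedged Python sketch, again with 0-based indices:

```python
from math import ceil

def owner_2d(i, j):
    """Owner (as a coordinate in the 4 x 3 arrangement P2) of template
    element T2(i, j), 0-based: dimension 1 is CYCLIC over 4 processors,
    dimension 2 is BLOCK over 3 processors (block size ceil(20/3) = 7)."""
    return (i % 4, j // ceil(20 / 3))

print(owner_2d(0, 0))   # → (0, 0)
print(owner_2d(5, 19))  # → (1, 2): row 5 wraps round to processor row 1,
                        #   column 19 falls in the last column block
```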
Finally, some dimensions of a template may have ``collapsed distributions'', allowing a template to be distributed onto a processor arrangement with fewer dimensions than the template. So
!HPF$ DISTRIBUTE T2 (BLOCK, *) ONTO P1

means that the first dimension of T2 will be distributed over P1, just as T1 was in the first example above. But for a fixed value of the first index of T2, all values of the second subscript are mapped to the same processor.
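In the same illustrative style (Python, 0-based indices), a collapsed dimension simply means the owner ignores that subscript; the block size 5 is ceil(17/4), as in the one-dimensional BLOCK example:

```python
from math import ceil

def collapsed_owner(i, j):
    """Owner in P1(4) of T2(i, j) under (BLOCK, *): the first dimension
    is BLOCK-distributed as before, while the collapsed second
    dimension plays no part in the mapping."""
    return i // ceil(17 / 4)

# every element of template row 16 lives on the same (last) processor
print({collapsed_owner(16, j) for j in range(20)})
# → {3}
```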