The most successful parallel computers presently available are built from a number of largely autonomous processors, each possessing its own local memory area (Figure 3). A processor can access values in its local memory very rapidly. It can also access values in another processor's memory, either by special machine instructions, or through message-passing software. But with current technology remote memory accesses are much slower (often by orders of magnitude) than local ones. This puts an onus on the programmer or compiler to minimise the number of accesses any processor makes to non-local memory. This is the communication problem.
A second problem is ensuring that each processor has a fair share of the total work load. Obviously, if most work ends up being done by one processor the benefits of parallelism are lost. This is the load-balancing problem.
High Performance Fortran allows the programmer to add various directives to a program. These tell the compiler how program data is to be distributed amongst the memory areas associated with a set of (real or virtual) processors. They don't allow the programmer to state directly which processor will perform a particular computation. But it is expected that if the operands of a particular sub-computation (for example, an assignment) are all found on the same processor, the compiler will allocate that part of the computation to the processor holding the operands, whereupon no remote memory accesses will be involved.
Hence the main burden on the HPF programmer is to distribute the program's data across the processors in such a way that
To make these points a bit more concrete, imagine an in idealised
situation where the a particular step in a program contains
sub-calculations which can proceed in parallel. They might be the
elemental assignments making up an array assignment or
FORALL construct. For the sake of argument, suppose each
expression to be computed involves two operands. Together with the
variable being assigned, this makes three operands for the whole
sub-calculation. Also, suppose there are
processors available.
(In general, the numbers of sub-calculations and processors are likely
to be different--this case is just illustrative).
Figure 4 illustrates the situation where all operands of each sub-calculation reside in different memory areas. This means that each assignment involves at least two communications, wherever the calculation is performed.
Figure 5 illustrates a situation in which no communication is necessary, because all operands of all the assignments live on the same processor. But in this case the calculation will generally be performed where the data is, so no effective parallelism occurs. (Alternatively, the compiler might decide to share the tasks out anyway. But then all operands of tasks on processors other than the first would have to be communicated.)
Figure 6 illustrates the ideal situation, where all operands of an individual assignment occur on the same processor, but the operand groups of the sub-calculations are uniformly distributed over processors. In this case we can rely on the compiler to allocate each computation to the processor holding the operands, requiring no communication, but perfect distribution of workload.
Except in the most trivial programs, it will never be possible to choose a distribution of program variables over processors so that all steps in the calculation look like figure 6. The art of distributed memory programming is to arrange things so that the most critical regions of the program look as much like this as possible.
There was a certain awkwardness to the descriptions of these three situations. In the last case one could generally rely on the compiler to allocate sub-calculations to processors in the obvious way. In the first two cases one has to hedge one's bets, and say that compiler might allocate computations over processors in various ways, but in any case the outcome is likely to be unfavourable. Given that HPF places an onus on the programmer to think about these kinds of issues, one can argue that it would have been appropriate to allow the programmer fuller control over where particular sub-computations were performed. Some of these issues were addressed in HPF 2.0 by a new ON directive.