
Multi-processing and distributed data

The most successful parallel computers presently available are built from a number of largely autonomous processors, each possessing its own local memory area (Figure 3). A processor can access values in its local memory very rapidly. It can also access values in another processor's memory, either by special machine instructions, or through message-passing software. But with current technology remote memory accesses are much slower (often by orders of magnitude) than local ones. This puts an onus on the programmer or compiler to minimise the number of accesses any processor makes to non-local memory. This is the communication problem.

Figure 3: Idealised picture of a distributed memory parallel computer

A second problem is ensuring that each processor has a fair share of the total work load. Obviously, if most work ends up being done by one processor the benefits of parallelism are lost. This is the load-balancing problem.

High Performance Fortran allows the programmer to add various directives to a program. These tell the compiler how program data is to be distributed amongst the memory areas associated with a set of (real or virtual) processors. They don't allow the programmer to state directly which processor will perform a particular computation. But it is expected that if the operands of a particular sub-computation (for example, an assignment) are all found on the same processor, the compiler will allocate that part of the computation to the processor holding the operands, whereupon no remote memory accesses will be involved.
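
For example (a minimal sketch--the directives themselves are introduced in the following sections, and the processor and array names and sizes here are purely illustrative), a mapping of two arrays onto four abstract processors might be declared like this:

      REAL A(1000), B(1000)
!HPF$ PROCESSORS PROCS(4)
!HPF$ DISTRIBUTE A(BLOCK) ONTO PROCS
!HPF$ ALIGN B(I) WITH A(I)

Here the DISTRIBUTE directive divides A into four contiguous blocks, one per processor, and the ALIGN directive asks that each element of B be stored wherever the corresponding element of A is stored.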

Hence the main burden on the HPF programmer is to distribute the program's data across the processors in such a way that

- the operands of each individual sub-computation generally reside in the same local memory, so that little communication is needed, and
- the sub-computations themselves are spread evenly across the processors, so that the load is balanced.

These may sound like conflicting goals. Sometimes they are, and programs which look highly parallel are nevertheless difficult to distribute effectively. Luckily, equally often it is possible to perform the juggling act successfully.

To make these points a bit more concrete, imagine an idealised situation where a particular step in a program contains $p$ sub-calculations which can proceed in parallel. They might be the $p$ elemental assignments making up an array assignment or FORALL construct. For the sake of argument, suppose each expression to be computed involves two operands. Together with the variable being assigned, this makes three operands per sub-calculation. Also, suppose there are $p$ processors available. (In general the numbers of sub-calculations and processors are likely to be different--this case is just illustrative.)
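
Concretely, such a step might look like the following sketch (the names and the value of P are invented for illustration):

      INTEGER, PARAMETER :: P = 4
      INTEGER I
      REAL A(P), B(P), C(P)
!     (B and C given values here)
      FORALL (I = 1:P) A(I) = B(I) + C(I)

Each of the P elemental assignments reads two operands, B(I) and C(I), and writes a third, A(I); the assignments for different values of I are independent and could in principle proceed in parallel, one per processor.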

Figure 4: A data distribution leading to excessive communication

Figure 5: A data distribution leading to poor load balancing

Figure 6: An ideal data distribution

Figure 4 illustrates the situation where all operands of each sub-calculation reside in different memory areas. This means that each assignment involves at least two communications, wherever the calculation is performed.
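
In HPF terms this kind of situation can arise when the operands of an assignment are given unrelated distributions. A hedged sketch (the particular mappings, names and sizes below are just one way of producing the effect):

      REAL A(1000), B(1000), C(1000)
!HPF$ PROCESSORS PROCS(4)
!HPF$ DISTRIBUTE A(BLOCK)     ONTO PROCS
!HPF$ DISTRIBUTE B(CYCLIC)    ONTO PROCS
!HPF$ DISTRIBUTE C(CYCLIC(3)) ONTO PROCS
!     (B and C given values here)
      A = B + C

For most values of I the elements A(I), B(I) and C(I) end up in two or three different memory areas, so most of the elemental assignments need remote accesses wherever they are performed.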

Figure 5 illustrates a situation in which no communication is necessary, because all operands of all the assignments live on the same processor. But in this case the calculation will generally be performed where the data is, so no effective parallelism occurs. (Alternatively, the compiler might decide to share the tasks out anyway. But then all operands of tasks on processors other than the first would have to be communicated.)
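
One way of producing this situation in HPF (again a hedged sketch with invented names and sizes) is to force every block onto the same processor by giving an over-large block size:

      REAL A(1000), B(1000), C(1000)
!HPF$ PROCESSORS PROCS(4)
!HPF$ DISTRIBUTE A(BLOCK(1000)) ONTO PROCS
!HPF$ ALIGN B(I) WITH A(I)
!HPF$ ALIGN C(I) WITH A(I)
      A = B + C

With a block size of 1000 the whole of A, and hence of B and C, lands on the first processor: no communication is needed, but the other three processors have nothing to do.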

Figure 6 illustrates the ideal situation, where all operands of an individual assignment occur on the same processor, but the operand groups of the different sub-calculations are uniformly distributed over the processors. In this case we can rely on the compiler to allocate each computation to the processor holding its operands, so no communication is required and the workload is perfectly balanced.
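
In HPF the ideal case is obtained by aligning the operands with one another and distributing them evenly, for instance (a sketch under the same illustrative assumptions as before):

      REAL A(1000), B(1000), C(1000)
!HPF$ PROCESSORS PROCS(4)
!HPF$ DISTRIBUTE A(BLOCK) ONTO PROCS
!HPF$ ALIGN B(I) WITH A(I)
!HPF$ ALIGN C(I) WITH A(I)
      A = B + C

Each elemental assignment now involves only local operands, and the 1000 assignments are divided equally among the four processors.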

Except in the most trivial programs, it will never be possible to choose a distribution of program variables over processors so that all steps in the calculation look like Figure 6. The art of distributed memory programming is to arrange things so that the most critical regions of the program look as much like this as possible.

There was a certain awkwardness to the descriptions of these three situations. In the last case one could generally rely on the compiler to allocate sub-calculations to processors in the obvious way. In the first two cases one has to hedge one's bets and say that the compiler might allocate computations to processors in various ways, but that in any case the outcome is likely to be unfavourable. Given that HPF places an onus on the programmer to think about these kinds of issues, one can argue that it would have been appropriate to allow the programmer fuller control over where particular sub-computations were performed. Some of these issues were addressed in HPF 2.0 by a new ON directive.
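
For illustration only, a sketch of the kind of control the ON directive offers, following the HPF 2.0 approved-extension syntax (the loop and array names are invented; the directive asks for each iteration to execute on the processor that owns the named element):

!HPF$ INDEPENDENT
      DO I = 1, N
!HPF$    ON HOME (A(I))
         A(I) = B(I) + C(I)
      END DO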


Bryan Carpenter 2002-07-12