A few years ago a discussion about parallel computing might start with words to the effect that ``today it seems that serial processors are approaching their technological limits ...'', and infer that therefore parallelism would be needed imminently to reach higher performance. In the meantime Moore's law, which predicts that sequential processors will double in performance about every eighteen months, has carried on relentlessly. It is a commonplace observation that today's personal computers have as much processing power as many of the parallel supercomputers we worked on a decade or so ago.
A school of thought that was influential in the supercomputing field for some time held that the most promising parallel computers were those with a modest number of powerful processing nodes. Such a computer can provide results several times faster than its individual sequential nodes, but it cannot do much better than that. An obvious question is how many people will invest effort in parallelizing their software for this kind of platform, when they need only wait a few years for Moore's law to give them the same speedup. The answer seems to be ``not many''. As long as Moore's law operates for sequential processors it seems likely that parallelism must offer orders of magnitude improvement in performance; otherwise it is hardly worth the investment needed to exploit it. This suggests that successful parallelism had better be massive parallelism, involving many nodes approximately as powerful as, or more powerful than, contemporary PCs.
In this climate there are various places where parallel computing can be successfully applied--for example in large dedicated clusters of commodity computers (government labs, Pixar RenderFarm, ...) or by harvesting cycles of many processors over the Internet. Some of these applications most likely involve large-scale ``task farming'', applicable when a computation can be divided into a large number of completely independent computational tasks. The other prevalent form of massive parallelism is called ``data parallelism''. The term is usually reserved for the situation where a computation involves some large data structures, typically arrays, that are split across nodes. Each node performs similar computations on a different part of the data structure. The computations at each node may require some intermediate results from peer nodes. For data parallelism to work best the volume of communicated values should be small compared with the volume of locally computed results. Pretty much all successful applications of massive parallelism can be characterized as either task farming or data parallel.
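The data-parallel pattern just described can be sketched in a few lines of ordinary Python. This is purely an illustration invented for this lecture (it is not tied to any of the languages discussed later): a one-dimensional array is split into sections, one per simulated ``node'', and each node computes a nearest-neighbour average of its own section, needing only a single boundary value from each peer.

```python
# A toy illustration of data parallelism: a 1-D array is split across
# "nodes" (simulated here as chunks of a list), and each node computes
# a nearest-neighbour average of its own section.  The only communicated
# values are one boundary ("halo") element from each peer node -- small
# compared with the volume of locally computed results.

def data_parallel_smooth(data, num_nodes):
    chunk = len(data) // num_nodes
    sections = [data[i * chunk:(i + 1) * chunk] for i in range(num_nodes)]
    result = []
    for rank, local in enumerate(sections):
        # "Receive" one halo value from each neighbouring node
        # (boundary nodes just reuse their own edge element).
        left = sections[rank - 1][-1] if rank > 0 else local[0]
        right = sections[rank + 1][0] if rank < num_nodes - 1 else local[-1]
        padded = [left] + local + [right]
        # Every node performs the same computation on its own section.
        result.extend((padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
                      for i in range(1, len(padded) - 1))
    return result

print(data_parallel_smooth([0.0, 3.0, 3.0, 3.0, 3.0, 0.0], 2))
# -> [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
```

On a real MIMD machine each section would live in a separate node's memory and the halo values would travel in explicit messages; the loop over ranks here simply plays all the nodes in turn.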
In the case of task-farming the level of parallelism is normally coarse-grained. The individual task is a sequential program operating on local storage. A few parameters must be read from a remote controller at the start of the task and a few results returned at the end. This style of programming fits naturally in the framework of conventional sequential programming languages: a task can be abstracted as a function or method; the complex, problem-independent infrastructure for communication and load-balancing can be abstracted as a library.
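This abstraction--the task as an ordinary function, the farming infrastructure as a library--can be sketched with Python's standard concurrent.futures module. The task function and work list here are hypothetical examples, chosen only to make the sketch self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

# The individual task: an ordinary sequential function.  A few
# parameters go in at the start, a few results come out at the end;
# there is no communication with other tasks while it runs.
def task(params):
    lo, hi = params
    return sum(i * i for i in range(lo, hi))

# The complex, problem-independent infrastructure -- handing tasks to
# workers and collecting results -- is hidden inside the library.
work = [(0, 250), (250, 500), (500, 750), (750, 1000)]
with ThreadPoolExecutor(max_workers=4) as farm:
    results = list(farm.map(task, work))

print(sum(results))  # same answer as the purely sequential computation
```

The coarse grain of the tasks is what makes this style portable: the library can schedule and load-balance them freely because they share no state.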
In the data-parallel case the situation is slightly different. While it is possible to write data-parallel programs for modern parallel computers using ordinary sequential languages supported by communication libraries, there is a long and quite successful history of special languages for data parallel computing. Why is data-parallel programming a special case--why should it warrant its own languages?
Historically a strong motivation for the development of data parallel languages came from the association of this sort of parallelism with Single Instruction Multiple Data (SIMD) computer architectures. In a SIMD computer a single controlling processor reads the program code. The controller broadcasts every instruction to a large number of compute nodes, and each executes the same instruction on its local data. The content of an individual instruction depends on the detailed architecture, but a typical example might be an instruction to add together all locally held elements of two vectors distributed over the compute nodes, one element per node.
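The flavour of such an instruction can be mimicked in ordinary Python--a toy model written for this lecture, not a description of any real SIMD machine's instruction set. Each ``node'' object holds one element of the distributed vectors, and a loop stands in for the controller's instruction broadcast:

```python
# A toy model of one SIMD step.  Each "node" holds one element of the
# distributed vectors a and b.  The controller broadcasts a single
# "add" instruction, and every node executes the same instruction on
# its own local data.
class Node:
    def __init__(self, a_elem, b_elem):
        self.a, self.b, self.c = a_elem, b_elem, None

    def execute(self, instruction):
        if instruction == "add":
            self.c = self.a + self.b  # same operation, local operands

nodes = [Node(a, b) for a, b in zip([1, 2, 3, 4], [10, 20, 30, 40])]
for node in nodes:              # the broadcast: one instruction, all nodes
    node.execute("add")

print([node.c for node in nodes])  # -> [11, 22, 33, 44]
```

Note that the nodes never need to synchronize explicitly: in a real SIMD machine the lockstep broadcast of instructions synchronizes every step by construction.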
For much of the time SIMD computers were in vogue (roughly the 1970s to the early 90s) standard sequential languages like Fortran 77 had no features to support this model of execution. Probably the best one could do in a conventional language would be to map the special parallel instructions to individual procedure calls, but that would be tantamount to writing parallel assembly code. To make these machines programmable, it was very desirable to extend the sequential languages available at the time.
The early data parallel computer languages developed for machines like the Illiac IV and the ICL DAP were quite elegant, popular with the scientific programmers who used them, and successful in the sense that they allowed the machines to be programmed efficiently by people who would not willingly use a parallel assembly language. They introduced the new programming-language concept of a distributed or parallel array. Typically the set of semantic operations allowed on a distributed array was somewhat different to the operations allowed on a sequential array, but this wasn't really a problem because the compiler could check for abuse. A more serious problem was that each data parallel language had features tied to a particular manufacturer's parallel computer architecture. A particularly interesting series of languages (*LISP, C*, CM Fortran) was developed by Thinking Machines Corporation to support their Connection Machine series of computers.
In the 1980s and 90s microprocessors grew in power and availability, and fell in price. Building SIMD computers out of simple but specialized compute nodes gradually became less economical than putting a general purpose commodity microprocessor at every node. Eventually SIMD computers were displaced almost completely by Multiple Instruction Multiple Data (MIMD) parallel computer architectures. In a MIMD architecture every compute node directly executes its own program text. Different nodes can be executing different parts of the same program (or different programs altogether) at the same time.
The asynchronous operation of MIMD computers makes them extremely flexible. For example they are very well suited to the task-farming kind of parallelism, which is barely feasible at all on SIMD computers. But the same asynchrony makes general programming of MIMD parallel computers hard. In SIMD programming synchronization is rarely an issue--every collective step is trivially synchronized by the action of the controller. Programming a MIMD computer on the other hand typically requires some level of mastery of concurrent programming, a hard discipline. A major concern is the explicit use of synchronization primitives to control nondeterministic behaviour. Nondeterminism arises almost inevitably in a system that has multiple independent threads of control, for example through so-called race conditions. Careful synchronization can control nondeterminism. Unfortunately careless synchronization easily leads to a new problem: deadlock.
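The canonical race condition is a shared counter incremented by several threads. A minimal Python sketch (invented for this lecture) shows the synchronization primitive that tames it: without the lock, two threads can read the same old value and one update is silently lost, and the final count becomes nondeterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # "counter += 1" is really read-modify-write: without the lock,
        # a thread switch between the read and the write loses updates,
        # and the final total varies from run to run (a race condition).
        # The lock makes the step atomic and the result deterministic.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # -> 400000, deterministic only because of the lock
```

Deadlock is the dual hazard: if two threads each hold one lock and wait for the other's, neither can ever proceed--which is why the restricted synchronization patterns discussed next are so attractive.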
A common style of writing data parallel programs for MIMD computers soon emerged. This style has some similarities to programming a SIMD computer. There is no universally accepted, formal definition of this style, but it is associated with phrases like Single Program Multiple Data (SPMD) programming, and the Loosely Synchronous Model. In this style, although there is no central controller, the worker nodes carry on doing essentially the same thing at essentially the same time. Instead of central copies of control variables stored on the control processor of a SIMD computer, control variables (iteration counts and so on) are usually stored in a replicated fashion across MIMD nodes. Each node has its own local copy of these global control variables, but every node updates them in an identical way. There are no centrally issued parallel instructions, but communications usually happen in well-defined collective phases. These data exchanges occur in a rendezvous that explicitly or implicitly synchronizes the peer nodes. The situation is something like an orchestra without a conductor. There is no central control, but each individual is playing from the same script. The group as a whole stays in lockstep. This loosely synchronous style has some similarities to the Bulk Synchronous Parallel (BSP) model of computing introduced by the theorist Les Valiant in the early 1990s. The restricted pattern of collective synchronization is easier to deal with than the complex synchronization problems of general concurrent programming.
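A small Python sketch (again a hypothetical illustration, not taken from any SPMD language) captures the essentials of this style. Each thread plays the role of a node: it keeps its own replicated copy of the control variables, updates them identically, and meets its peers at a barrier at the end of each collective phase.

```python
import threading

NUM_NODES = 4
barrier = threading.Barrier(NUM_NODES)     # the collective rendezvous
totals = [0] * NUM_NODES                   # one result slot per node

def node(rank):
    # Every node runs the same program text.  The loop counter "step"
    # is a replicated control variable: each node holds its own copy
    # and every node updates it in an identical way -- there is no
    # central controller issuing instructions.
    for step in range(3):
        totals[rank] += (step + 1) * (rank + 1)   # local computation
        barrier.wait()   # end of the phase: loose synchrony restored

threads = [threading.Thread(target=node, args=(r,))
           for r in range(NUM_NODES)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(totals)  # -> [6, 12, 18, 24]
```

Between barriers the nodes drift apart freely, like an orchestra without a conductor; the barrier is the shared bar line that keeps the group as a whole in lockstep.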
A natural assumption was that it should be possible and not too difficult to capture the SPMD model for programming MIMD computers in data-parallel languages, along lines similar to the successful SIMD languages. Various research prototype languages attempted to do this, with some success. By the 90s the value of portable, standardized programming languages was universally recognized, and there seemed to be some consensus about what a standard language for SPMD programming ought to look like. The High Performance Fortran (HPF) standard was born.
The remainder of this lecture will review some of these steps on the road to HPF in more detail. Several earlier languages that had HPF-like features will be described. There have been many data-parallel languages. The selection here is somewhat arbitrary, but was partly motivated by the desire to present practical languages that appear to have been widely used by ``real programmers''. Many important research contributions are missing.