The HPF Model

An important goal of HPF is to achieve code portability across a variety of parallel machines. This requires not only that HPF programs compile on all target machines, but also that a highly efficient HPF program on one parallel machine be able to achieve reasonably high efficiency on another parallel machine with a comparable number of processors. Otherwise, the effort a programmer spends to achieve high performance on one machine would be wasted when the HPF code is ported to another machine. Although SIMD processor arrays, MIMD shared-memory machines, and MIMD distributed-memory machines use very different low-level primitives, the fundamental factors that affect the performance of parallel programs on these machines are broadly similar. Thus, achieving high efficiency across different parallel machines with the same high-level HPF program is a feasible goal. While describing a full execution model is beyond the scope of this language specification, we focus here on two fundamental factors, the parallelism and the communication inherent in a computation, and show how HPF relates to them.

The quantitative cost associated with each of these factors is machine dependent; vendors are strongly encouraged to publish estimates of these costs in their system documentation. Note that, as with any execution model, these factors may not reflect everything relevant to performance on a particular architecture.

The parallelism in a computation can be expressed in HPF by the following constructs:

These features allow a user to specify potential data parallelism explicitly and in a machine-independent fashion. The purpose of this section is to clarify some of the performance implications of these features, particularly when they are combined with the HPF data distribution features. In addition, EXTRINSIC procedures provide an escape mechanism in HPF that allows efficient machine-specific primitives to be used through another programming paradigm. Because the resulting model of computation is inherently outside the realm of data-parallel programming, we will not discuss this feature further in this section.

A compiler may choose not to exploit information about parallelism, for example because of lack of resources or excessive overhead. In addition, some compilers may detect parallelism in sequential code by use of dependence analysis. This document does not discuss such techniques.

The interprocessor or inter-memory data communication that occurs during the execution of an HPF program is partially determined by the HPF data distribution directives in Section . The compiler determines the actual mapping of data objects to the physical machine, guided by these directives. The actual mapping, together with the computation specified by the program, determines the communication that is actually required, and the compiler generates the code to perform it. In general, if two data references in an expression or assignment are mapped to different processors or memory regions, then communication is required to bring them together. The following examples illustrate how this may occur.
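The rule that references mapped to different processors require communication can be sketched with a small counting model. The Python fragment below is an illustration only, not part of HPF: it assumes a one-dimensional BLOCK distribution with block size ceil(n/p), and it counts, for an assignment of the form A(i) = B(i + shift), how many element references would cross a processor boundary. The names block_owner and count_remote_refs are hypothetical helpers introduced for this sketch.

```python
from math import ceil

def block_owner(i, n, p):
    """Return the 0-based processor that holds element i (1-based) of an
    n-element array, assuming a BLOCK distribution over p processors."""
    block = ceil(n / p)          # assumed block size for BLOCK mapping
    return (i - 1) // block

def count_remote_refs(n, p, shift):
    """Count the elements of A(i) = B(i + shift) for which A(i) and
    B(i + shift) are held by different processors. Both arrays are
    assumed to have n elements and the same BLOCK mapping."""
    remote = 0
    for i in range(1, n + 1):
        j = i + shift
        if 1 <= j <= n and block_owner(i, n, p) != block_owner(j, n, p):
            remote += 1
    return remote
```

Under these assumptions, count_remote_refs(16, 4, 0) is 0 because every pair of operands is aligned, while count_remote_refs(16, 4, 1) is 3, one reference for each of the three block boundaries. An actual compiler is free to choose a different mapping or a different placement of the computation, so the count is only indicative.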

Clearly, there is a tradeoff between parallelism and communication. If all the data are mapped to one processor's local memory, then a sequential computation with no communication is possible, although the memory of one processor may not suffice to store all the program's data. Alternatively, mapping data to multiple processors' local memories may permit computational parallelism but may also introduce communication overhead. The optimal resolution of such conflicts depends heavily on the architecture and the underlying system software.
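A toy cost model makes the tradeoff concrete. The sketch below is an assumption-laden illustration, not a description of any real machine: it supposes a nearest-neighbour update over n elements that are BLOCK-mapped to p processors, charges t_compute per locally updated element, and charges t_message for each neighbouring processor with which boundary data must be exchanged. The function name and both cost parameters are hypothetical.

```python
from math import ceil

def estimated_time(n, p, t_compute=1.0, t_message=50.0):
    """Estimate the time of one nearest-neighbour update step.

    Work is divided evenly, so the busiest processor updates ceil(n / p)
    elements. A processor exchanges data with at most two neighbours,
    and with none at all when p == 1 (the purely sequential case)."""
    work = ceil(n / p) * t_compute
    neighbours = min(2, p - 1)
    return work + neighbours * t_message
```

With the default parameters, estimated_time(1000, 4) is 350.0 against 1000.0 on one processor, so distributing the data pays off. For a smaller problem the balance reverses: estimated_time(100, 4) is 125.0, which is worse than the 100.0 of the sequential mapping. Different values of the two cost parameters, which are machine dependent, move the crossover point.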

The following examples illustrate simple cases of communication, parallelism, and their interaction. Note that the examples are chosen for illustration and do not necessarily reflect efficient data layouts or computational methods for the program fragments shown. Rather, the intent is to derive lower bounds on the amount of communication that is needed to implement the given computations as they are written. This gives some indication of the maximum possible efficiency of the computations on any parallel machine. A particular system may not achieve this efficiency because of analysis limitations, or may disregard these bounds if other factors determine the performance of the code.