

Interaction of Communication and Parallelism

The examples in the two preceding sections were chosen so that parallelism and communication were not in conflict. The purpose of this section is to show cases where there is a tradeoff. The best implementation of all these examples will be machine dependent. As in the other sections, these examples do not necessarily reflect good programming practice.

Analyzing communication as in the preceding sections does not completely determine a program's performance. Consider the code:

      REAL x(100), y(100)
!HPF$ DISTRIBUTE (BLOCK) ONTO procs :: x, y
      ...
      DO k = 3, 98
        x(k) = y(k) * ( x(k-1) + x(k) + x(k+1) ) / 3.0
        y(k) = x(k) + ( y(k-1) + y(k-2) + y(k+1) + y(k+2) ) / 4.0
      ENDDO

Only a few values need be communicated at the boundary of each processor. However, every iteration of the DO loop uses data computed on previous iterations for the references x(k-1), y(k-1), and y(k-2). Therefore, although there is little inherent communication, the computation will run sequentially.
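The sequential behavior follows from Fortran semantics: no single array assignment is equivalent to this loop, because an array assignment evaluates its entire right-hand side before storing any element. The rewrite below (a hypothetical one, shown only for contrast) is fully parallel for exactly that reason, and therefore computes a different result:

      ! Fully parallel, but NOT equivalent to the DO loop above: the
      ! references x(2:97), y(2:97), and y(1:96) read values from before
      ! the sweep, not values produced earlier in the same sweep.
      x(3:98) = y(3:98) * ( x(2:97) + x(3:98) + x(4:99) ) / 3.0
      y(3:98) = x(3:98) + ( y(2:97) + y(1:96) + y(4:99) + y(5:100) ) / 4.0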

In contrast, consider the following code:

      REAL x(100), y(100), z(100)
!HPF$ DISTRIBUTE (BLOCK) ONTO procs :: x, y, z
      ...

Here the situation is reversed: when the iterations of a computation over these arrays are independent, they can all proceed in parallel, even if many operands must be fetched from other processors.

Similar tradeoffs arise in two dimensions. Consider arrays a and b spread over ten processors,

!HPF$ PROCESSORS procs(10)

and updated by a DO loop of vector operations in which each iteration uses results of the previous one, so that the dependence crosses processor boundaries. Several implementation strategies are available; a sketch of the first appears after this list.

- Redistribute the arrays before the loop, for example as

  !HPF$ DISTRIBUTE (*,BLOCK) ONTO procs :: a, b

  so that the dependence runs within each processor, and redistribute back to the original distributions after the DO loop. This allows parallel updates of columns of a, at the cost of two all-to-all communication operations.

- Group the columns of a into blocks, then operate on the blocks separately. This strategy can produce a pipelined effect, allowing substantial parallelism. It sends many small messages to the neighboring processor rather than one large message.

- Execute the vector operations sequentially. This results in totally sequential operation, but avoids the overhead of process start-up and small messages.

This list is not exhaustive. The optimal strategy will be highly machine dependent.
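A minimal sketch of the first strategy, assuming a (BLOCK,*) original distribution and a simple row recurrence as the loop body (both are assumptions for illustration), using the DYNAMIC and REDISTRIBUTE directives:

      REAL a(100,100), b(100,100)
!HPF$ PROCESSORS procs(10)
!HPF$ DYNAMIC :: a, b
!HPF$ DISTRIBUTE (BLOCK,*) ONTO procs :: a, b
      ...
!HPF$ REDISTRIBUTE (*,BLOCK) ONTO procs :: a, b    ! first all-to-all
      DO i = 2, 100
        ! The recurrence down each column is now local to the processor
        ! that owns the column, so the ten column blocks are updated in
        ! parallel and the loop needs no further communication.
        a(i,:) = a(i,:) + a(i-1,:) * b(i,:)
      ENDDO
!HPF$ REDISTRIBUTE (BLOCK,*) ONTO procs :: a, b    ! second all-to-all

Under the assumed (BLOCK,*) distribution, the same loop would hand a boundary row from each processor to the next, serializing the sweep; the two REDISTRIBUTE operations trade that serialization for a pair of all-to-all transposes.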

There is often a choice regarding where the result of an intermediate array expression will be stored, and different choices may lead to different communication performance. A straightforward implementation of the following code, for example, would require two transposition (communication) operations:

      REAL, DIMENSION(100,100) :: x, y, z
!HPF$ ALIGN WITH x :: y, z
      ...
      x = TRANSPOSE(y) + TRANSPOSE(z) + x

The same result can be computed through a temporary aligned with the operands:

      REAL, DIMENSION(100,100) :: x, y, z, t1
!HPF$ ALIGN WITH x :: y, z, t1
      ...
      t1 = y + z
      x = TRANSPOSE(t1) + x

with only one use of transposition. The assignment t1 = y + z requires no communication, because all three arrays are aligned.

Choosing an intermediate storage location is sometimes more complex, however. Consider the following code:

      REAL a(1000), b(1000), c(1000), d(1000)
      INTEGER ix(1000)
!HPF$ DISTRIBUTE (CYCLIC) ONTO procs :: a, b, c, d, ix
      ...
      a = b(ix) + c(ix) + d(ix)

and the following implementation strategies:

- Communicate the values b(ix(i)), c(ix(i)), and d(ix(i)) to the processor that owns a(i), and perform the additions there. This moves three values for every element of a.

- Perform the addition b(ix(i)) + c(ix(i)) + d(ix(i)) on the processor that owns element ix(i) of b, c, and d, and communicate only the sum to the processor that owns a(i). This moves one value for every element of a.

On the basis of communication, the second strategy is better by a factor of 3; adding further terms to the right-hand side can make this factor arbitrarily large. However, that analysis does not consider parallel execution costs. If there are repeated values in ix, the second strategy may produce poor load balance, because the additions concentrate on the processors that own the repeated elements. (For example, consider the case of ix(i) = 10 for all i: one processor then performs all 1000 additions.) Minimizing this cost is a compiler optimization and is outside the scope of this language specification.
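One way to obtain the communication behavior of the second strategy at the source level is to materialize the sum in a temporary first; this is a sketch under assumptions (the temporary t is hypothetical, and it costs extra storage and arithmetic):

      REAL t(1000)
!HPF$ DISTRIBUTE (CYCLIC) ONTO procs :: t
      ...
      t = b + c + d   ! no communication: t, b, c, and d are identically distributed
      a = t(ix)       ! a single gather operation instead of three

Because the additions here are performed at every position rather than only at the positions ix references, the arithmetic stays balanced even when ix contains repeated values; only the gather itself can still contend for the owners of the repeated elements.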
