The examples in the two preceding sections were chosen so that
parallelism and communication were not in conflict.
The purpose of this section is to show cases where there is a tradeoff.
The best implementation of all these examples will be machine dependent.
As in the other sections, these examples do not necessarily reflect good
programming practice.
Analyzing communication as in the preceding sections does not
completely determine a program's performance.
Consider the code:
      REAL x(100), y(100)
!HPF$ DISTRIBUTE (BLOCK) ONTO procs:: x, y
      ...
      DO k = 3, 98
         x(k) = y(k) * (x(k-1) + x(k) + x(k+1)) / 3.0
         y(k) = x(k) + (y(k-1) + y(k-2) + y(k+1) + y(k+2)) / 4.0
      ENDDO

Only a few values need be communicated at the boundary of each
processor. However, every iteration of the DO loop uses data computed on
previous iterations
for the references x(k-1), y(k-1), and y(k-2).
Therefore, although there is little inherent communication, the
computation will run sequentially.
In contrast, consider the following code:

      REAL x(100), y(100), z(100)
!HPF$ PROCESSORS procs(10)
!HPF$ DISTRIBUTE (BLOCK) ONTO procs:: x, y, z
      ...

Code of this kind can require the same boundary communication and yet
run in parallel, provided that no iteration of its loop uses values
computed by other iterations.
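For instance, a loop of the following form (shown only as an
illustrative assumption) reads z at neighboring indices but assigns
only x(k) and y(k), so no iteration uses a value computed by another
iteration:

      DO k = 3, 98
         ! z is never assigned in the loop, so these references create
         ! no dependences between iterations.
         x(k) = y(k) * (z(k-1) + z(k) + z(k+1)) / 3.0
         y(k) = x(k) + (z(k-1) + z(k-2) + z(k+1) + z(k+2)) / 4.0
      ENDDO

The boundary values of z must still be exchanged between neighboring
processors, but all iterations can then proceed at once.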
Trade-offs of this kind also arise with multidimensional arrays.
Consider a DO loop that updates the columns of an array a one at a
time, where each new column depends on the column computed on the
previous iteration, under the following mapping:

!HPF$ PROCESSORS procs(10)
!HPF$ DISTRIBUTE (*,BLOCK) ONTO procs :: a, b

With this distribution each processor holds a contiguous block of
complete columns, so a direct execution of the loop proceeds through
the processors one after another. Possible implementation strategies
include:
- Redistribute a and b as (BLOCK,*) before the loop, and redistribute
  back to the original distributions after the DO loop.
  This allows parallel updates of columns of a, at the cost of two
  all-to-all communication operations. (This strategy is sketched
  below.)
- Group the columns of a into blocks, then operate on the
  blocks separately. This strategy can produce a pipelined effect,
  allowing substantial parallelism. It sends many small messages to the
  neighboring processor rather than one large message.
- Execute the vector operations sequentially.
  This results in totally sequential operation, but avoids overhead from
  process start-up and small messages.
This list is not exhaustive.
The optimal strategy will be highly machine dependent.
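As an illustration of the first strategy, the remapping could be
written out with the DYNAMIC and REDISTRIBUTE directives. The array
bounds, loop limits, and loop body below are assumptions made for this
sketch; only the remapping pattern is the point:

      REAL a(100,100), b(100,100)
!HPF$ PROCESSORS procs(10)
!HPF$ DYNAMIC :: a, b
!HPF$ DISTRIBUTE (*,BLOCK) ONTO procs :: a, b
      ...
      ! Remap by rows, so that each column is spread across all ten
      ! processors (the first all-to-all communication).
!HPF$ REDISTRIBUTE (BLOCK,*) ONTO procs :: a, b
      DO j = 2, 99
         ! Assumed loop body: column j uses column j-1, computed on the
         ! previous iteration, but the element updates within a column
         ! are independent and therefore run in parallel.
         a(:,j) = b(:,j) * (a(:,j-1) + a(:,j) + a(:,j+1)) / 3.0
      ENDDO
      ! Restore the original distribution (the second all-to-all).
!HPF$ REDISTRIBUTE (*,BLOCK) ONTO procs :: a, b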
There is often a choice regarding where the result of an intermediate
array expression will be stored, and different choices may lead to
different communication performance.
A straightforward implementation of the following code, for example,
would require two transposition (communication) operations:
      REAL, DIMENSION(100,100) :: x, y, z, t1
!HPF$ ALIGN WITH x :: y, z, t1
      ...
      x = TRANSPOSE(y + z) + x

If the intermediate result y + z is stored with the mapping required at
its point of use, that is, transposed relative to y and z, then both y
and z must be transposed to compute it. If the intermediate is instead
stored in the temporary t1, which has the same mapping as x, y, and z,
the computation becomes

      t1 = y + z
      x = TRANSPOSE(t1) + x

with only one use of transposition.
Choosing an intermediate storage location is sometimes more complex, however.
Consider the following code:
      REAL a(1000), b(1000), c(1000), d(1000)
      INTEGER ix(1000)
!HPF$ DISTRIBUTE (CYCLIC) ONTO procs:: a, b, c, d, ix
      ...
      a = b(ix) + c(ix) + d(ix)
and the following implementation strategies:

- Gather the values of b(ix), c(ix), and d(ix) to the processors that
  hold the corresponding elements of a, then form the sums there. Each
  right-hand-side array contributes one gather, so roughly 3000 element
  transfers are needed, but the additions are spread evenly over the
  processors that own a.

- For each i, form the sum b(ix(i)) + c(ix(i)) + d(ix(i)) on the
  processor that holds element ix(i) of b, c, and d (these elements are
  co-located, since the three arrays share one distribution), and send
  only the single result to the processor that holds a(i). This needs
  roughly 1000 element transfers.
On the basis of communication, the second strategy is better by a factor
of 3; adding additional terms can make this factor arbitrarily
large.
However, that analysis does not consider parallel
execution costs.
If there are repeated values in ix, the second strategy may produce
poor load balance.
(For example, consider the case of
ix(i) = 10 for all i.)
Minimizing this cost is a compiler optimization and is outside the scope
of this language specification.
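A programmer who wants the behavior of the first strategy can, however,
make it explicit by introducing temporaries aligned with a (t1, t2, and
t3 are hypothetical names used only for this sketch):

      REAL t1(1000), t2(1000), t3(1000)
!HPF$ ALIGN WITH a :: t1, t2, t3
      ...
      ! Three gather communications, one per right-hand-side array.
      t1 = b(ix)
      t2 = c(ix)
      t3 = d(ix)
      ! The additions are purely local and evenly spread over the
      ! processors that own a, regardless of repeated values in ix.
      a = t1 + t2 + t3

Here the additions remain load balanced even when ix contains repeated
values; the price is the two extra gather operations.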