
## Examples of ON Directives

The following are valid examples of ON directives. Most of them illustrate idioms that programmers might want to use, rather than contrived situations. For simplicity, the first several examples assume the following array declarations:

```
      REAL A(N), B(N), C(N), D(N)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK), C(BLOCK), D(BLOCK)
```

One of the most commonly requested capabilities in HPF is control over how loop iterations are assigned to processors. (Historically, the ON clause first appeared in exactly this role in the Kali FORALL construct.) The ON directive provides this control, as shown in the following examples:

```
!HPF$ INDEPENDENT
      DO I = 2, N-1
!HPF$   ON HOME(A(I))
        A(I) = (B(I) + B(I-1) + B(I+1))/3
      END DO

!HPF$ INDEPENDENT
      DO J = 2, N-1
!HPF$   ON HOME(A(J+1)) BEGIN
        A(J) = B(J+1) + C(J+1) + D(J+1)
!HPF$   END ON
      END DO
```

The ON directive in the I loop sets the active processor for each iteration to be the processor that stores A(I). In other words, it advises the compiler to have each processor run over its own section of the A array (and therefore of B as well). The references to B(I-1) and B(I+1) must be fetched from off-processor for the first and last iterations on each processor (except at the array boundaries); note that those neighboring processors are not mentioned in the HOME clause.

The ON directive in the J loop similarly sets the active processor for each iteration, but advises the compiler to shift the computation. As a result, each processor does a vector sum of its own sections of B, C, and D, stores the first element of the result on the processor to its left, and stores the rest of the result (shifted by one) in A. It is worth noting that the directives would still be valid (and would still minimize nonresident data accesses) if the arrays were distributed CYCLIC, although the number of nonresident references would then be much higher.

Advice to implementors. It is highly recommended that compilers concentrate on optimizing DO loops in which a single ON clause encloses the entire loop body. Schematically, the code will be:
```
      DO i = lb, ub, stride
!HPF$   ON HOME(array(f(i))) BEGIN
        body
!HPF$   END ON
      END DO
```

where array has some data mapping. Assume the mapping gives processor p the elements myset(p). (In a BLOCK distribution, for example, myset(p) is a contiguous range of integers.) Then the generated code on processor p should be

```
      DO i ∈ [lb : ub : stride] ∩ f⁻¹(myset(p))
        body
      END DO
```
(This schematic does not show where communication or synchronization must be placed; that must be derived from analysis of the body.) Moreover, f is likely to be the identity function or a linear function with integer coefficients, both of which can be inverted easily. Given this, techniques for iterating through the resulting set can be found in the recent literature. (End of advice to implementors.)
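To make the schematic concrete, here is a small Python sketch (not from the specification; the helper names `block_bounds` and `local_iterations` are assumptions for this example) of how a runtime could enumerate the local iteration set [lb : ub : stride] ∩ f⁻¹(myset(p)) for a BLOCK-distributed array and a linear subscript f(i) = a*i + b:

```python
# Illustrative sketch (not from the HPF specification): enumerating the
# local iteration set  [lb : ub : stride] ∩ f⁻¹(myset(p))  for a
# BLOCK-distributed array and a linear subscript f(i) = a*i + b.

def block_bounds(n, nprocs, p):
    """1-based index range (lo, hi) of elements of an n-element
    BLOCK-distributed array owned by processor p."""
    blk = -(-n // nprocs)              # ceiling(n / nprocs): block size
    lo = p * blk + 1
    hi = min((p + 1) * blk, n)
    return lo, hi

def local_iterations(lb, ub, stride, a, b, n, nprocs, p):
    """Iterations i of DO i = lb, ub, stride for which array(a*i + b)
    is resident on processor p.  (A real compiler would invert f
    symbolically instead of testing every i, as discussed above.)"""
    lo, hi = block_bounds(n, nprocs, p)
    return [i for i in range(lb, ub + 1, stride) if lo <= a * i + b <= hi]
```

For the J loop above (HOME(A(J+1)), i.e. a = 1, b = 1) with N = 8 and two processors, processor 0 owns A(1:4) and executes J = 2, 3, while processor 1 owns A(5:8) and executes J = 4 through 7.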

Advice to users. One can expect the I loop above to generate efficient code for the computation partitioning. In effect, the compiler will arrange for each processor to iterate over its own section of array A. The J loop is slightly more complex, since the compiler must invert the HOME clause's subscripting function. That is, the compiler must solve K = J+1 for J, where K ranges over the resident elements of A. Of course, in this case J = K-1; in general, linear functions can be inverted by the compiler. (It should be pointed out, however, that complex combinations of ALIGN and DISTRIBUTE may make the description of K unwieldy, and this may add overhead to the inversion process.) (End of advice to users.)
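As a toy illustration of this inversion step (a hypothetical helper, not a compiler interface), a linear subscript f(J) = a*J + b can be inverted symbolically, with a guard for subscript values that have no integer preimage:

```python
# Toy illustration (hypothetical helper, not a compiler interface):
# inverting a linear HOME subscript f(J) = a*J + b, as in K = J + 1.

def invert_linear(a, b):
    """Return g such that g(f(J)) == J for f(J) = a*J + b (a nonzero).
    g(k) is None when k has no integer preimage under f."""
    def g(k):
        q, r = divmod(k - b, a)
        return q if r == 0 else None   # None: no iteration maps to k
    return g
```

For the HOME clause above, invert_linear(1, 1) solves K = J + 1, giving J = K - 1; for a subscript such as 2*J, odd values of K correctly yield no iteration.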

Sometimes it is advantageous to ``split'' an iteration between processors. The following case shows one example of this:

```
!HPF$ INDEPENDENT
      DO I = 2, N-1
!HPF$   ON HOME(A(I))
        A(I) = (B(I) + B(I-1) + B(I+1))/3
!HPF$   ON HOME(C(I+1))
        C(I+1) = A(I) * D(I+1)
      END DO
```

Here, the active processor sets for the two statements in the loop body are different. Under the first ON clause, the reference to A(I) in the first statement is resident; under the second ON clause, A(I) is nonresident (for some values of I) in the second statement. This partitioning maximizes the data locality of each statement individually, but it does require data movement between the two.

Advice to implementors. If there are several non-nested ON clauses in a loop, then the schematic above must be generalized. In essence, the iteration range for each individual ON clause must be generated; a processor then iterates over the union of these ranges. Statements guarded by an ON directive must now be guarded by an explicit test. In summary, the code for
```
      DO i = lb, ub, stride
!HPF$   ON HOME(array1(f1(i)))
        stmt1
!HPF$   ON HOME(array2(f2(i)))
        stmt2
      END DO
```

on processor p is

```
      set1 = [lb : ub : stride] ∩ f1⁻¹(myset1(p))
      set2 = [lb : ub : stride] ∩ f2⁻¹(myset2(p))
      DO i ∈ set1 ∪ set2
        IF (i ∈ set1) THEN
          stmt1
        END IF
        IF (i ∈ set2) THEN
          stmt2
        END IF
      END DO
```

where myset1(p) is the resident set for array1, and myset2(p) is the resident set for array2. (Again, synchronization and communication must be handled by other means.) Code transformations such as loop distribution and loop peeling can be used to eliminate the tests in many cases. They will be particularly profitable if there are data dependences between the ON blocks. (End of advice to implementors.)

Advice to users. Splitting an iteration like this is likely to require either additional tests at runtime or additional analysis by the compiler. Even if the compiler can generate low-overhead scheduling for the individual ON clauses, combining them is not necessarily low-overhead. The locality benefits must be rather substantial for this to pay off, but there are cases where multiple ON clauses are valuable. (All these statements are particularly true if one ON block uses data computed in another one.) (End of advice to users.)
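This generalized schematic can be sketched in Python (an illustration of the technique, not generated compiler code; the residency tests and statement bodies are passed in as callables standing in for the f⁻¹(myset(p)) computations):

```python
# Sketch of the generalized schematic (illustration only, not generated
# compiler code): each statement runs on exactly the iterations its ON
# clause assigns to this processor; the loop covers the union of the sets.

def run_split_loop(lb, ub, stride, in_set1, in_set2, stmt1, stmt2):
    """in_set1/in_set2 stand in for the f⁻¹(myset(p)) residency tests;
    stmt1/stmt2 are the statement bodies, called with the iteration index."""
    rng = range(lb, ub + 1, stride)
    set1 = {i for i in rng if in_set1(i)}
    set2 = {i for i in rng if in_set2(i)}
    for i in sorted(set1 | set2):      # i ∈ set1 ∪ set2
        if i in set1:                  # guard for stmt1
            stmt1(i)
        if i in set2:                  # guard for stmt2
            stmt2(i)
```

Loop distribution would split this into two unguarded loops whenever there are no dependences between stmt1 and stmt2, eliminating the tests as noted above.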

Because ON clauses nest naturally, they can be useful for expressing parallelism along different dimensions of an array. Consider the following example:

```
      REAL X(M,M)
!HPF$ DISTRIBUTE X(BLOCK,BLOCK)

!HPF$ INDEPENDENT, NEW(I)
      DO J = 1, M
!HPF$   ON HOME(X(:,J)) BEGIN
        DO I = 2, M
!HPF$     ON HOME(X(I,J))
          X(I,J) = (X(I-1,J) + X(I,J)) / 2
        END DO
!HPF$   END ON
      END DO
```
The active processor set for each iteration of the J loop is a column of the (presumably universal) processor arrangement. The I loop further subdivides the computation, giving each processor responsibility for computing the elements it owns. Many compilers would choose this computation partitioning automatically for such a simple example. However, the compiler might instead have attempted to fully parallelize the outer loop, executing each inner loop sequentially on one processor. (This might be attractive on a machine with very fast communication.) By inserting the ON clauses, the user has advised against this strategy, accepting restricted parallelism in exchange for additional locality.

Notice that the ON directive neither requires nor implies the INDEPENDENT assertion. Here, each iteration of the I loop depends on the preceding iteration, but the ON directive can still partition the computation among processors. The ON directive does not automatically make a loop parallel.

Advice to implementors. ``Dimension-based'' nesting, as above, will probably be a common case. The HOME clauses can be inverted at each level, treating indices from outer loops as run-time invariants. (End of advice to implementors.)
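One way to see why dimension-based nesting composes cleanly (an illustrative sketch with assumed helper names, not HPF semantics code): for a (BLOCK,BLOCK) distribution on a P x Q processor arrangement, ownership factors into independent per-dimension coordinates, so each nested HOME clause constrains only its own processor coordinate.

```python
# Illustrative sketch (assumed helpers, not HPF semantics code): for a
# (BLOCK,BLOCK) distribution on a P x Q processor arrangement, ownership
# factors into independent per-dimension coordinates, which is why each
# nested HOME clause can be inverted for its own dimension alone.

def block_owner(idx, n, nprocs):
    """Processor coordinate (0-based) owning element idx (1-based) of a
    BLOCK-distributed dimension of extent n."""
    blk = -(-n // nprocs)              # block size: ceiling(n / nprocs)
    return (idx - 1) // blk

def owner_2d(i, j, m, P, Q):
    """Owner of X(i,j) for X(M,M) distributed (BLOCK,BLOCK): the row
    coordinate depends only on i, the column coordinate only on j."""
    return block_owner(i, m, P), block_owner(j, m, Q)
```

In the example above, the outer ON HOME(X(:,J)) fixes only the column coordinate; the inner ON HOME(X(I,J)) then fixes the row coordinate, with J held invariant.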

Advice to users. Nested ON directives will tend to have efficient implementations if their HOME clauses refer to different dimensions of the processor arrangement, as in the above example. This minimizes the interaction between the levels of the loops, simplifying the implementation. (End of advice to users.)

Consider the following variation on the above example:

```
      REAL Y(M,M)
!HPF$ DISTRIBUTE Y(BLOCK,*)

!HPF$ INDEPENDENT, NEW(I)
      DO J = 1, M
!HPF$   ON HOME(Y(:,J)) BEGIN
        DO I = 2, M
!HPF$     ON HOME(Y(I,J))
          Y(I,J) = (Y(I-1,J) + Y(I,J)) / 2
        END DO
!HPF$   END ON
      END DO
```

Note that the ON clauses have not changed, except for the name of the array. The interpretation is similar to the above, except that the outer ON directive assigns each iteration of the J loop to all of the processors. The inner ON directive again implements a simple owner-computes rule. The programmer has directed the compiler to distribute a serial computation across all the processors. There are a few scenarios where this is more efficient than parallelizing the outer loop:

1. Parallelizing the outer loop would generate many nonresident references, since only a part of each column resides on any one processor. If nonresident references are very expensive (or if M is relatively small), this overhead may outweigh any gain from parallel execution.
2. The compiler may take advantage of the INDEPENDENT directive to avoid inserting any synchronization. This allows a natural pipelined execution: a processor executes its part of the I loop for one value of J, then immediately goes on to the next J iteration. Thus, the first processor will start on J=2 while the second receives the data it needs (from the first processor) for J=1. (A similar pipeline would develop in the X example above.)

Advice to users. This example points out how ON may improve software engineering. While the ``value'' of HOME(X(I)) will change if X's mapping changes, its intent will usually stay the same: run the loop ``aligned with'' the array X. Moreover, the form of the clauses is portable, and they simplify experimenting with alternative computation partitionings. Both qualities are similar to the advantages of DISTRIBUTE and ALIGN over low-level data layout mechanisms. (End of advice to users.)

ON directives are particularly useful when the compiler cannot accurately estimate data locality, for example when the computation uses indirection arrays. Consider three variations of the same loop:

```
      REAL X(N), Y(N)
      INTEGER IX(M), IY(M)
!HPF$ DISTRIBUTE X(BLOCK), Y(BLOCK)
!HPF$ DISTRIBUTE IX(BLOCK), IY(BLOCK)

!HPF$ INDEPENDENT
      DO I = 1, N
!HPF$   ON HOME( X(I) )
        X(I) = Y(IX(I)) - Y(IY(I))
      END DO

!HPF$ INDEPENDENT
      DO J = 1, N
!HPF$   ON HOME( IX(J) )
        X(J) = Y(IX(J)) - Y(IY(J))
      END DO

!HPF$ INDEPENDENT
      DO K = 1, N
!HPF$   ON HOME( X(IX(K)) )
        X(K) = Y(IX(K)) - Y(IY(K))
      END DO
```

In the I loop, each processor runs over its own section of the X array. (That is, the active processor for iteration I is the owner of X(I).) Only the reference X(I) is guaranteed to be resident. (If M does not equal N, then IX and IY have a different block size than X, and thus a different mapping.) However, if X(I), Y(IX(I)), and Y(IY(I)) are usually located on the same processor, then this choice of active processors may be the best available. (If X(I) and one of the other references are always on the same processor, then the programmer should add the RESIDENT clause, as explained in Section 9.3.)

In the J loop, iteration J's active processor is the owner of IX(J). Because IY has the same distribution as IX, the reference IY(J) is always resident, as is IX(J). This is the most common array reference class in the loop, so it minimizes the number of nonresident data references in the absence of any special properties of IX and IY. It may not evenly balance the load among processors, however; for example, if N = M/2 then half the processors will be idle. As before, if the values in IX or IY ensure that one of the Y references is always resident, a RESIDENT assertion should be added.

In the K loop, only the reference Y(IX(K)) is guaranteed to be resident (because Y and X have the same distribution). However, the values stored in IX and IY may ensure that Y(IY(K)) and X(K) are always resident. Even if the three REAL values are not always, but merely ``usually,'' on the same processor, this may be a good computation partitioning for both locality and parallelism. These advantages must be weighed against the cost of computing the partitioning, however. Since the HOME clause depends on a (presumably large) array of runtime values, substantial time may be required to determine which iterations are assigned to each processor.
It should be clear from this discussion that there is no magic solution for handling complex computation partitionings; the best answer is usually a combination of application knowledge, careful data structure design (including ordering of the elements), and efficient compilation methodology and runtime support.

Advice to implementors. The K loop is the situation that the inspector strategy described above was designed for. If there is an outer loop around any of these examples, and that loop modifies neither the distribution of X nor the values of IX, then a record of each processor's iterations can be saved for reuse. The cost is at worst linear in the sizes of the arrays. (End of advice to implementors.)
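A minimal sketch of such an inspector (assumed structure; `build_inspector` is an illustrative name, not part of any HPF runtime): one linear pass over the indirection array records which iterations each processor will execute, and the resulting schedule can be replayed as long as IX and the distribution of X are unchanged.

```python
# Minimal sketch of the inspector strategy (assumed structure and names):
# one linear pass over the indirection array records which iterations
# each processor executes for "ON HOME( X(IX(K)) )"; the schedule can be
# replayed until IX or the distribution of X changes.

def build_inspector(ix, n, nprocs):
    """sched[p] = 1-based iterations K for which X(IX(K)) is owned by
    processor p, with X(N) BLOCK-distributed over nprocs processors."""
    blk = -(-n // nprocs)              # block size: ceiling(n / nprocs)
    sched = [[] for _ in range(nprocs)]
    for k, target in enumerate(ix, start=1):
        sched[(target - 1) // blk].append(k)   # owner of X(ix(k)) gets k
    return sched
```

The cost is one pass over IX, matching the at-worst-linear bound above.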

Advice to users. It is unlikely that any current production compiler will generate low-overhead code for the K loop. The difference from the previous examples is that the HOME clause is not a function that can be easily inverted by the compiler. Some compilers may choose to execute every iteration on all processors, testing the HOME clause at run time; others may precompute a list of iterations for each processor. Of course, the cost of computing the list will be substantial.

In practice, one would make all the arrays the same size to avoid some of the alignment problems above; the example was written this way for pedagogical reasons, not as an example of good data structure design. (End of advice to users.)