Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min

The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization


Introduction

• Motivation: Current parallelizing compilers cannot handle access patterns that are complex or insufficiently defined at compile time (input-dependent or run-time-dependent conditions, subscripted subscripts, etc.).

• LRPD Test

- speculatively executes the loop as a doall

- applies a fully parallel test for cross-iteration data dependences

- if the test fails, the loop is re-executed serially

Inspector-Executor Method

• Inspector/Executor

- extract and analyze the memory access pattern

- transform the loop if necessary and execute

• Disadvantage

- cost and side effects: a problem if the address computation of the array under test depends on the actual data computation.

- parallel execution of the inspector loop is not always possible
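The inspector/executor split above can be sketched as follows. This is a minimal illustration, not the method of any particular system: it assumes a hypothetical loop body `A[L[i]] = A[K[i]] + 1`, so the inspector only needs the index arrays `K` and `L`, and it stands in for parallel execution by running the iterations in reverse order (any order is legal once the iterations are known to be independent). The names `inspect` and `execute` are made up for this sketch.

```python
def inspect(K, L):
    """Inspector: analyze the access pattern without doing the data computation.

    Returns True if no element is written by two iterations and no element
    is written in one iteration and read in a different one.
    """
    writers = {}  # element -> set of iterations that write it
    for i, e in enumerate(L):
        writers.setdefault(e, set()).add(i)
    if any(len(its) > 1 for its in writers.values()):
        return False  # cross-iteration output dependence
    # A read is safe if the element is never written, or written only by
    # the same iteration that reads it.
    return all(writers.get(e, {i}) == {i} for i, e in enumerate(K))


def execute(A, K, L):
    """Executor: run the loop in any order if independent, else in order."""
    order = list(range(len(K)))
    if inspect(K, L):
        order.reverse()  # stand-in for a parallel schedule
    for i in order:
        A[L[i]] = A[K[i]] + 1
    return A


independent = execute([10, 20, 30, 0, 0, 0], [0, 1, 2], [3, 4, 5])
dependent = execute([1, 0, 0, 0, 0, 0], [0, 3, 1], [3, 4, 5])
```

The sketch works only because the index arrays here do not depend on the values computed in the loop, which is exactly the case the disadvantage above says cannot always be assumed.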

Speculative Run-Time Parallelization

[Figure: flowchart. Compile time: static analysis and run-time transformations (Polaris). Run time: heuristic, checkpoint, speculative parallel execution, test (pass / fail), restore, reorder, sequential execution.]

Hazards (during the speculative execution)

• Exceptions

- invalidate the parallel execution

- clear the exception flag, restore the values of any altered variables, and execute serially.

• Cross-iteration dependencies in the loop

- LRPD Test
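The handling of both hazards can be sketched as a small driver: checkpoint, run speculatively, and fall back to serial execution if an exception occurs or the dependence test fails. This is an illustrative sketch under assumptions: the state is a plain dictionary, `ArithmeticError` stands in for a hardware exception, and the names `run_speculatively`, `speculative_doall`, and `serial_loop` are hypothetical.

```python
import copy


def run_speculatively(state, speculative_doall, serial_loop):
    """Run the loop speculatively; on any hazard, restore and run serially."""
    checkpoint = copy.deepcopy(state)        # save values the loop may alter
    try:
        # Assumed contract: returns True if the dependence test passed.
        passed = speculative_doall(state)
    except ArithmeticError:                  # stand-in for an exception hazard
        passed = False
    if passed:
        return "parallel"
    state.clear()
    state.update(checkpoint)                 # restore the altered variables
    serial_loop(state)
    return "serial"


def serial_loop(state):
    state["A"] = [x * 2 for x in state["A"]]


def good_doall(state):
    state["A"] = [x * 2 for x in state["A"]]
    return True


def faulty_doall(state):
    state["A"][0] = -1                       # partial update before the fault
    return state["A"][1] // 0                # raises ZeroDivisionError


def failing_test_doall(state):
    state["A"][0] = -1                       # result produced under a dependence
    return False
```

In all three cases the final state equals what the serial loop would have produced; only the returned label differs.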

LPD Test (The Lazy Privatizing doall Test)

1. Marking Phase

- For each shared array A[1:s], keep read, write, and not-private shadow arrays Ar[1:s], Aw[1:s], and Anp[1:s].

(a) Uses: if this array element has not been modified in this iteration, then set the corresponding elements in Ar and Anp.

(b) Defs: set the corresponding element in Aw, and clear it in Ar if set.

(c) twi(A): count the total number of write accesses to A that are marked in this iteration (i: iteration #).


2. Analysis Phase (performed after the speculative execution)

(a) Compute

(i) tw(A) = sum over all iterations i of twi(A)

(ii) tm(A) = sum(Aw[1:s])

(iii) if tm(A) != tw(A), there are cross-iteration output dependences

(b) If any(Aw[:] & Ar[:]), then the loop is not a doall and the phase ends: a def and a use of the same location occur in different iterations (flow/anti dependence).


2. Analysis Phase (continued)

(c) Else if tw(A) == tm(A), then the loop is a doall (without privatizing the array A).

(d) Else if any(Aw[:] & Anp[:]), then the array A is not privatizable (there is at least one iteration in which some element of A was used before being modified).

(e) Otherwise, the loop was made into a doall by privatizing the shared array A.
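The marking and analysis phases can be sketched in a few lines. This is a sequential simulation under stated assumptions, not the parallel implementation: each iteration is given as a list of marked accesses, per-iteration marks are collected in local sets and then merged into the shadow arrays, and "clear in Ar if set" is interpreted as applying to marks made by the same iteration. The function name `lpd_test` and the verdict strings are made up for this sketch. The sample input is the set of accesses the LPD test marks for the example loop used later in these slides (iterations 2 and 4 mark nothing).

```python
def lpd_test(s, iterations):
    """iterations: one list per iteration of ("r" | "w", element) marks,
    with elements numbered 1..s. Returns (verdict, Aw, Ar, Anp, tw, tm)."""
    Aw, Ar, Anp = [0] * s, [0] * s, [0] * s
    tw = 0
    # 1. Marking phase (sequential stand-in for the speculative doall)
    for accesses in iterations:
        written, only_read, read_first = set(), set(), set()
        for kind, e in accesses:
            if kind == "r":
                if e not in written:          # (a) use before any def
                    only_read.add(e)
                    read_first.add(e)
            else:
                written.add(e)                # (b) def
                only_read.discard(e)          #     clear the Ar mark if set
        for e in written:
            Aw[e - 1] = 1
        for e in only_read:
            Ar[e - 1] = 1
        for e in read_first:
            Anp[e - 1] = 1
        tw += len(written)                    # (c) twi(A), accumulated
    # 2. Analysis phase
    tm = sum(Aw)
    if any(w and r for w, r in zip(Aw, Ar)):
        verdict = "not a doall"               # (b) flow/anti dependence
    elif tw == tm:
        verdict = "doall"                     # (c) no privatization needed
    elif any(w and n for w, n in zip(Aw, Anp)):
        verdict = "not privatizable"          # (d)
    else:
        verdict = "doall with privatization"  # (e)
    return verdict, Aw, Ar, Anp, tw, tm


# Marks for the example loop with B1 = (1 0 1 0 1), K = (1 2 3 4 1),
# L = (2 2 4 4 2), marking reads lazily (only when z is used).
example = [
    [("r", 1), ("w", 2)], [], [("r", 3), ("w", 4)], [], [("r", 1), ("w", 2)],
]
result = lpd_test(4, example)
```

For the sample input the shadow arrays come out as Aw = 0 1 0 1, Ar = 1 0 1 0, Anp = 1 0 1 0 with tw = 3 and tm = 2, so the test reaches step (e).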

Dynamic dead reference elimination

• To avoid introducing false dependences, the marking of the read and not-private shadow arrays, Ar and Anp, can be postponed until the value of the shared variable is actually used.

• Definition: A dynamic dead read reference in a loop is a read access of a shared variable that does not contribute to the computation of any other shared variable which is live at loop end.

• The “lazy” marking employed by the LPD test, i.e., the dynamic dead reference elimination technique, allows it to qualify more loops than the PD test.
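The effect of postponing the read marks can be seen on the example loop from these slides, where z = A(K(i)) is read in every iteration but used only when B1(i) is true. The sketch below only collects which elements get a read mark and which get a write mark; it does not model the full shadow-array rules. The function name `marked_elements` is hypothetical.

```python
B1 = [1, 0, 1, 0, 1]
K = [1, 2, 3, 4, 1]
L = [2, 2, 4, 4, 2]


def marked_elements(lazy):
    """Return (elements with a read mark, elements with a write mark)."""
    reads, writes = set(), set()
    for b, k, l in zip(B1, K, L):
        if not lazy:
            reads.add(k)        # eager: mark at the point of the read
        if b:
            if lazy:
                reads.add(k)    # lazy: mark only when z is actually used
            writes.add(l)
    return reads, writes


eager_reads, eager_writes = marked_elements(lazy=False)
lazy_reads, lazy_writes = marked_elements(lazy=True)

eager_overlap = sorted(eager_reads & eager_writes)   # [2, 4]
lazy_overlap = sorted(lazy_reads & lazy_writes)      # []
```

With eager marking, the dead reads in iterations 2 and 4 put read marks on elements 2 and 4, which are also written; with lazy marking those reads are never marked and the overlap disappears.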

PD Test

Source loop:

    do i = 1, 5
      z = A(K(i))
      if (B1(i) .eq. .true.) then
        A(L(i)) = z + C(i)
      endif
    enddo

Instrumented loop:

    do i = 1, 5
      markread(K(i))
      z = A(K(i))
      if (B1(i) .eq. .true.) then
        markwrite(L(i))
        A(L(i)) = z + C(i)
      endif
    enddo

B1(1:5) = (1 0 1 0 1)

K(1:5) = (1 2 3 4 1)

L(1:5) = (2 2 4 4 2)

PD test shadow arrays:

                      1  2  3  4    tw  tm
    Aw
    Ar                1  1  1  1
    Anp               1  1  1  1
    Aw(:) & Ar(:)
    Aw(:) & Anp(:)

PD Test

Source loop:

    do i = 1, 5
      z = A(K(i))
      if (B1(i) .eq. .true.) then
        A(L(i)) = z + C(i)
      endif
    enddo

Instrumented loop:

    do i = 1, 5
      markread(K(i))
      z = A(K(i))
      if (B1(i) .eq. .true.) then
        markwrite(L(i))
        A(L(i)) = z + C(i)
      endif
    enddo

B1(1:5) = (1 0 1 0 1)

K(1:5) = (1 2 3 4 1)

L(1:5) = (2 2 4 4 2)

PD test shadow arrays:

                      1  2  3  4    tw  tm
    Aw                0  1  0  1     3   2
    Ar                1  0  1  0
    Anp               1  1  1  1
    Aw(:) & Ar(:)     0  0  0  0
    Aw(:) & Anp(:)    0  1  0  1

LPD Test

Source loop:

    do i = 1, 5
      z = A(K(i))
      if (B1(i) .eq. .true.) then
        A(L(i)) = z + C(i)
      endif
    enddo

Instrumented loop (the read is marked only where z is used):

    do i = 1, 5
      z = A(K(i))
      if (B1(i) .eq. .true.) then
        markread(K(i))
        markwrite(L(i))
        A(L(i)) = z + C(i)
      endif
    enddo

B1(1:5) = (1 0 1 0 1)

K(1:5) = (1 2 3 4 1)

L(1:5) = (2 2 4 4 2)

LPD test shadow arrays:

                      1  2  3  4    tw  tm
    Aw                0  1  0  1     3   2
    Ar                1  0  1  0
    Anp               1  0  1  0
    Aw(:) & Ar(:)     0  0  0  0
    Aw(:) & Anp(:)    0  0  0  0

Run-time Reduction Parallelization

• Recognition of reduction variable + Parallelizing reduction variable

• Pattern matching identification

- The data dependence (DD) test that qualifies a statement as a reduction statement cannot be performed statically in the presence of input-dependent access patterns.

- Syntactic pattern matching cannot identify all potential reduction variables (e.g., with subscripted subscripts).

The LRPD Test : Extending the LPD Test for Reduction Validation

(a) Source program

    do i = 1, n
    S1:   A(K(i)) = ………
    S2:   ……… = A(L(i))
    S3:   A(R(i)) = A(R(i)) + exp()
    enddo

(b) Transformed program

    doall i = 1, n
          markwrite(K(i))
          markredux(K(i))
    S1:   A(K(i)) = ………
          markread(L(i))
          markredux(L(i))
    S2:   ……… = A(L(i))
          markwrite(R(i))
    S3:   A(R(i)) = A(R(i)) + exp()
    enddo

The markredux operation sets the corresponding element of the shadow array Anx to true.

Anx: used only to check that the reduction variable is not accessed outside the single reduction statement.

LRPD Test

• Modified Analysis Pass

- 2(d’) Else if any(Aw[:] & Anp[:] & Anx[:]), then some element of A written in the loop is neither a reduction variable nor privatizable. Thus, the loop is not a doall and the phase ends.

- 2(e’) Otherwise, the loop was made into a doall by parallelizing reduction and privatization.
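The extended test can be sketched for the three-statement loop of the transformed program above. This is a sequential simulation under assumptions: marks are placed exactly as in the transformed program (S1 marks a write and a redux on K(i), S2 marks a read and a redux on L(i), S3 marks only a write on R(i)), per-iteration marks are merged into the shadow arrays afterwards, and the function name `lrpd_test` and the verdict strings are hypothetical.

```python
def lrpd_test(s, K, L, R):
    """Mark and analyze the loop S1/S2/S3 for an array A[1:s]."""
    Aw, Ar, Anp, Anx = ([0] * s for _ in range(4))
    tw = 0
    for k, l, r in zip(K, L, R):
        written, only_read, read_first = set(), set(), set()
        # S1: A(K(i)) = ...           markwrite, markredux
        written.add(k)
        Anx[k - 1] = 1
        # S2: ... = A(L(i))           markread, markredux
        if l not in written:
            only_read.add(l)
            read_first.add(l)
        Anx[l - 1] = 1
        # S3: A(R(i)) = A(R(i)) + exp()   markwrite only
        written.add(r)
        only_read.discard(r)
        for e in written:
            Aw[e - 1] = 1
        for e in only_read:
            Ar[e - 1] = 1
        for e in read_first:
            Anp[e - 1] = 1
        tw += len(written)
    tm = sum(Aw)
    if any(w and rd for w, rd in zip(Aw, Ar)):
        return "not a doall (flow/anti dependence)"
    if tw == tm:
        return "doall"
    if any(w and n and x for w, n, x in zip(Aw, Anp, Anx)):   # step 2(d')
        return "not a doall (neither reduction nor privatizable)"
    return "doall with privatization and reduction parallelization"  # 2(e')


# Element 5 is touched only by the reduction statement S3.
valid = lrpd_test(6, K=[1, 2], L=[1, 2], R=[5, 5])
# Element 5 is also read by S2 in iteration 1, outside the reduction.
invalid = lrpd_test(6, K=[1, 2], L=[5, 3], R=[5, 5])
```

In the first call the reduction target is never marked in Anx, so the loop qualifies; in the second, element 5 carries Aw, Anp, and Anx marks together and step 2(d’) rejects the loop.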

[Figures not recovered: Performance (1), Performance (2), Experimental Results Summary]

Other Run-time Parallelization Papers

• “Techniques for Speculative Run-Time Parallelization of Loops”, Manish Gupta and Rahul Nim, SC’98.

- More efficient run-time array privatization

- No rollback of the entire loop computation; the loop is completed by generating synchronization

- Early hazard detection

Other Run-time Parallelization Papers (continued)

• “Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors”, Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas, HPCA 1998.

- Run-time parallelization techniques are often computationally expensive and not general enough.

- Idea: execute the code in parallel speculatively and let extended cache coherence protocol hardware detect any dependence violations.

- Performance: speedup of 7.3 on 16 processors, and 50% faster than the software-only approach
