The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization
Lawrence Rauchwerger and David A. Padua
PLDI 1995
Presented by Seung-Jai Min
Introduction
• Motivation: Current parallelizing compilers cannot handle access patterns that are complex or insufficiently defined at compile time (input-dependent or run-time-dependent conditions, subscripted subscripts, etc.).
• LRPD Test
- Speculatively executes the loop as a doall
- Applies a fully parallel test for cross-iteration data dependences
- If the test fails, the loop is re-executed serially
Inspector-Executor Method
• Inspector/Executor
- Extract and analyze the memory access pattern
- Transform the loop if necessary and execute it
• Disadvantages
- Cost and side effects, if the address computation of the array under test depends on the actual data computation
- Parallel execution of the inspector loop is not always possible
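The inspector/executor split described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular compiler: the names `inspector` and `executor`, the one-read-one-write loop body, and the reversed iteration order standing in for a parallel schedule are all assumptions made for the example.

```python
def inspector(read_idx, write_idx):
    """Return True if the loop's iterations are independent.

    Iteration i reads element read_idx[i] and writes element write_idx[i].
    Only the index arrays are examined; no loop data is computed.
    """
    writer = {}  # element -> the iteration that writes it
    for i, w in enumerate(write_idx):
        if w in writer:
            return False  # two iterations write the same element
        writer[w] = i
    # A read conflicts if a different iteration writes the same element.
    return all(writer.get(r, i) == i for i, r in enumerate(read_idx))


def executor(a, read_idx, write_idx, independent):
    """Run a[write_idx[i]] = a[read_idx[i]] + 1 for every iteration i."""
    order = range(len(read_idx))
    if independent:
        # Stand-in for a doall: any iteration order gives the same result,
        # so the reversed order is used here to make that visible.
        order = reversed(order)
    for i in order:
        a[write_idx[i]] = a[read_idx[i]] + 1
    return a


reads, writes = [3, 3, 3], [0, 1, 2]
print(executor([10, 20, 30, 40], reads, writes, inspector(reads, writes)))
```

The inspector here is cheap because the index arrays are known before the loop runs. When the indices are themselves computed inside the loop, the inspector would have to repeat the loop's work, which is the cost listed above.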
Speculative run-time parallelization
[Figure: flowchart spanning compile time and run time. Box labels: static analysis, run-time transformations, Polaris, heuristic, checkpoint, speculative parallel execution, test (pass / fail), restore, reorder, sequential execution.]
Hazards (during the speculative execution)
• Exceptions
- An exception invalidates the parallel execution
- Recovery: clear the exception flag, restore the values of any altered variables, and execute serially
• Cross-iteration dependences in the loop
- Detected by the LRPD test
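The recovery path for both hazards can be sketched as a checkpoint, a speculative run, and a serial fallback. The function name `speculate`, the dictionary used as program state, and the three callbacks are hypothetical. A real system would checkpoint only the variables the loop may alter.

```python
import copy


def speculate(state, parallel_loop, test_passes, serial_loop):
    """Run parallel_loop speculatively; fall back to serial_loop on any hazard.

    Returns True if the speculative result was kept.
    """
    checkpoint = copy.deepcopy(state)  # saved before speculation starts
    try:
        parallel_loop(state)
        ok = test_passes(state)  # e.g. an LRPD-style dependence test
    except Exception:
        ok = False  # an exception invalidates the parallel execution
    if not ok:
        state.clear()
        state.update(copy.deepcopy(checkpoint))  # restore altered variables
        serial_loop(state)
    return ok


def increment(s):
    s["a"] = [x + 1 for x in s["a"]]


state = {"a": [1, 2]}
kept = speculate(state, increment, lambda s: False, increment)
print(kept, state)
```

In the usage above the test is forced to fail, so the state is restored from the checkpoint and the loop runs once more serially.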
LPD Test (The Lazy Privatizing doall Test)
1. Marking Phase
- For each shared array A[1:s], maintain read, write, and not-private shadow arrays Ar[1:s], Aw[1:s], and Anp[1:s].
(a) Uses: if this array element has not been modified in this iteration, set the corresponding elements in Ar and Anp.
(b) Defs: set the corresponding element in Aw, and clear it in Ar if set.
(c) twi(A): count the total number of write accesses to A that are marked in iteration i.
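A minimal sketch of the marking phase follows. It makes two assumptions that the slides do not spell out: marks are collected per iteration and merged into the shadow arrays when the iteration ends, and a write clears only a read mark set earlier in the same iteration. The class name `ShadowArrays` and its methods are hypothetical, and elements are indexed from 0.

```python
class ShadowArrays:
    """Shadow arrays Aw, Ar, Anp for one shared array of s elements."""

    def __init__(self, s):
        self.aw = [0] * s
        self.ar = [0] * s
        self.anp = [0] * s
        self.tw = 0  # running sum of twi(A) over finished iterations
        self._w, self._r, self._np = set(), set(), set()  # current iteration

    def mark_read(self, e):
        # (a) Use: counts only if e was not yet written in this iteration.
        if e not in self._w:
            self._r.add(e)
            self._np.add(e)

    def mark_write(self, e):
        # (b) Def: mark written and clear this iteration's read mark, if any.
        self._w.add(e)
        self._r.discard(e)

    def end_iteration(self):
        # (c) twi(A): number of elements marked written in this iteration.
        self.tw += len(self._w)
        merges = ((self._w, self.aw), (self._r, self.ar), (self._np, self.anp))
        for marks, shadow in merges:
            for e in marks:
                shadow[e] = 1
        self._w, self._r, self._np = set(), set(), set()


sh = ShadowArrays(3)
sh.mark_read(0); sh.mark_write(0); sh.end_iteration()   # read, then write A[0]
sh.mark_write(2); sh.mark_read(2); sh.mark_read(1); sh.end_iteration()
print(sh.aw, sh.ar, sh.anp, sh.tw)
```

In the first iteration A[0] is read before it is written, so it stays marked in Anp but not in Ar. In the second iteration the read of A[2] follows a write in the same iteration and leaves no mark.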
2. Analysis Phase (performed after the speculative execution)
(a) Compute
(i) tw(A) = sum of twi(A) over all iterations i
(ii) tm(A) = sum(Aw[1:s])
(iii) If tm(A) != tw(A), there is a cross-iteration output dependence.
(b) If any(Aw[:] & Ar[:]), the loop is not a doall and the phase ends: a def and a use access the same location in different iterations (flow/anti dependence).
(c) Else if tw(A) == tm(A), the loop is a doall (without privatizing the array A).
(d) Else if any(Aw[:] & Anp[:]), the array A is not privatizable: there is at least one iteration in which some element of A was used before it was modified.
(e) Otherwise, the loop was made into a doall by privatizing the shared array A.
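Steps 2(a) to 2(e) amount to a short decision chain over the shadow arrays. The sketch below assumes the shadow arrays are plain lists of 0/1 values and that tw has already been summed over all iterations. The function name `lpd_analysis` and the returned strings are hypothetical.

```python
def lpd_analysis(aw, ar, anp, tw):
    """Classify a speculatively executed loop from its shadow arrays."""
    tm = sum(aw)  # (a)(ii): number of distinct elements written
    if any(w and r for w, r in zip(aw, ar)):
        return "not a doall"  # (b): flow/anti dependence
    if tw == tm:
        return "doall"  # (c): no output dependence either
    if any(w and n for w, n in zip(aw, anp)):
        return "not a doall"  # (d): A is not privatizable
    return "doall with A privatized"  # (e)


# One element written by two iterations, never read before being written.
print(lpd_analysis(aw=[1, 0], ar=[0, 1], anp=[0, 1], tw=2))
```

The order of the checks matters: the output-dependence comparison in (c) is only meaningful once (b) has ruled out flow and anti dependences.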
Dynamic dead reference elimination
• To avoid introducing false dependences, the marking of the read and not-private shadow arrays, Ar and Anp, can be postponed until the value of the shared variable is actually used.
• Definition : A dynamic dead read reference in a loop is a read access of a shared variable that does not contribute to the computation of any other shared variable which is live at loop end.
• The “lazy” marking employed by the LPD test, i.e., the dynamic dead reference elimination technique, allows it to qualify more loops than the PD test.
PD Test

Source loop:
do i = 1, 5
  z = A(K(i))
  if (B1(i) .eq. .true.) then
    A(L(i)) = z + C(i)
  endif
enddo

Loop with marking operations:
do i = 1, 5
  markread(K(i))
  z = A(K(i))
  if (B1(i) .eq. .true.) then
    markwrite(L(i))
    A(L(i)) = z + C(i)
  endif
enddo

B1(1:5) = (1 0 1 0 1)
K(1:5) = (1 2 3 4 1)
L(1:5) = (2 2 4 4 2)

Shadow arrays with only the read marks filled in:
PD test shadow arrays   1  2  3  4
Ar                      1  1  1  1
Anp                     1  1  1  1

Completed shadow arrays:
PD test shadow arrays   1  2  3  4    tw  tm
Aw                      0  1  0  1     3   2
Ar                      1  0  1  0
Anp                     1  1  1  1
Aw(:) & Ar(:)           0  0  0  0
Aw(:) & Anp(:)          0  1  0  1
LPD Test

Source loop:
do i = 1, 5
  z = A(K(i))
  if (B1(i) .eq. .true.) then
    A(L(i)) = z + C(i)
  endif
enddo

Loop with marking operations (the read is marked only where z is used):
do i = 1, 5
  z = A(K(i))
  if (B1(i) .eq. .true.) then
    markread(K(i))
    markwrite(L(i))
    A(L(i)) = z + C(i)
  endif
enddo

B1(1:5) = (1 0 1 0 1)
K(1:5) = (1 2 3 4 1)
L(1:5) = (2 2 4 4 2)

LPD test shadow arrays  1  2  3  4    tw  tm
Aw                      0  1  0  1     3   2
Ar                      1  0  1  0
Anp                     1  0  1  0
Aw(:) & Ar(:)           0  0  0  0
Aw(:) & Anp(:)          0  0  0  0
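The effect of postponing the read mark can be replayed on the example data. The sketch below tracks only Aw, Anp, tw and tm, which is enough to show where the eager and lazy variants differ in the privatization check. The function names `mark` and `privatizable` are hypothetical, and Ar is deliberately left out.

```python
B1 = [1, 0, 1, 0, 1]
K = [1, 2, 3, 4, 1]
L = [2, 2, 4, 4, 2]


def mark(lazy):
    """Replay the example loop and return (Aw, Anp, tw, tm) for A(1:4)."""
    aw, anp, tw = [0] * 4, [0] * 4, 0
    for i in range(5):
        used = bool(B1[i])  # z is only used inside the if
        # The read precedes the write in every iteration, so each marked
        # read is a read-before-write. Eager marks all reads, lazy only
        # those whose value is actually used.
        if used or not lazy:
            anp[K[i] - 1] = 1
        if used:
            aw[L[i] - 1] = 1
            tw += 1  # at most one element is written per iteration
    return aw, anp, tw, sum(aw)


def privatizable(aw, anp):
    """True if no written element was read before being written."""
    return not any(w and n for w, n in zip(aw, anp))


for lazy in (False, True):
    aw, anp, tw, tm = mark(lazy)
    print("lazy" if lazy else "eager", aw, anp, tw, tm, privatizable(aw, anp))
```

With this data both variants see tw = 3 and tm = 2, so privatization is needed either way. Only the lazy variant leaves Aw & Anp all zero, because the reads in iterations 2 and 4 are dead.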
Run-time Reduction Parallelization
• Two steps: recognizing the reduction variable and parallelizing the reduction
• Pattern matching identification
- The data dependence (DD) test that qualifies a statement as a reduction statement cannot be performed statically in the presence of input-dependent access patterns.
- Syntactic pattern matching cannot identify all potential reduction variables (e.g., subscripted subscripts).
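Once a statement has been validated as a reduction, one common way to parallelize it is to let each processor accumulate into a private copy and merge the copies afterwards. The sketch below illustrates that general idea for a sum over a subscripted array. It is not taken from the slides: the function name `parallel_sum_reduction`, the cyclic distribution of iterations, and the sequential simulation of the processors are all assumptions.

```python
def parallel_sum_reduction(a, r, values, num_procs=4):
    """Apply a[r[i]] += values[i] for all i using private partial sums."""
    n = len(r)
    partials = []
    for p in range(num_procs):  # each pass stands in for one processor
        private = {}  # this processor's private accumulator
        for i in range(p, n, num_procs):  # cyclic share of the iterations
            private[r[i]] = private.get(r[i], 0) + values[i]
        partials.append(private)
    # Merge phase: combine the private partial sums into the shared array.
    for private in partials:
        for e, v in private.items():
            a[e] += v
    return a


print(parallel_sum_reduction([0, 0, 0], [0, 1, 0, 2, 0], [1, 2, 3, 4, 5]))
```

The merge is valid because addition is associative and commutative. With floating-point values the result may differ slightly from the serial order.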
The LRPD Test: Extending the LPD Test for Reduction Validation

(a) Source program
do i = 1, n
S1: A(K(i)) = ………
S2: ……… = A(L(i))
S3: A(R(i)) = A(R(i)) + exp()
enddo

(b) Transformed program
doall i = 1, n
    markwrite(K(i))
    markredux(K(i))
S1: A(K(i)) = ………
    markread(L(i))
    markredux(L(i))
S2: ……… = A(L(i))
    markwrite(R(i))
S3: A(R(i)) = A(R(i)) + exp()
enddo

The markredux operation sets the corresponding element of the shadow array Anx to true.
Anx is used only to check that the reduction variable is not accessed outside the single reduction statement.
LRPD Test
• Modified Analysis Pass
- 2(d’) Else if any(Aw[:] & Anp[:] & Anx[:]), then some element of A written in the loop is neither a reduction variable nor privatizable. Thus, the loop is not a doall and the phase ends.
- 2(e’) Otherwise, the loop was made into a doall by parallelizing reductions and privatization.
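The modified analysis pass can be sketched by adding the Anx shadow array to the LPD decision chain. Steps (a) to (c) are assumed to carry over unchanged from the LPD test, and only the last two steps are replaced by 2(d’) and 2(e’). The function name `lrpd_analysis`, the list-of-0/1 representation, and the returned strings are hypothetical.

```python
def lrpd_analysis(aw, ar, anp, anx, tw):
    """Classify a loop from Aw, Ar, Anp, Anx and the write count tw."""
    tm = sum(aw)  # number of distinct elements written
    if any(w and r for w, r in zip(aw, ar)):
        return "not a doall"  # (b): flow/anti dependence
    if tw == tm:
        return "doall"  # (c): no output dependence
    if any(w and n and x for w, n, x in zip(aw, anp, anx)):
        # (d'): some written element is neither a valid reduction
        # variable nor privatizable.
        return "not a doall"
    return "doall with privatization and reduction parallelization"  # (e')


# A[0] is updated by the reduction statement in two iterations and is
# never touched elsewhere, so Anx[0] stays 0.
print(lrpd_analysis(aw=[1, 0], ar=[0, 0], anp=[0, 0], anx=[0, 0], tw=2))
```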
Performance (1)
Performance (2)
Experimental Results Summary
Other Run-time Parallelization Papers
• “Techniques for Speculative Run-Time Parallelization of Loops”, Manish Gupta and Rahul Nim, SC’98.
- More efficient run-time array privatization
- No rollback of the entire loop computation: the loop is completed by generating synchronization
- Early hazard detection
• “Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors”, Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas, HPCA 1998.
- Run-time parallelization techniques are often computationally expensive and not general enough.
- Idea: execute the code in parallel speculatively and let extended cache coherence protocol hardware detect any dependence violations.
- Performance: speedup of 7.3 on 16 processors, and 50% faster than the software-only scheme