Lawrence Rauchwerger
Parasol Laboratory, CS, Texas A&M University
http://parasol.tamu.edu/
Speculative Run-Time Parallelization
Outline
• Motivation
• LRPD test, a speculative run-time parallelization framework
• Recursive LRPD test for partially parallel loops
• A run-time test for loops with sparse memory accesses
• Summary and Current work
Data Dependence
Data Dependence (DD) :
• Data dependence relations are used as the essential ordering constraints among statements or operations in a program.
• A data dependence occurs when two memory accesses touch the same memory location and at least one of them writes to that location.
flow:    X = ..   followed by   .. = X
anti:    .. = X   followed by   X = ..
output:  X = ..   followed by   X = ..

Three basic data dependence relations
Parallel Loop
Can a loop be executed in parallel?
• Test procedure:

  FOR every pair of load/store and store/store operations <L,S> DO
    IF (L and S could access the same location in different iterations)
      LOOP is sequential
• For arrays, the memory accesses are functions of the loop indices. These functions can be linear, non-linear, or an unknown map.
A parallel loop:
  for i = ..
    A[i] = A[i] + B[i]

A sequential loop:
  for i = ..
    A[i+1] = A[i] + B[i]
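The contrast between the two loops can be sketched in Python (an illustrative sketch, not from the slides): the first loop touches only location i in iteration i, while the second reads the value the previous iteration just wrote, forcing sequential order.

```python
def parallel_safe(A, B):
    # Iteration i touches only A[i]: no cross-iteration dependence,
    # so any execution order gives the same result.
    for i in range(len(B)):
        A[i] = A[i] + B[i]
    return A

def loop_carried(A, B):
    # Iteration i reads A[i], which iteration i-1 just wrote:
    # a loop-carried flow dependence forces sequential execution.
    for i in range(len(B)):
        A[i + 1] = A[i] + B[i]
    return A

print(loop_carried([1, 1, 1, 1, 1], [1, 1, 1, 1]))   # values cascade: [1, 2, 3, 4, 5]
```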
Parallelism Enabling Transformations
Original (anti and output dependences on 'temp'):
  for i = ..
    temp = A[i]
    A[i] = B[i]
    B[i] = temp

Privatized:
  for i = ..
    private temp
    temp = A[i]
    A[i] = B[i]
    B[i] = temp

• Privatization: let each iteration have a separate location for 'temp'
• 'temp' is used as temporary storage in each iteration
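A minimal Python sketch of why privatization works (illustrative, not from the slides): once each iteration has its own private 'temp', iterations commute, so any execution order, here an arbitrary permutation, produces the same result as the sequential loop.

```python
def swap_loop_sequential(A, B):
    # 'temp' is notionally one shared scalar reused every iteration.
    for i in range(len(A)):
        temp = A[i]
        A[i] = B[i]
        B[i] = temp
    return A, B

def swap_loop_any_order(A, B, order):
    # With 'temp' private to each iteration, the iterations are
    # independent, so executing them in any permuted order is valid.
    for i in order:
        temp = A[i]          # private copy per iteration
        A[i] = B[i]
        B[i] = temp
    return A, B

A1, B1 = swap_loop_sequential([1, 2, 3], [4, 5, 6])
A2, B2 = swap_loop_any_order([1, 2, 3], [4, 5, 6], [2, 0, 1])
print(A1 == A2 and B1 == B2)   # True
```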
Parallel or not?
Reduction Parallelization
A reduction is:
• an associative and commutative operation of the form x = x ⊕ exp
• where x does not occur in exp or anywhere else in the loop (with some exceptions)
• usually extracted by pattern matching.
do i = .. A = A + B[i]
• Reduction parallelization:
pA[1:P] = 0
do p = 1, P
  do i = my_i, ...
    pA[p] = pA[p] + B[i]
do p = 1, P
  A = A + pA[p]
Parallel or not?
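The transformation above can be sketched in Python (a sketch under the slides' scheme: per-processor partial sums pA[1:P], then a short cross-processor merge; the block partition stands in for the slides' my_i iteration ranges).

```python
def reduction_parallel(B, P):
    """Per-"processor" partial accumulation followed by a final
    sequential merge, as in the slide's reduction transformation."""
    n = len(B)
    pA = [0] * P
    # Each processor p accumulates its own block of iterations.
    for p in range(P):
        for i in range(p * n // P, (p + 1) * n // P):
            pA[p] += B[i]
    # Final cross-processor reduction into A.
    A = 0
    for p in range(P):
        A += pA[p]
    return A

B = list(range(1, 101))
print(reduction_parallel(B, 4))   # 5050, same as sum(B)
```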
Irregular Applications: The most challenging
for i = .. A[B[i]] = A[C[i]] + B[i]
• From the compiler's viewpoint, in irregular programs:
  – Data arrays are accessed via indirections that are usually input dependent.
  – Optimization and parallelization information is not available statically.

Problem: the loop is sequential if any B[i] = C[j], i ≠ j, or any B[i] = B[j], i ≠ j.
• Adaptation: matrix pivoting, adaptive mesh refinement, etc.
• More than 50% of scientific programs are irregular [Kennedy’94]
• Irregular applications involve problems defined on irregular or adaptive data structures– CFD, Molecular dynamics, sparse linear algebra.
Irregular Programs: an Example
Sparse Symmetric MVM (irregular : data accesses via indirections)
DO I=1,N                    ! Row ID
  DO K=row[I], row[I+1]
    J=col[K]                ! Column ID
    B[I] += M[K] * X[J]     ! M[K]=M[I,J]
    B[J] += M[K] * X[I]     ! M[K]=M[J,I]
[Figure: sparse symmetric matrix-vector product M · X = B; only the nonzero entries of M are stored]

DO I=1,N          ! Row ID
  DO J=1,N        ! Column ID
    B[I] += M[I,J] * X[J]
Dense MVM (regular)
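The sparse symmetric loop above can be sketched in Python (an illustrative reconstruction: the CSR-like names row/col/val are assumptions, and the diagonal is handled separately here so that each stored off-diagonal entry contributes to both B[I] and B[J] without double counting, mirroring the slide's inner loop).

```python
def sym_spmv(row, col, val, diag, X):
    # row/col/val hold the strictly-upper-triangular nonzeros in a
    # CSR-like layout: row[i]..row[i+1]-1 indexes the entries of row i.
    # Each stored M[i,j] contributes to both B[i] and B[j]; the data
    # accesses through col[] are the input-dependent indirections.
    n = len(X)
    B = [diag[i] * X[i] for i in range(n)]
    for i in range(n):
        for k in range(row[i], row[i + 1]):
            j = col[k]
            B[i] += val[k] * X[j]   # M[K] = M[I,J]
            B[j] += val[k] * X[i]   # M[K] = M[J,I]
    return B

# 3x3 symmetric matrix [[4,1,0],[1,5,2],[0,2,6]]
row, col, val = [0, 1, 2, 2], [1, 2], [1, 2]
diag = [4, 5, 6]
print(sym_spmv(row, col, val, diag, [1, 1, 1]))   # [5, 8, 8]
```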
Run-Time Parallelization: The Idea
• Perform program analysis and optimization during program execution (run-time)
• Select dynamically between pre-optimized versions of the code
do i = 1,N
  A[i+K] = A[i] + B[i]        (K, N from input)

Run-time test: is -N < K < N ?
  T: a cross-iteration dependence is possible; execute the sequential version
       do i = 1,N  A[i+K] = A[i] + B[i]
  F: the accesses cannot overlap; execute the parallel version
       doall i = 1,N  A[i+K] = A[i] + B[i]

[Wolfe 97]
Outline
• Motivation
• LRPD test, a speculative run-time parallelization framework
• Recursive LRPD test for partially parallel loops
• A run-time test for loops with sparse memory accesses
• Summary and Current work
Run-time Parallelization: Irregular Apps.
DO i = ...
  A[ W[i] ] = A[ R[i] ] + C[i]

Problem: the loop is not parallel if any R[i] = W[j], i ≠ j.

Solution: instrument the code to
  1. Collect the data access pattern (represented by W[i], R[i])
  2. Verify whether any data dependence could occur

Two approaches: inspector/executor and speculation.
Inspector/executor:
  do i = ...
    trace W[i], R[i]
  analyze and schedule
  doall i = ...
    A[W[i]] = A[R[i]] ...

Speculation:
  doall i = ...
    trace W[i], R[i]
    A[W[i]] = A[R[i]] ...
  analyze
  if (fail) re-execute sequentially:  do i = ...
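The inspector's job can be sketched in Python (an illustrative sketch, not the slides' implementation): trace the W[i]/R[i] pattern and decide whether the loop is fully parallel before executing it.

```python
def inspector_is_parallel(W, R):
    """Inspector sketch for A[W[i]] = A[R[i]] + C[i]: the loop is
    parallel only if no location is written in one iteration and
    touched (read or written) in a different iteration."""
    writes = {}                      # location -> iteration that writes it
    for i, w in enumerate(W):
        if w in writes:              # W[i] == W[j], i != j: output dep.
            return False
        writes[w] = i
    for i, r in enumerate(R):
        j = writes.get(r)
        if j is not None and j != i: # R[i] == W[j], i != j: flow/anti dep.
            return False
    return True

print(inspector_is_parallel([4, 5, 6], [1, 2, 3]))   # True
print(inspector_is_parallel([4, 5, 6], [5, 2, 3]))   # False
```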
Run-time Parallelization: Approaches
Inspector/Executor [Saltz et al 88,91]:
  Inspector (tracing + scheduling) → Executor → End.
  If the access pattern has not changed, the schedule is reused.

Speculative Execution [Rauchwerger & Padua, 95]:
  Checkpoint → speculative execution + tracing → Test.
  Success? Yes → End; No → roll back + sequential loop.
Overview of Speculative Parallelization
Compile time (Polaris): static analysis of the source code, insertion of run-time transformations.

Run time:
  Checkpoint → speculative parallel execution → error detection (data dependence analysis).
  Success? Yes → done; No → restore the checkpoint and execute sequentially.
Run-time Optimization
• Postpones analysis and optimization until execution time
  – checkpoint/restoration mechanism
  – error detection method to test the validity of speculation
• Uses actual run-time values of program parameters affecting performance
• Selects dynamically between pre-optimized versions of the code
Speculative DOALL Parallelization
Main idea:
• Speculatively execute the loop in parallel and record accesses to the data under test in shadow arrays.
• Afterwards, analyze whether the loop was truly parallel (no actual dependences) by identifying multiple accesses to the same location.
Checkpoint → speculative parallel execution + MARK reads and writes → analysis.
Success? Yes → End; No → restore checkpoint and execute sequentially.

[Rauchwerger & Padua ’94]

Problem:
  DO i = ...
    A[ W[i] ] = A[ R[i] ] + C[i]

Each processor marks a replicated (private) shadow of the data array A; the private shadows are later merged into a single shadow for the analysis.
DOALL Test – Marking and Analysis
• Parallel speculative execution
  – Mark read and write operations in per-processor private shadow arrays; when marking a write, clear the read mark.
  – Increment a private write counter (# of write operations).
• Post-speculation analysis
  – Merge the private shadow arrays into global shadow arrays.
  – Count the elements that have been marked written.
  – If (write shadow ∧ read shadow ≠ 0): an anti or flow dependence exists.
  – If (# of marked elements < # of write operations): an output dependence exists.
LRPD Test: Main Ideas
• Lazy (value-based) Reduction and Privatization DOALL test
• Errors:
  – a loop-carried flow dependence
  – a loop-carried anti or output dependence for arrays that are not privatizable
  – speculatively applied privatization or reduction parallelization transformations that turn out to be INVALID
[Rauchwerger & Padua ’95]
The problem: Parallelization of the following loop:
Do I = 1, 5
  z = A(K(I))
  if B(I) then
    A(L(I)) = z + C(I)
  endif
Enddo

B(1:5) = (1,0,1,0,1)
K(1:5) = (1,2,3,4,1)
L(1:5) = (2,2,4,4,2)

• All iterations are executed concurrently.
• Unsafe if some A(K(i)) == A(L(j)), i ≠ j.
Types of errors (data dependence related):
• Writing a memory location in different iterations
• Writing and reading a memory location in different iterations
LRPD Test: an Example
Parallel speculative execution and marking phase:
• allocate shadow arrays Aw, Ar, Anp (one per processor)
• speculatively privatize A and execute the loop in parallel
• record accesses to the data under test in the shadows

markwrite(i):
  if this is the first write to A(i) in the iteration:
    mark Aw(i); clear Ar(i); increment tw_A (write counter)

markread(i):
  if A(i) has not already been written in the iteration:
    mark Ar(i); mark Anp(i) (not privatizable)

Original loop:
  do i = 1, 5
S1  z = A[K[i]]
    if (B[i]) then
S2    A[L[i]] = z + C[i]
    endif
  enddo

Instrumented loop (markwrite increments tw_A):
  doall i = 1, 5
S1  z = A[K[i]]
    if (B[i]) then
      markread(K[i])
      markwrite(L[i])
S2    A[L[i]] = z + C[i]
    endif
  enddo
LRPD Test: Marking Phase
Post-execution analysis phase: detect errors (dependences) by identifying multiple accesses to the same location.

• compute tm(A) = sum of marks in Aw across processors (number of distinct elements written)
• if Aw ∧ Ar ≠ 0 then the loop was NOT a DOALL
• else if tw = tm then the loop was a DOALL
• else if Aw ∧ Anp ≠ 0 then the loop was NOT a DOALL
• otherwise privatization was valid and the loop was a DOALL
Shadow array   1 2 3 4   Tw (attempted)  Tm (counted)  Outcome
Aw(1:4)        0 1 0 1   3               2             FAIL
Ar(1:4)        1 0 1 0
Anp(1:4)       1 0 1 0
Aw ∧ Ar        0 0 0 0                                 Pass
Aw ∧ Anp       0 0 0 0                                 Pass
LRPD Test: Analysis Phase
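The marking and analysis phases on this example can be sketched in Python (a simplified single-processor sketch of the slides' test: since in this loop the read of A(K(i)) always precedes the write of A(L(i)) within an iteration, the Ar-clearing case of markwrite never triggers and is omitted).

```python
def lrpd(K, L, B, n_elems):
    """LRPD sketch for: z = A(K(i)); if B(i) then A(L(i)) = z + C(i)."""
    Aw = [0] * n_elems   # written
    Ar = [0] * n_elems   # read before any write in its iteration
    Anp = [0] * n_elems  # not privatizable
    tw = 0               # total write operations attempted
    for i in range(len(K)):
        if not B[i]:
            continue
        r, w = K[i] - 1, L[i] - 1        # 1-based -> 0-based
        Ar[r] = 1                        # markread
        Anp[r] = 1
        Aw[w] = 1                        # markwrite
        tw += 1
    tm = sum(Aw)                         # distinct elements written
    if any(a and b for a, b in zip(Aw, Ar)):
        return "NOT a DOALL"             # flow or anti dependence
    if tw == tm:
        return "DOALL"
    if any(a and b for a, b in zip(Aw, Anp)):
        return "NOT a DOALL"             # output dep., not privatizable
    return "DOALL (privatized)"

# Slides' example: Tw=3 vs Tm=2 fails the plain DOALL check, but
# Aw ∧ Ar = 0 and Aw ∧ Anp = 0, so the privatized loop is a DOALL.
print(lrpd(K=[1, 2, 3, 4, 1], L=[2, 2, 4, 4, 2],
           B=[1, 0, 1, 0, 1], n_elems=4))   # DOALL (privatized)
```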
Outline
• Motivation
• LRPD test, a speculative run-time parallelization framework
• Recursive LRPD test for partially parallel loops
• A run-time test for loops with sparse memory accesses
• Summary and Current work
do i = 1, 8
  z = A[K[i]]
  A[L[i]] = z + C[i]
end do

K[1:8] = [1,2,3,1,44,2,1,1]
L[1:8] = [44,5,5,44,3,5,3,3]
A( )   iter:  1  2  3  4  5  6  7  8
  1           R        R        R  R
  2              R           R
  3                 R     W     W  W
 44           W        W  R
  5              W  W        W
For the LRPD test:
– One data dependence can invalidate the speculative parallelization.
– The slowdown is proportional to the speculative parallel execution time.
– Partial parallelism is not exploited.
Partially Parallel Loop Example
• Main idea
  – Transform a partially parallel loop into a sequence of fully parallel, block-scheduled loops.
  – Iterations before the first data dependence are correct and committed.
  – Re-apply the LRPD test on the remaining iterations.
• Worst case
  – Sequential time plus testing overhead.
[Dang, Yu and Rauchwerger’02]
The Recursive LRPD
Algorithm:
  Initialize → Checkpoint → Execute as DOALL → Analyze.
  On success: Commit.
  On failure: Restore, Reinitialize, Restart on the remaining iterations.

[Figure: block-scheduled iterations on processors p0-p3; after the 1st stage the committed prefix is dropped and the 2nd stage re-executes the remainder]
Example
do i = 1, 8
B[i] = f(i)
z = A[K[i]]
A[L[i]] = z + C[i]
enddo
L[1:8] = [2,2,4,4,2,1,5,5]
K[1:8] = [1,2,3,4,1,2,4,2]
start = newstart = 1; success = false; end = 8
initialize shadow array; checkpoint B
while (.not. success)
doall i = newstart, end
B[i] = f(i)
z = pA[K[i]]
pA[L[i]] = z + C[i]
markread(K[i]); markwrite(L[i])
end doall
analyze(success, newstart)
commit(pA, A, start, newstart-1)
if (.not. success) then
restore B[newstart:end]
reinitialize shadow array
endif
end while
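The staging strategy of the while loop above can be sketched in Python (an illustrative simplification, not the production algorithm: with A privatized via copy-in/copy-out, only cross-processor flow dependences, where one iteration reads a location an earlier iteration on another processor wrote, are modeled as invalidating a stage).

```python
def recursive_lrpd_stages(K, L, P):
    """Count R-LRPD stages for: z = A[K[i]]; A[L[i]] = z + C[i].
    Each stage block-schedules the remaining iterations on P processors,
    finds the earliest iteration in a cross-processor flow dependence,
    commits everything before it, and retries the rest."""
    n = len(K)
    start, stages = 0, 0
    while start < n:
        stages += 1
        block = -(-(n - start) // P)          # block-scheduled chunk size

        def fails_at():
            for i in range(start, n):
                for j in range(start, i):
                    cross = (j - start) // block != (i - start) // block
                    if cross and L[j] == K[i]:
                        return i              # earliest offending iteration
            return None

        bad = fails_at()
        if bad is None:
            break                             # stage fully parallel: done
        start = bad                           # commit the prefix, retry rest
    return stages

# Slides' example: stage 1 commits iterations 1-4, stage 2 runs the rest.
K = [1, 2, 3, 1, 44, 2, 1, 1]
L = [44, 5, 5, 44, 3, 5, 3, 3]
print(recursive_lrpd_stages(K, L, P=4))   # 2
```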
• Implemented as a run-time pass in Polaris plus additional hand-inserted code:
  – Privatization with copy-in/copy-out for the arrays under test.
  – Replicated buffers for reductions.
  – Backup arrays for checkpointing.
Implementation
First stage: detect cross-processor data dependences

A( )   proc:  P1(1-2)  P2(3-4)  P3(5-6)  P4(7-8)
  1           R        R                 R
  2           R                 R
  3                    R        W        W
 44           W        W        R
  5           W        W        W

Second stage (iterations 5-8): fully parallel

A( )   proc:                    P3(5-6)  P4(7-8)
  1                                      R
  2                             R
  3                             W        W
 44                             R
  5                             W
do i = 1, 8
z = A[K[i]]
A[L[i]] = z + C[i]
end do
K[1:8] = [1,2,3,1,44,2,1,1]
L[1:8] = [44,5,5,44,3,5,3,3]
Recursive LRPD Example
• Redistribute remaining iterations across processors.
• Execution time for each stage will decrease.
• Disadvantages:
  – May uncover new dependences across processors.
  – May incur remote cache misses from data redistribution.
[Figure: with redistribution, the remaining iterations are re-spread across p1-p4 before the 2nd stage; without redistribution, they stay on their original processors]
Work Redistribution
Outline
• Motivation
• LRPD test, a speculative run-time parallelization framework
• Recursive LRPD test for partially parallel loops
• A run-time test for loops with sparse memory accesses
• Summary and Current work
Question: do multiple accesses to the same location exist? There are two ways to log the information needed to answer it:

1. For each data element: which operations accessed it?
   – Complexity: proportional to the number of elements.
2. For each memory operation: which element did it access?
   – Complexity: proportional to the number of iterations.

[Figure: dense vs. sparse access patterns of operations (iterations) 1-8 over the data elements]
Overhead of the LRPD test (first way):
• Marking (speculation) phase: proportional to the number of operations.
• Analysis phase: proportional to the number of elements; not efficient for loops with "sparse accesses".
[Yu and Rauchwerger’00]
Reduce Overhead of Run-Time Test
– Use a compacted shadow structure for loops with sparse access patterns (second way):
  1. Closed form – monotonic access with constant stride
  2. List – monotonic access with variable stride
  3. Hash table – random access
– The run-time library adaptively selects among closed form, list and hash shadow structures.
– A compile-time technique reduces redundant markings.
Run-Time Test for Loops with Sparse Access
• Speculative execution
  – For every static marking site, mark in a temporary private shadow structure.
  – At the end of each iteration, adaptively aggregate the markings (triplet → list → hash table).
  – Overhead: proportional to the number of distinct array references.
• Analysis phase
  – Compares the aggregated shadow structures pair by pair.
  – May reduce to comparisons of ranges or triplets.
  – Overhead: proportional to the number of dynamic marking sites, a constant fraction of the number of distinct array references.
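The adaptive choice among closed form, list and hash shadows can be sketched in Python (an illustrative sketch; the triplet encoding and the function names are assumptions, not the library's API).

```python
def compress_accesses(indices):
    """Adaptive shadow sketch: monotonic accesses with constant stride
    compress to a (start, stride, count) triplet; monotonic with variable
    stride to a list; anything else falls back to a hash set. Overhead
    then tracks the number of distinct references, not the array size."""
    if len(indices) >= 2:
        stride = indices[1] - indices[0]
        diffs = [b - a for a, b in zip(indices, indices[1:])]
        if all(d == stride for d in diffs) and stride > 0:
            return ("triplet", (indices[0], stride, len(indices)))
        if all(d > 0 for d in diffs):
            return ("list", list(indices))
    return ("hash", set(indices))

def overlap(a, b):
    """Analysis-phase sketch: do two compacted shadows share an element?
    (Shown here via the fully general expand-to-set fallback.)"""
    def expand(kind, data):
        if kind == "triplet":
            start, stride, count = data
            return {start + k * stride for k in range(count)}
        return set(data)
    return bool(expand(*a) & expand(*b))

w = compress_accesses([2, 4, 6, 8])     # constant stride -> triplet
r = compress_accesses([1, 5, 30, 7])    # random -> hash
print(w[0], r[0], overlap(w, r))        # triplet hash False
```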
Combine Marks
Before:
  do ...
    if (pred1) then
      A(1+W(i)) = ...
      A(2+W(i)) = ...
    ...
    if (pred1) then
      A(3+W(i)) = ...

After (one combined mark):
  do ...
    if (pred1) then
      Mark(A, W(i), WF)
      A(1+W(i)) = ...
      A(2+W(i)) = ...
    ...
    if (pred1) then
      A(3+W(i)) = ...
• One mark suffices for multiple references if:
  – their subscripts differ only by a loop-invariant expression,
  – they are of the same type among RO, RW, WR,
  – they have the same guarding predicates.
• Combining procedure:
  – Partition the subscript expressions.
  – Apply set operations during a recursive traversal of the CDG.
Outline
• Motivation
• LRPD test, a speculative run-time parallelization framework
• Recursive LRPD test for partially parallel loops
• A run-time test for loops with sparse memory accesses
• Summary and Current work
Effect of Speculative Parallelization
[Figure: speedups on 2, 4, 8 and 16 processors for TRACK, SPICE, FMA3D and MDG, ranging up to roughly 6x]
Program   Techniques              Coverage   Suite
TRACK     R-LRPD                  98%        Perfect
SPICE     Sparse Test / R-LRPD    89%        SPEC’92
FMA3D     R-LRPD                  71%        SPEC’00
MDG       LRPD                    99%        Perfect
Speculative Run-time Parallelization: Summary
• Run-time techniques apply program analysis and optimization transformations during program execution.
• Speculative run-time parallelization techniques (LRPD test, etc.) collect memory access history while executing a loop in parallel.
• Recursive LRPD test can speculatively parallelize any loop.
• The overhead of run-time speculation can be further reduced by adaptively applying different shadow data structures.
The New Computing Challenge
• Today’s systems: general purpose, heterogeneous
  – Poor portability, low efficiency
  – Need automatic system-level software support
GAUSSIAN Quantum chemistry system
CHARMM Molecular dynamics of organic systems
SPICE Circuit simulation
ASCI Multi-physics simulations
• The Challenge: Easy to Use & High Performance
• Today’s scientific applications: bio, multi-physics, etc.
  – Time consuming, with dynamic features and irregular data structures.
  – Need automatic optimization techniques to shorten execution time.
Today: System Centric Computing
Development, analysis and optimization happen in isolation: the application (algorithm) passes through a static compiler, and the executable meets the input data only at execution time on a generic system (OS & architecture). There is no global optimization in the interest of dynamic applications:
• OS services are generic
• Architecture is generic
• Compilers are conservative

Stack: Application → Compiler → OS → HW.
Approach: Application-Centric Computing (SmartApps)

Application development, analysis and optimization feed a run-time compiler, and a run-time system carries out execution, analysis and optimization using the input data, on a modular OS and a reconfigurable architecture. The SmartApp keeps the application in control: instance-specific optimization combining compiler + OS + architecture + data + feedback.

Stack: Input Data → Application → Compiler → OS → HW.
SmartApps System Architecture

1. A parallelizing compiler, augmented with run-time techniques, produces a configurable executable carrying compiler-internal information.
2. The run-time system gathers run-time information (sample input, system state, etc.) and generates an optimal application and system configuration.
3. The application executes while adaptive software continuously monitors performance and adapts as necessary.
4. Adaptation:
   – Small adaptation (tuning): run-time tuning, with no recompilation or reconfiguration.
   – Large adaptation (failure, phase change): recompile the application and/or reconfigure the system.
Related Publications
The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization, Lawrence Rauchwerger and David Padua, PLDI’95
Parallelizing While Loops for Multiprocessor Systems, Lawrence Rauchwerger and David Padua, IPPS’95
Run-time Methods for Parallelizing Partially Parallel Loops, Lawrence Rauchwerger, Nancy Amato and David Padua, ICS’95
SmartApps: An Application Centric Approach to High Performance Computing: Compiler-Assisted Software and Hardware Support for Reduction Operations, F. Dang, M. J. Garzaran, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, N. Amato, L. Rauchwerger and J. Torrellas, NSFNGS, 2002
The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops, F. Dang, H. Yu and L. Rauchwerger, IPDPS’02
Hybrid Analysis: Static & Dynamic Memory Reference Analysis, S. Rus, L. Rauchwerger and J. Hoeflinger, ICS’02
Techniques for Reducing the Overhead of Run-time Parallelization, H. Yu and L. Rauchwerger, CC’00
Adaptive Reduction Parallelization Techniques, H. Yu and L. Rauchwerger, ICS’00
http://parasol.tamu.edu/