Lawrence Rauchwerger
Parasol Laboratory, CS, Texas A&M University
http://parasol.tamu.edu/
Speculative Run-Time Parallelization
Outline
• Motivation
• LRPD test, a speculative run-time parallelization framework
• Recursive LRPD test for partially parallel loops
• A run-time test for loops with sparse memory accesses
• Summary and Current work
Data Dependence
Data Dependence (DD) :
• Data dependence relations are used as the essential ordering constraints among statements or operations in a program.
• A data dependence occurs when two memory accesses touch the same memory location and at least one of them writes to that location.
flow:    X = ..   followed by   .. = X
anti:    .. = X   followed by   X = ..
output:  X = ..   followed by   X = ..

Three basic data dependence relations
Parallel Loop
Can a loop be executed in parallel?
• Test procedure:

  FOR every pair of load/store and store/store operations <L,S> DO
    IF (L and S could access the same location in different iterations)
      LOOP is sequential
• For arrays, the memory accesses are functions of the loop indices. These functions can be linear, non-linear, or an unknown map.
A parallel loop:
  for i = ..
    A[i] = A[i] + B[i]

A sequential loop:
  for i = ..
    A[i+1] = A[i] + B[i]
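The contrast between the two loops can be sketched in Python (an illustrative sketch, not from the slides): the first loop touches only location i in iteration i, while the second reads the value the previous iteration just wrote, forcing sequential order.

```python
def parallel_safe(A, B):
    # Iteration i touches only A[i]: no cross-iteration dependence,
    # so any execution order gives the same result.
    for i in range(len(B)):
        A[i] = A[i] + B[i]
    return A

def loop_carried(A, B):
    # Iteration i reads A[i], which iteration i-1 just wrote:
    # a loop-carried flow dependence forces sequential execution.
    for i in range(len(B)):
        A[i + 1] = A[i] + B[i]
    return A

print(loop_carried([1, 1, 1, 1, 1], [1, 1, 1, 1]))   # values cascade: [1, 2, 3, 4, 5]
```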
Parallelism Enabling Transformations
Original (anti and output dependences on 'temp'):
  for i = ..
    temp = A[i]
    A[i] = B[i]
    B[i] = temp

Privatized:
  for i = ..
    private temp
    temp = A[i]
    A[i] = B[i]
    B[i] = temp

• Privatization: let each iteration have a separate location for 'temp'
• 'temp' is used as temporary storage in each iteration
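A minimal Python sketch of why privatization works (illustrative, not from the slides): once each iteration has its own private 'temp', iterations commute, so any execution order, here an arbitrary permutation, produces the same result as the sequential loop.

```python
def swap_loop_sequential(A, B):
    # 'temp' is notionally one shared scalar reused every iteration.
    for i in range(len(A)):
        temp = A[i]
        A[i] = B[i]
        B[i] = temp
    return A, B

def swap_loop_any_order(A, B, order):
    # With 'temp' private to each iteration, the iterations are
    # independent, so executing them in any permuted order is valid.
    for i in order:
        temp = A[i]          # private copy per iteration
        A[i] = B[i]
        B[i] = temp
    return A, B

A1, B1 = swap_loop_sequential([1, 2, 3], [4, 5, 6])
A2, B2 = swap_loop_any_order([1, 2, 3], [4, 5, 6], [2, 0, 1])
print(A1 == A2 and B1 == B2)   # True
```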
Parallel or not?
Reduction Parallelization
A reduction is:
• an associative and commutative operation of the form x = x ⊕ exp
• where x does not occur in exp or anywhere else in the loop (with some exceptions)
• usually extracted by pattern matching.
do i = .. A = A + B[i]
• Reduction parallelization:
pA[1:P] = 0
do p = 1, P
  do i = my_i, ...
    pA[p] = pA[p] + B[i]
do p = 1, P
  A = A + pA[p]
Parallel or not?
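The transformation above can be sketched in Python (a sketch under the slides' scheme: per-processor partial sums pA[1:P], then a short cross-processor merge; the block partition stands in for the slides' my_i iteration ranges).

```python
def reduction_parallel(B, P):
    """Per-"processor" partial accumulation followed by a final
    sequential merge, as in the slide's reduction transformation."""
    n = len(B)
    pA = [0] * P
    # Each processor p accumulates its own block of iterations.
    for p in range(P):
        for i in range(p * n // P, (p + 1) * n // P):
            pA[p] += B[i]
    # Final cross-processor reduction into A.
    A = 0
    for p in range(P):
        A += pA[p]
    return A

B = list(range(1, 101))
print(reduction_parallel(B, 4))   # 5050, same as sum(B)
```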
Irregular Applications: The most challenging
for i = .. A[B[i]] = A[C[i]] + B[i]
• From the compiler's viewpoint, in irregular programs:
  – Data arrays are accessed via indirections that are usually input dependent.
  – Optimization and parallelization information is not available statically.

Problem: the loop is sequential if any B[i] = C[j], i ≠ j, or any B[i] = B[j], i ≠ j.
• Adaptation: matrix pivoting, adaptive mesh refinement, etc.
• More than 50% of scientific programs are irregular [Kennedy’94]
• Irregular applications involve problems defined on irregular or adaptive data structures– CFD, Molecular dynamics, sparse linear algebra.
Irregular Programs: an Example
Sparse Symmetric MVM (irregular : data accesses via indirections)
DO I=1,N                    ! Row ID
  DO K=row[I], row[I+1]
    J=col[K]                ! Column ID
    B[I] += M[K] * X[J]     ! M[K]=M[I,J]
    B[J] += M[K] * X[I]     ! M[K]=M[J,I]
[Figure: sparse symmetric matrix-vector product M · X = B; only the nonzero entries of M are stored]

DO I=1,N          ! Row ID
  DO J=1,N        ! Column ID
    B[I] += M[I,J] * X[J]
Dense MVM (regular)
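The sparse symmetric loop above can be sketched in Python (an illustrative reconstruction: the CSR-like names row/col/val are assumptions, and the diagonal is handled separately here so that each stored off-diagonal entry contributes to both B[I] and B[J] without double counting, mirroring the slide's inner loop).

```python
def sym_spmv(row, col, val, diag, X):
    # row/col/val hold the strictly-upper-triangular nonzeros in a
    # CSR-like layout: row[i]..row[i+1]-1 indexes the entries of row i.
    # Each stored M[i,j] contributes to both B[i] and B[j]; the data
    # accesses through col[] are the input-dependent indirections.
    n = len(X)
    B = [diag[i] * X[i] for i in range(n)]
    for i in range(n):
        for k in range(row[i], row[i + 1]):
            j = col[k]
            B[i] += val[k] * X[j]   # M[K] = M[I,J]
            B[j] += val[k] * X[i]   # M[K] = M[J,I]
    return B

# 3x3 symmetric matrix [[4,1,0],[1,5,2],[0,2,6]]
row, col, val = [0, 1, 2, 2], [1, 2], [1, 2]
diag = [4, 5, 6]
print(sym_spmv(row, col, val, diag, [1, 1, 1]))   # [5, 8, 8]
```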
Run-Time Parallelization: The Idea
• Perform program analysis and optimization during program execution (run-time)
• Select dynamically between pre-optimized versions of the code
do i = 1,N
  A[i+K] = A[i] + B[i]        (K, N from input)

Run-time test: is -N < K < N ?
  T: a cross-iteration dependence is possible; execute the sequential version
       do i = 1,N  A[i+K] = A[i] + B[i]
  F: the accesses cannot overlap; execute the parallel version
       doall i = 1,N  A[i+K] = A[i] + B[i]

[Wolfe 97]
Outline
• Motivation
• LRPD test, a speculative run-time parallelization framework
• Recursive LRPD test for partially parallel loops
• A run-time test for loops with sparse memory accesses
• Summary and Current work
Run-time Parallelization: Irregular Apps.
DO i = ...
  A[ W[i] ] = A[ R[i] ] + C[i]

Problem: the loop is not parallel if any R[i] = W[j], i ≠ j.

Solution: instrument the code to
  1. Collect the data access pattern (represented by W[i], R[i])
  2. Verify whether any data dependence could occur

Two approaches: inspector/executor and speculation.
Inspector/executor:
  do i = ...
    trace W[i], R[i]
  analyze and schedule
  doall i = ...
    A[W[i]] = A[R[i]] ...

Speculation:
  doall i = ...
    trace W[i], R[i]
    A[W[i]] = A[R[i]] ...
  analyze
  if (fail) re-execute sequentially:  do i = ...
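The inspector's job can be sketched in Python (an illustrative sketch, not the slides' implementation): trace the W[i]/R[i] pattern and decide whether the loop is fully parallel before executing it.

```python
def inspector_is_parallel(W, R):
    """Inspector sketch for A[W[i]] = A[R[i]] + C[i]: the loop is
    parallel only if no location is written in one iteration and
    touched (read or written) in a different iteration."""
    writes = {}                      # location -> iteration that writes it
    for i, w in enumerate(W):
        if w in writes:              # W[i] == W[j], i != j: output dep.
            return False
        writes[w] = i
    for i, r in enumerate(R):
        j = writes.get(r)
        if j is not None and j != i: # R[i] == W[j], i != j: flow/anti dep.
            return False
    return True

print(inspector_is_parallel([4, 5, 6], [1, 2, 3]))   # True
print(inspector_is_parallel([4, 5, 6], [5, 2, 3]))   # False
```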
Run-time Parallelization: Approaches
Inspector/Executor [Saltz et al 88,91]:
  Inspector (tracing + scheduling) → Executor → End.
  If the access pattern has not changed, the schedule is reused.

Speculative Execution [Rauchwerger & Padua, 95]:
  Checkpoint → speculative execution + tracing → Test.
  Success? Yes → End; No → roll back + sequential loop.
Overview of Speculative Parallelization
Compile time (Polaris): static analysis of the source code, insertion of run-time transformations.

Run time:
  Checkpoint → speculative parallel execution → error detection (data dependence analysis).
  Success? Yes → done; No → restore the checkpoint and execute sequentially.
Run-time Optimization
• Postpones analysis and optimization until execution time
  – checkpoint/restoration mechanism
  – error detection method to test the validity of speculation
• Uses actual run-time values of program parameters affecting performance
• Selects dynamically between pre-optimized versions of the code
Speculative DOALL Parallelization
Main idea:
• Speculatively execute the loop in parallel and record accesses to the data under test in shadow arrays.
• Afterwards, analyze whether the loop was truly parallel (no actual dependences) by identifying multiple accesses to the same location.
Checkpoint → speculative parallel execution + MARK reads and writes → analysis.
Success? Yes → End; No → restore checkpoint and execute sequentially.

[Rauchwerger & Padua ’94]

Problem:
  DO i = ...
    A[ W[i] ] = A[ R[i] ] + C[i]

Each processor marks a replicated (private) shadow of the data array A; the private shadows are later merged into a single shadow for the analysis.
DOALL Test – Marking and Analysis
• Parallel speculative execution
  – Mark read and write operations in per-processor private shadow arrays; when marking a write, clear the read mark.
  – Increment a private write counter (# of write operations).
• Post-speculation analysis
  – Merge the private shadow arrays into global shadow arrays.
  – Count the elements that have been marked written.
  – If (write shadow ∧ read shadow ≠ 0): an anti or flow dependence exists.
  – If (# of marked elements < # of write operations): an output dependence exists.
LRPD Test: Main Ideas
• Lazy (value-based) Reduction and Privatization DOALL test
• Errors:
  – a loop-carried flow dependence
  – a loop-carried anti or output dependence for arrays that are not privatizable
  – speculatively applied privatization or reduction parallelization transformations that turn out to be INVALID
[Rauchwerger & Padua ’95]
The problem: Parallelization of the following loop:
Do I = 1, 5
  z = A(K(I))
  if B(I) then
    A(L(I)) = z + C(I)
  endif
Enddo

B(1:5) = (1,0,1,0,1)
K(1:5) = (1,2,3,4,1)
L(1:5) = (2,2,4,4,2)

• All iterations are executed concurrently.
• Unsafe if some A(K(i)) == A(L(j)), i ≠ j.
Types of errors (data dependence related):
• Writing a memory location in different iterations
• Writing and reading a memory location in different iterations
LRPD Test: an Example
Parallel speculative execution and marking phase:
• allocate shadow arrays Aw, Ar, Anp (one per processor)
• speculatively privatize A and execute the loop in parallel
• record accesses to the data under test in the shadows

markwrite(i):
  if this is the first write to A(i) in the iteration:
    mark Aw(i); clear Ar(i); increment tw_A (write counter)

markread(i):
  if A(i) has not already been written in the iteration:
    mark Ar(i); mark Anp(i) (not privatizable)

Original loop:
  do i = 1, 5
S1  z = A[K[i]]
    if (B[i]) then
S2    A[L[i]] = z + C[i]
    endif
  enddo

Instrumented loop (markwrite increments tw_A):
  doall i = 1, 5
S1  z = A[K[i]]
    if (B[i]) then
      markread(K[i])
      markwrite(L[i])
S2    A[L[i]] = z + C[i]
    endif
  enddo
LRPD Test: Marking Phase
Post-execution analysis phase: detect errors (dependences) by identifying multiple accesses to the same location.

• compute tm(A) = sum of marks in Aw across processors (number of distinct elements written)
• if Aw ∧ Ar ≠ 0 then the loop was NOT a DOALL
• else if tw = tm then the loop was a DOALL
• else if Aw ∧ Anp ≠ 0 then the loop was NOT a DOALL
• otherwise privatization was valid and the loop was a DOALL
Shadow array   1 2 3 4   Tw (attempted)  Tm (counted)  Outcome
Aw(1:4)        0 1 0 1   3               2             FAIL
Ar(1:4)        1 0 1 0
Anp(1:4)       1 0 1 0
Aw ∧ Ar        0 0 0 0                                 Pass
Aw ∧ Anp       0 0 0 0                                 Pass
LRPD Test: Analysis Phase
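The marking and analysis phases on this example can be sketched in Python (a simplified single-processor sketch of the slides' test: since in this loop the read of A(K(i)) always precedes the write of A(L(i)) within an iteration, the Ar-clearing case of markwrite never triggers and is omitted).

```python
def lrpd(K, L, B, n_elems):
    """LRPD sketch for: z = A(K(i)); if B(i) then A(L(i)) = z + C(i)."""
    Aw = [0] * n_elems   # written
    Ar = [0] * n_elems   # read before any write in its iteration
    Anp = [0] * n_elems  # not privatizable
    tw = 0               # total write operations attempted
    for i in range(len(K)):
        if not B[i]:
            continue
        r, w = K[i] - 1, L[i] - 1        # 1-based -> 0-based
        Ar[r] = 1                        # markread
        Anp[r] = 1
        Aw[w] = 1                        # markwrite
        tw += 1
    tm = sum(Aw)                         # distinct elements written
    if any(a and b for a, b in zip(Aw, Ar)):
        return "NOT a DOALL"             # flow or anti dependence
    if tw == tm:
        return "DOALL"
    if any(a and b for a, b in zip(Aw, Anp)):
        return "NOT a DOALL"             # output dep., not privatizable
    return "DOALL (privatized)"

# Slides' example: Tw=3 vs Tm=2 fails the plain DOALL check, but
# Aw ∧ Ar = 0 and Aw ∧ Anp = 0, so the privatized loop is a DOALL.
print(lrpd(K=[1, 2, 3, 4, 1], L=[2, 2, 4, 4, 2],
           B=[1, 0, 1, 0, 1], n_elems=4))   # DOALL (privatized)
```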
Outline
• Motivation
• LRPD test, a speculative run-time parallelization framework
• Recursive LRPD test for partially parallel loops
• A run-time test for loops with sparse memory accesses
• Summary and Current work
do i = 1, 8
  z = A[K[i]]
  A[L[i]] = z + C[i]
end do

K[1:8] = [1,2,3,1,44,2,1,1]
L[1:8] = [44,5,5,44,3,5,3,3]
A( )   iter:  1  2  3  4  5  6  7  8
  1           R        R        R  R
  2              R           R
  3                 R     W     W  W
 44           W        W  R
  5              W  W        W
For the LRPD test:
– One data dependence can invalidate the speculative parallelization.
– The slowdown is proportional to the speculative parallel execution time.
– Partial parallelism is not exploited.
Partially Parallel Loop Example
• Main idea
  – Transform a partially parallel loop into a sequence of fully parallel, block-scheduled loops.
  – Iterations before the first data dependence are correct and committed.
  – Re-apply the LRPD test on the remaining iterations.
• Worst case
  – Sequential time plus testing overhead.
[Dang, Yu and Rauchwerger’02]
The Recursive LRPD
Algorithm:
  Initialize → Checkpoint → Execute as DOALL → Analyze.
  On success: Commit.
  On failure: Restore, Reinitialize, Restart on the remaining iterations.

[Figure: block-scheduled iterations on processors p0-p3; after the 1st stage the committed prefix is dropped and the 2nd stage re-executes the remainder]
Example
do i = 1, 8
B[i] = f(i)
z = A[K[i]]
A[L[i]] = z + C[i]
enddo
L[1:8] = [2,2,4,4,2,1,5,5]
K[1:8] = [1,2,3,4,1,2,4,2]
start = newstart = 1; success = false; end = 8
initialize shadow array; checkpoint B
while (.not. success)
doall i = newstart, end
B[i] = f(i)
z = pA[K[i]]
pA[L[i]] = z + C[i]
markread(K[i]); markwrite(L[i])
end doall
analyze(success, newstart)
commit(pA, A, start, newstart-1)
if (.not. success) then
restore B[newstart:end]
reinitialize shadow array
endif
end while
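The staging strategy of the while loop above can be sketched in Python (an illustrative simplification, not the production algorithm: with A privatized via copy-in/copy-out, only cross-processor flow dependences, where one iteration reads a location an earlier iteration on another processor wrote, are modeled as invalidating a stage).

```python
def recursive_lrpd_stages(K, L, P):
    """Count R-LRPD stages for: z = A[K[i]]; A[L[i]] = z + C[i].
    Each stage block-schedules the remaining iterations on P processors,
    finds the earliest iteration in a cross-processor flow dependence,
    commits everything before it, and retries the rest."""
    n = len(K)
    start, stages = 0, 0
    while start < n:
        stages += 1
        block = -(-(n - start) // P)          # block-scheduled chunk size

        def fails_at():
            for i in range(start, n):
                for j in range(start, i):
                    cross = (j - start) // block != (i - start) // block
                    if cross and L[j] == K[i]:
                        return i              # earliest offending iteration
            return None

        bad = fails_at()
        if bad is None:
            break                             # stage fully parallel: done
        start = bad                           # commit the prefix, retry rest
    return stages

# Slides' example: stage 1 commits iterations 1-4, stage 2 runs the rest.
K = [1, 2, 3, 1, 44, 2, 1, 1]
L = [44, 5, 5, 44, 3, 5, 3, 3]
print(recursive_lrpd_stages(K, L, P=4))   # 2
```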
• Implemented as a run-time pass in Polaris plus additional hand-inserted code:
  – Privatization with copy-in/copy-out for the arrays under test.
  – Replicated buffers for reductions.
  – Backup arrays for checkpointing.
Implementation
First stage: detect cross-processor data dependences

A( )   proc:  P1(1-2)  P2(3-4)  P3(5-6)  P4(7-8)
  1           R        R                 R
  2           R                 R
  3                    R        W        W
 44           W        W        R
  5           W        W        W

Second stage (iterations 5-8): fully parallel

A( )   proc:                    P3(5-6)  P4(7-8)
  1                                      R
  2                             R
  3                             W        W
 44                             R
  5                             W
do i = 1, 8
z = A[K[i]]
A[L[i]] = z + C[i]
end do
K[1:8] = [1,2,3,1,44,2,1,1]
L[1:8] = [44,5,5,44,3,5,3,3]
Recursive LRPD Example
• Redistribute remaining iterations across processors.
• Execution time for each stage will decrease.
• Disadvantages:
  – May uncover new dependences across processors.
  – May incur remote cache misses from data redistribution.
[Figure: with redistribution, the remaining iterations are re-spread across p1-p4 before the 2nd stage; without redistribution, they stay on their original processors]
Work Redistribution
Outline
• Motivation
• LRPD test, a speculative run-time parallelization framework
• Recursive LRPD test for partially parallel loops
• A run-time test for loops with sparse memory accesses
• Summary and Current work
Question: do multiple accesses to the same location exist? There are two ways to log the information needed to answer it:

1. For each data element: which operations accessed it?
   – Complexity: proportional to the number of elements.
2. For each memory operation: which element did it access?
   – Complexity: proportional to the number of iterations.

[Figure: dense vs. sparse access patterns of operations (iterations) 1-8 over the data elements]
Overhead of the LRPD test (first way):
• Marking (speculation) phase: proportional to the number of operations.
• Analysis phase: proportional to the number of elements; not efficient for loops with "sparse accesses".
[Yu and Rauchwerger’00]
Reduce Overhead of Run-Time Test
– Use a compacted shadow structure for loops with sparse access patterns (second way):
  1. Closed form – monotonic access with constant stride
  2. List – monotonic access with variable stride
  3. Hash table – random access
– The run-time library adaptively selects among closed form, list and hash shadow structures.
– A compile-time technique reduces redundant markings.
Run-Time Test for Loops with Sparse Access
• Speculative execution
  – For every static marking site, mark in a temporary private shadow structure.
  – At the end of each iteration, adaptively aggregate the markings (triplet → list → hash table).
  – Overhead: proportional to the number of distinct array references.
• Analysis phase
  – Compares the aggregated shadow structures pair by pair.
  – May reduce to comparisons of ranges or triplets.
  – Overhead: proportional to the number of dynamic marking sites, a constant fraction of the number of distinct array references.
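The adaptive choice among closed form, list and hash shadows can be sketched in Python (an illustrative sketch; the triplet encoding and the function names are assumptions, not the library's API).

```python
def compress_accesses(indices):
    """Adaptive shadow sketch: monotonic accesses with constant stride
    compress to a (start, stride, count) triplet; monotonic with variable
    stride to a list; anything else falls back to a hash set. Overhead
    then tracks the number of distinct references, not the array size."""
    if len(indices) >= 2:
        stride = indices[1] - indices[0]
        diffs = [b - a for a, b in zip(indices, indices[1:])]
        if all(d == stride for d in diffs) and stride > 0:
            return ("triplet", (indices[0], stride, len(indices)))
        if all(d > 0 for d in diffs):
            return ("list", list(indices))
    return ("hash", set(indices))

def overlap(a, b):
    """Analysis-phase sketch: do two compacted shadows share an element?
    (Shown here via the fully general expand-to-set fallback.)"""
    def expand(kind, data):
        if kind == "triplet":
            start, stride, count = data
            return {start + k * stride for k in range(count)}
        return set(data)
    return bool(expand(*a) & expand(*b))

w = compress_accesses([2, 4, 6, 8])     # constant stride -> triplet
r = compress_accesses([1, 5, 30, 7])    # random -> hash
print(w[0], r[0], overlap(w, r))        # triplet hash False
```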
Combine Marks
Before:
  do ...
    if (pred1) then
      A(1+W(i)) = ...
      A(2+W(i)) = ...
    ...
    if (pred1) then
      A(3+W(i)) = ...

After (one combined mark):
  do ...
    if (pred1) then
      Mark(A, W(i), WF)
      A(1+W(i)) = ...
      A(2+W(i)) = ...
    ...
    if (pred1) then
      A(3+W(i)) = ...
• One mark suffices for multiple references if:
  – their subscripts differ only by a loop-invariant expression,
  – they are of the same type among RO, RW, WR,
  – they have the same guarding predicates.
• Combining procedure:
  – Partition the subscript expressions.
  – Apply set operations during a recursive traversal of the CDG.
Outline
• Motivation
• LRPD test, a speculative run-time parallelization framework
• Recursive LRPD test for partially parallel loops
• A run-time test for loops with sparse memory accesses
• Summary and Current work
Effect of Speculative Parallelization
[Figure: speedups on 2, 4, 8 and 16 processors for TRACK, SPICE, FMA3D and MDG, ranging up to roughly 6x]
Program   Techniques              Coverage   Suite
TRACK     R-LRPD                  98%        Perfect
SPICE     Sparse Test / R-LRPD    89%        SPEC’92
FMA3D     R-LRPD                  71%        SPEC’00
MDG       LRPD                    99%        Perfect
Speculative Run-time Parallelization: Summary
• Run-time techniques apply program analysis and optimization transformations during program execution.
• Speculative run-time parallelization techniques (LRPD test, etc.) collect memory access history while executing a loop in parallel.
• Recursive LRPD test can speculatively parallelize any loop.
• The overhead of run-time speculation can be further reduced by adaptively applying different shadow data structures.
The New Computing Challenge
• Today’s systems: general purpose, heterogeneous
  – Poor portability, low efficiency
  – Need automatic system-level software support
GAUSSIAN Quantum chemistry system
CHARMM Molecular dynamics of organic systems
SPICE Circuit simulation
ASCI Multi-physics simulations
• The Challenge: Easy to Use & High Performance
• Today’s scientific applications: bio, multi-physics, etc.
  – Time consuming, with dynamic features and irregular data structures.
  – Need automatic optimization techniques to shorten execution time.
Today: System Centric Computing
Development, analysis and optimization happen in isolation: the application (algorithm) passes through a static compiler, and the executable meets the input data only at execution time on a generic system (OS & architecture). There is no global optimization in the interest of dynamic applications:
• OS services are generic
• Architecture is generic
• Compilers are conservative

Stack: Application → Compiler → OS → HW.
Approach: Application-Centric Computing (SmartApps)

Application development, analysis and optimization feed a run-time compiler, and a run-time system carries out execution, analysis and optimization using the input data, on a modular OS and a reconfigurable architecture. The SmartApp keeps the application in control: instance-specific optimization combining compiler + OS + architecture + data + feedback.

Stack: Input Data → Application → Compiler → OS → HW.
SmartApps System Architecture

1. A parallelizing compiler, augmented with run-time techniques, produces a configurable executable carrying compiler-internal information.
2. The run-time system gathers run-time information (sample input, system state, etc.) and generates an optimal application and system configuration.
3. The application executes while adaptive software continuously monitors performance and adapts as necessary.
4. Adaptation:
   – Small adaptation (tuning): run-time tuning, with no recompilation or reconfiguration.
   – Large adaptation (failure, phase change): recompile the application and/or reconfigure the system.
Related Publications
The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization, Lawrence Rauchwerger and David Padua, PLDI’95
Parallelizing While Loops for Multiprocessor Systems, Lawrence Rauchwerger and David Padua, IPPS’95
Run-time Methods for Parallelizing Partially Parallel Loops, Lawrence Rauchwerger, Nancy Amato and David Padua, ICS’95
SmartApps: An Application Centric Approach to High Performance Computing: Compiler-Assisted Software and Hardware Support for Reduction Operations, F. Dang, M. J. Garzaran, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, N. Amato, L. Rauchwerger and J. Torrellas, NSFNGS, 2002
The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops, F. Dang, H. Yu and L. Rauchwerger, IPDPS’02
Hybrid Analysis: Static & Dynamic Memory Reference Analysis, S. Rus, L. Rauchwerger and J. Hoeflinger, ICS’02
Techniques for Reducing the Overhead of Run-time Parallelization, H. Yu and L. Rauchwerger, CC’00
Adaptive Reduction Parallelization Techniques, H. Yu and L. Rauchwerger, ICS’00
http://parasol.tamu.edu/