Parallel Programming using the Iteration Space Visualizer


Yijun Yu and Erik H. D'Hollander

University of Ghent, Belgium
http://www.elis.rug.ac.be/paris/ppt

• Introduction
• Overview of the approach
    • interactive vs automatic
• Loop dependence
    • Iteration Space Dependence Graph (ISDG)
    • Instrumentation and construction of the ISDG
• Visualization of …
    • Dependence
    • Transformations
• Applications and Results
• Conclusion and Future work

Overview of the approach

[Figure: flowchart comparing two routes from Program to Code Generation. Parallel Compiler (automatic): Dependence Analysis, Dataflow Analysis, Program Transformation. Iteration Space Visualizer (interactive): Instrument the program, Construct the ISDG, Visualize dependence, Visualize transformation. Annotations: "exact?" and "why?".]


Loop Dependence

• Nested loops are the focus of parallel programming
• A data dependence occurs when multiple accesses touch the same memory location and at least one of them is a WRITE
• Data dependences are classified as flow (first WRITE then READ), anti (first READ then WRITE) or output (WRITE after WRITE)
• A loop dependence is the ordering between data-dependent loop iterations
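The classification above depends only on the order of the two accesses. A minimal Python sketch follows; the function name `classify` and the 'r'/'w' encoding are illustrative assumptions, not part of the Iteration Space Visualizer.

```python
def classify(first, second):
    """Classify the dependence between two accesses to one memory location.

    `first` and `second` are 'r' (READ) or 'w' (WRITE), in execution order.
    Returns None for two READs, which impose no ordering.
    """
    kinds = {
        ("w", "r"): "flow",    # first WRITE then READ
        ("r", "w"): "anti",    # first READ then WRITE
        ("w", "w"): "output",  # WRITE after WRITE
    }
    return kinds.get((first, second))


print(classify("w", "r"))  # flow
```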

The Iteration Space Dependence Graph (ISDG)

• The object to be visualized: ISDG = Iteration Space + Loop Dependence
• An iteration I = (i1, ..., im) is a point in the m-dimensional iteration space, which is mapped to 3D space
• Dependent iterations I and J are linked by an arrow I → J

An example of ISDG

      do i=1,n
        do j=1,n
          do k=1,2
            if (k.eq.1) then
              a(i,j,k)=(a(i-1,j,k)+a(i+1,j,k))/2
            else
              a(i,j,k)=(a(i,j-1,k)+a(i,j+1,k))/2
            endif
          enddo
        enddo
      enddo

[Figure: ISDG of the loop for n = 4, with axes i, j, k and iteration points (1,1,1) through (4,4,2).]

Instrumentation and the ISDG construction

• Program instrumentation
    • Loop iteration: id + indices
    • Array reference: id + name + Read | Write + subscripts
• ISDG construction
    1. Create the iteration points from the indices
    2. Set up a reference list for every accessed location
    3. Mark flow-, anti- and output-dependence arrows
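The three construction steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the trace is an in-memory list of `(iteration, mode, location)` tuples rather than the instrumented program's output file, and the names `build_isdg` and `example_trace` are hypothetical. The trace is generated from the example loop on the "An example of ISDG" slide with n = 4.

```python
from itertools import product


def build_isdg(trace):
    """Build dependence arrows from a sequential access trace.

    `trace` lists (iteration, mode, location) records in execution order,
    with mode 'r' or 'w'. Returns a set of (source, sink, kind) arrows.
    """
    last_write = {}   # location -> iteration of the most recent WRITE
    reads = {}        # location -> iterations that READ since that WRITE
    arrows = set()
    for it, mode, loc in trace:
        writer = last_write.get(loc)
        if mode == "r":
            if writer is not None and writer != it:
                arrows.add((writer, it, "flow"))
            reads.setdefault(loc, []).append(it)
        else:
            for reader in reads.get(loc, []):
                if reader != it:
                    arrows.add((reader, it, "anti"))
            if writer is not None and writer != it:
                arrows.add((writer, it, "output"))
            last_write[loc] = it
            reads[loc] = []
    return arrows


def example_trace(n):
    """Accesses of the three-deep example loop, in sequential order."""
    trace = []
    for i, j, k in product(range(1, n + 1), range(1, n + 1), (1, 2)):
        it = (i, j, k)
        if k == 1:
            sources = [(i - 1, j, k), (i + 1, j, k)]
        else:
            sources = [(i, j - 1, k), (i, j + 1, k)]
        for s in sources:
            trace.append((it, "r", ("a",) + s))
        trace.append((it, "w", ("a",) + it))
    return trace


arrows = build_isdg(example_trace(4))
# Distance vector of an arrow = sink iteration minus source iteration.
distances = {
    (kind, tuple(d - s for s, d in zip(src, dst))) for src, dst, kind in arrows
}
print(sorted(distances))
```

In this sketch the k = 1 layer only produces arrows along i and the k = 2 layer only along j, which is the structure the 3D view makes visible.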


Dependence Visualization

• Loop visualization
    • 3D view-port of the iteration space
    • Graphical operations
• Detecting and enhancing parallelism
    • Automatic parallelization
    • Maximal parallelism detection
    • Parallelization by plane execution

Loop Visualization

• Visualization of the ISDG
    • Points + Arrows + Colors + Labels + Axes
• 3D view-port of the iteration space
    • =3D, >3D and <3D
    • projection (condensed points and arrows)
    • expansion (dummy index dimension)
• ISDG operations
    • Graphical operations: rotate, move and animate
    • Query dialogs: selection, variable zooming and dependence type filtering, etc.

Automatic Parallelization

• Sequential execution
    • Traverse the iteration space in lexicographical order and count the iterations: Tseq
• Parallel execution
    • Traverse the iterations of a marked loop in parallel and count the steps: Tpar
• Report speedup: Spara = Tseq / Tpar
• Automatic parallelization
    • Test whether the dependence ordering is kept for all combinations of loop parallelizations:
      DOALL i1,i2,i3? + DOALL i1,i2? + DOALL i1,i3? + DOALL i2,i3? + DOALL i1? + DOALL i2? + DOALL i3?
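The exhaustive DOALL test can be sketched in Python. The sketch assumes the usual criterion that a loop may run as DOALL when it carries no dependence, and an idealized cost model in which each distinct setting of the sequential loop indices costs one step; the function names are hypothetical. The data are the dependences of the example loop from the "An example of ISDG" slide for n = 4.

```python
from itertools import combinations, product


def doall_valid(edges, parallel):
    """True if no dependence is carried by a loop marked parallel.

    A dependence is carried by the outermost loop whose index differs
    between its source and sink iterations.
    """
    for src, dst in edges:
        carrier = next((d for d in range(len(src)) if src[d] != dst[d]), None)
        if carrier in parallel:
            return False
    return True


def parallel_steps(iterations, parallel):
    """Tpar: number of distinct index values of the sequential loops."""
    dims = len(iterations[0])
    seq = [d for d in range(dims) if d not in parallel]
    return len({tuple(it[d] for d in seq) for it in iterations})


def best_doall(iterations, edges):
    """Try every combination of DOALL loops; return (speedup, loops)."""
    dims = len(iterations[0])
    t_seq = len(iterations)
    best = (1.0, ())
    for r in range(dims, 0, -1):
        for par in combinations(range(dims), r):
            if doall_valid(edges, par):
                s = t_seq / parallel_steps(iterations, par)
                if s > best[0]:
                    best = (s, par)
    return best


n = 4
iterations = list(product(range(1, n + 1), range(1, n + 1), (1, 2)))
# k = 1 layer: arrows along i; k = 2 layer: arrows along j.
edges = [((i, j, 1), (i + 1, j, 1)) for i in range(1, n) for j in range(1, n + 1)]
edges += [((i, j, 2), (i, j + 1, 2)) for i in range(1, n + 1) for j in range(1, n)]

speedup, loops = best_doall(iterations, edges)
print(speedup, loops)
```

Loop indices are counted from 0 here, so `loops == (2,)` means that only the innermost k loop passes the test for this example.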

Maximal Parallelism Detection

• Data-flow order
    • An iteration is executed as soon as its data are ready, i.e. after all the iterations it depends on have been carried out
    • Iterations with the same delay are executed at the same time, i.e. in parallel
    • Dependent iterations are executed sequentially. Count the steps: Tdf
• Minimal execution time = maximal parallelism
• Maximal speedup: Smax = Tseq/Tdf
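Data-flow execution amounts to assigning each iteration a level one higher than the deepest iteration it depends on. A minimal Python sketch, assuming unit-time iterations, unlimited processors, and arrows that always point from an earlier to a later iteration in sequential order; `dataflow_levels` is a hypothetical name. The data are again the dependences of the example loop from the "An example of ISDG" slide for n = 4.

```python
from itertools import product


def dataflow_levels(iterations, edges):
    """Earliest step of each iteration under data-flow execution."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, []).append(src)
    level = {}
    # Sorted tuples follow lexicographical (sequential) order, so every
    # predecessor already has its level when an iteration is visited.
    for it in sorted(iterations):
        level[it] = 1 + max((level[p] for p in preds.get(it, [])), default=0)
    return level


n = 4
iterations = list(product(range(1, n + 1), range(1, n + 1), (1, 2)))
edges = [((i, j, 1), (i + 1, j, 1)) for i in range(1, n) for j in range(1, n + 1)]
edges += [((i, j, 2), (i, j + 1, 2)) for i in range(1, n + 1) for j in range(1, n)]

levels = dataflow_levels(iterations, edges)
t_df = max(levels.values())          # Tdf: number of data-flow steps
s_max = len(iterations) / t_df       # Smax = Tseq / Tdf
print(t_df, s_max)
```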

Plane Parallelization

• Define a cutting plane Ax+By+Cz=D
    • By clicking on three points
    • By giving the parameters A, B, C, D
• Plane execution
    • Traverse the planes d0 ≤ Ax+By+Cz < d0+Td along the normal vector (A,B,C)
• Plane parallelization
    • Matching the dataflow execution may enhance the speedup: Splane = Tseq/Td
    • Verified by cross-plane dependence checking or 3D→2D projection checking
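Plane execution and the cross-plane check can be sketched as follows. The sketch assumes a wavefront reading of the slide: planes are executed one after another along the normal vector, all iterations on one plane run in parallel, and the plane is accepted only if every dependence arrow goes from a lower to a strictly higher plane. `plane_execution` is a hypothetical name, and the data are the dependences of the example loop from the "An example of ISDG" slide for n = 4.

```python
from itertools import product


def plane_execution(iterations, edges, normal):
    """Return (valid, steps) for executing planes normal . x = d in order.

    `steps` counts the non-empty planes (Td); `valid` is the cross-plane
    dependence check.
    """
    def offset(point):
        return sum(c * x for c, x in zip(normal, point))

    valid = all(offset(src) < offset(dst) for src, dst in edges)
    steps = len({offset(p) for p in iterations})
    return valid, steps


n = 4
iterations = list(product(range(1, n + 1), range(1, n + 1), (1, 2)))
edges = [((i, j, 1), (i + 1, j, 1)) for i in range(1, n) for j in range(1, n + 1)]
edges += [((i, j, 2), (i, j + 1, 2)) for i in range(1, n + 1) for j in range(1, n)]

valid, t_d = plane_execution(iterations, edges, (1, 1, 0))    # planes i + j = d
s_plane = len(iterations) / t_d                               # Splane = Tseq / Td
rejected, _ = plane_execution(iterations, edges, (1, 0, 0))   # planes i = d
print(valid, t_d, round(s_plane, 2), rejected)
```

The planes i = d are rejected because the arrows of the k = 2 layer stay inside one such plane.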

Dependence Visualization: procedural summary

[Figure: flowchart. Boxes: Start; Maximal parallelism detection (Sdf); Automatic parallelization (Spara); Plane parallelization (Splane); Prune false dependences; Program transformation; End. Decisions with Yes/No branches: Spara = Sdf? and Splane > Spara?]

Program Transformations

When Sdf > Spara, loop transformations may enhance the parallelism of the target loop.

• Unimodular loop transformations
    • Why? 3D → 3D, 1-to-1, etc.
• Loop projections and expansions
    • Loop projection: >3D → 3D
    • Loop expansion: <3D → 3D

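Unimodular Transformations

[Figure: the unknown 3x3 transformation matrix is filled in step by step. The normal vector (A,B,C) of the cutting plane becomes the first row; the remaining entries must make the matrix unimodular and the transformation legal.]

    | ? ? ? |      | A B C |      | A B C |
    | ? ? ? |  →   | ? ? ? |  →   | ! ! ! |
    | ? ? ? |      | ? ? ? |      | ! ! ! |

• Unimodular
• Legality

Look for a suitable transformation

• Interactive way
• Automatic way
    • Possible when the array index expressions are linear and all the distance vectors lie in a plane
    • Extract the largest base vectors of the dependence distances and construct the transformation (pseudo distance matrix approach)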
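The two conditions can be checked mechanically. A minimal Python sketch, assuming the standard definitions: a matrix is unimodular when it is an integer matrix with determinant +1 or -1, and a transformation is legal when every transformed distance vector stays lexicographically positive. The helper names are hypothetical. The matrix `T` is the one used for Lim's example later in these slides; the two distance vectors were derived by hand from the loop-expanded form of that example and should be read as an assumption.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def apply(m, v):
    """Matrix-vector product, returned as a tuple."""
    return tuple(sum(r * x for r, x in zip(row, v)) for row in m)


def is_unimodular(m):
    return det3(m) in (1, -1)


def is_legal(m, distances):
    """Every transformed distance vector must stay lexicographically positive."""
    zero = (0,) * len(m)
    # Python compares tuples lexicographically.
    return all(apply(m, d) > zero for d in distances)


T = ((1, -1, 1), (1, 0, 0), (0, 1, 0))            # first row: plane normal
distances = [(1, 0, -1), (0, 1, 1)]               # assumed, hand-derived
interchange = ((0, 0, 1), (0, 1, 0), (1, 0, 0))   # swaps loops 1 and 3

print(is_unimodular(T), is_legal(T, distances))
print(is_unimodular(interchange), is_legal(interchange, distances))
print([apply(T, d) for d in distances])
```

Both transformed distances have a zero first component, so under these assumptions the new outermost loop carries no dependence. The plain interchange is unimodular but not legal for the same distances.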

Loop Expansion

• Non-perfectly vs perfectly nested loops
• Statement-level vs iteration-level parallelism
• Statement reordering, affine remapping, loop expansion
• Use an additional dimension to index the statements in the loop body
• Unimodular loop transformations are still applicable at the statement level


Applications and Results

• Gauss-Jordan: linear system solver
• Lim’s example: statement-level parallelism
• Cholesky kernel: loop projection
• CFD application: unimodular transformation

Gauss-Jordan elimination

      do i=1,n
        do j=1,n
          if (i.ne.j) then
            f=a(j,i)/a(i,i)
C$doisv
            do k=i+1,n+1
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo

C Instrumented version: the trace is written to unit 11
C Array reference record: id, r|w, name, subscripts
      id=0
      do i = 1,n
        do j = 1,n
          if (i.ne.j) then
            write(11,*) id+1," r ","a",2,j,i
            write(11,*) id+1," r ","a",2,i,i
            write(11,*) id+1," w ","f"," 1 0 "
            f=a(j,i)/a(i,i)
            do k = i+1,n+1
              id=id+1
C Loop iteration record: id, indices
              write(11,*) id,i,j,k
              write(11,*) id," r ","a",2,j,k
              write(11,*) id," r ","f"," 1 0 "
              write(11,*) id," r ","a",2,i,k
              write(11,*) id," w ","a",2,j,k
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo

Gauss-Jordan elimination

[Figure: ISDG of the Gauss-Jordan loop for n = 4, with axes I, J, K and labelled iteration points from (1,2,2) to (4,3,5). Plane: I = 1.]

DOALL J, K valid
Seq. time: 30
Dataflow: 4, Speedup: 7.5
Loop time: 4, Speedup: 7.5

Lim’s Example

The original program

      do l1=1,n
        do l2=1,n
          a(l1,l2)=a(l1,l2)+b(l1-1,l2)
          b(l1,l2)=a(l1,l2-1)*b(l1,l2)
        enddo
      enddo

Loop Expansion

      do l1=1,n
        do l2=1,n
c$doisv
          do l3=0,1
            if(l3.eq.0) a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if(l3.eq.1) b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddo

Lim’s example: unimodular transformation

Before the transformation (axes l1, l2, l3):
    Plane: L1-L2+L3=0
    DOALL L3 valid
    Seq. time: 32
    Dataflow: 7, Speedup: 4.57
    Loop time: 16, Speedup: 2.00

Transformation matrix:

    |  1 -1  1 |
    |  1  0  0 |
    |  0  1  0 |

After the transformation (axes i1, i2, i3):
    Plane: i1 = 0
    DOALL i1 valid
    Seq. time: 32
    Dataflow: 7, Speedup: 4.57
    Loop time: 7, Speedup: 4.57

Lim’s example: code generation

C The unimodular transformed code
      doall i1 = 1-n, n
        do i2 = max(i1,1), min(n,i1+n)
          do i3 = max(-i1+i2,1), min(-i1+i2+1,n)
            l1 = i2
            l2 = i3
            l3 = i1 - i2 + i3
            if (l3.eq.0) a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if (l3.eq.1) b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddoall

Transformation matrix (new loop bounds by Fourier-Motzkin elimination):

    |  1 -1  1 |
    |  1  0  0 |
    |  0  1  0 |

Inverse matrix (original indices by inversion):

    |  0  1  0 |
    |  0  0  1 |
    |  1 -1  1 |
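One way to gain confidence in generated bounds is to scan the transformed nest and compare the visited points with the original iteration space. The Python sketch below mirrors the transformed loop nest for n = 4, with the original space taken as l1, l2 in 1..n and l3 in {0, 1} as in the loop-expanded program; the variable names `visited` and `per_plane` are illustrative.

```python
from itertools import product

n = 4
visited = []      # original iterations (l1, l2, l3), in transformed order
per_plane = {}    # number of iterations executed for each value of i1

for i1 in range(1 - n, n + 1):
    for i2 in range(max(i1, 1), min(n, i1 + n) + 1):
        for i3 in range(max(-i1 + i2, 1), min(-i1 + i2 + 1, n) + 1):
            # Inverse mapping back to the original indices.
            l1, l2, l3 = i2, i3, i1 - i2 + i3
            visited.append((l1, l2, l3))
            per_plane[i1] = per_plane.get(i1, 0) + 1

original = set(product(range(1, n + 1), range(1, n + 1), (0, 1)))
print(len(visited), set(visited) == original, max(per_plane.values()))
```

Every original iteration is visited exactly once, and the largest number of iterations on one i1 plane equals the loop time of 7 reported for DOALL i1.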

symbolic n;
IS1:={[i,j,k]:1<=i,j<=n && k=0};
IS2:={[i,j,k]:1<=i,j<=n && k=1};
T1:={[i,j,k]->[i-j+k,i,j]};
T2:={[i,j,k]->[i-j+k,i,j]};
codegen 0 T1:IS1,T2:IS2;

Transformation matrix:

    |  1 -1  1 |
    |  1  0  0 |
    |  0  1  0 |

I’ = I - J + K
J’ = I
K’ = J

Transformation matrix:

    |  1 -1  1 |
    |  1  0  0 |
    |  0  1  0 |

C the optimized code by Omega calculator
      doall p = 1-n, n
        if (p.ge.1) b(p,1) = a(p,0) * b(p,1)
        do l1 = max(p+1,1), min(p+n-1,n)
          a(l1,l1-p) = a(l1,l1-p)+b(l1-1,l1-p)
          b(l1,l1-p+1) = a(l1,l1-p)*b(l1,l1-p+1)
        enddo
        if (p.le.0) a(p+n,n) = a(p+n,n)+b(p+n-1,n)
      enddoall
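The optimized schedule can be compared against the original program on small data. The Python sketch below runs Lim's original loop and the plane-by-plane version, with the second body statement writing b(l1,l1-p+1) as statement 2 of the original program does. The dictionary-based arrays, the integer initial values, and the function names are illustrative assumptions. The planes are executed in both increasing and decreasing order of p to mimic the freedom a doall gives.

```python
def initial(n):
    """Small integer arrays indexed 0..n in both dimensions."""
    a = {(i, j): i + 2 * j + 1 for i in range(n + 1) for j in range(n + 1)}
    b = {(i, j): 2 * i + j + 1 for i in range(n + 1) for j in range(n + 1)}
    return a, b


def run_original(n):
    a, b = initial(n)
    for l1 in range(1, n + 1):
        for l2 in range(1, n + 1):
            a[l1, l2] = a[l1, l2] + b[l1 - 1, l2]
            b[l1, l2] = a[l1, l2 - 1] * b[l1, l2]
    return a, b


def run_transformed(n, plane_order):
    a, b = initial(n)
    for p in plane_order:
        if p >= 1:
            b[p, 1] = a[p, 0] * b[p, 1]
        for l1 in range(max(p + 1, 1), min(p + n - 1, n) + 1):
            a[l1, l1 - p] = a[l1, l1 - p] + b[l1 - 1, l1 - p]
            b[l1, l1 - p + 1] = a[l1, l1 - p] * b[l1, l1 - p + 1]
        if p <= 0:
            a[p + n, n] = a[p + n, n] + b[p + n - 1, n]
    return a, b


n = 4
planes = list(range(1 - n, n + 1))
same_forward = run_original(n) == run_transformed(n, planes)
same_backward = run_original(n) == run_transformed(n, planes[::-1])
print(same_forward, same_backward)
```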

Cholesky kernel

C THE ORIGINAL KERNEL
      DO 6 I = 0, NRHS
        DO 7 K = 0, N
          DO 8 L = 0, NMAT
 8          B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 7 J = 1, MIN (M, N-K)
            DO 7 L = 0, NMAT
 7            B(I,L,K+J) = B(I,L,K+J) - A(L,-J,K+J) * B(I,L,K)
        DO 6 K = N, 0, -1
          DO 9 L = 0, NMAT
 9          B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 6 J = 1, MIN (M, K)
            DO 6 L = 0, NMAT
 6            B(I,L,K-J) = B(I,L,K-J) - A(L,-J,K) * B(I,L,K)

Loop Fusion, giving the loop nest (I,K,J,L):

      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
        IF (K.LE.N) THEN
          I0 = MIN(M,N-K)
        ELSE
          I0 = MIN(M,2*N-K+1)
        ENDIF
      DO 1 J = 0,I0
C$DOISV
      DO 1 L = 0,NMAT
        IF (K.LE.N) THEN
          IF (J.EQ.0) THEN
 8          B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
 7          B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
          ENDIF
        ELSE
          IF (J.EQ.0) THEN
 9          B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
 6          B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
          ENDIF
        ENDIF
 1    CONTINUE

Cholesky Kernel

[Figure: Loop Projections of the 4D iteration space (I,K,J,L) into 3D views, one with axes I, K, J and one with axes I, K, L. Plane: L=0.]

Cholesky kernel (I,K,J,L)

      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
        IF (K.LE.N) THEN
          I0 = MIN(M,N-K)
        ELSE
          I0 = MIN(M,2*N-K+1)
        ENDIF
      DO 1 J = 0,I0
C$DOISV
      DOALL 1 L = 0,NMAT
        IF (K.LE.N) THEN
          IF (J.EQ.0) THEN
 8          B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
 7          B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
          ENDIF
        ELSE
          IF (J.EQ.0) THEN
 9          B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
 6          B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
          ENDIF
        ENDIF
 1    CONTINUE

(L,I,K,J)

CFD application

• Computational Fluid Dynamics (CFD)
    • Navier-Stokes equations
    • Successive Over-Relaxation (SOR)
• Kernel 3D loop: difficult to analyze
    • 172 array references/iteration
    • 33 if-branches/iteration
• Unimodular transformation found!

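CFD Application

Before the transformation (axes i1, i2, i3):
    Range: i1 = 1, 4; i2 = 1, 4; i3 = 1, 4
    Plane: 3 i1 + 2 i2 + i3 = 9
    Points on the plane: (2,1,1), (1,2,2), (1,1,4)
    Seq. time: 64
    Dataflow: 19, Speedup: 3.37
    Loop time: 64, Speedup: 1.00

Transformation matrix:

    |  3  2  1 |
    |  0  1  0 |
    |  1  0  0 |

After the transformation (axes I1’, I2’, I3’):
    Range: I1’ = 6, 24; I2’ = 1, 4; I3’ = 1, 4
    Plane: i1’ = 9
    Points on the plane: (9,1,1), (9,2,1), (9,1,2)
    Seq. time
    DOALL i2’, i3’
    Dataflow: 19, Speedup: 3.37
    Loop time: 19, Speedup: 3.37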
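The figures on this slide can be cross-checked by applying the matrix to the 4 x 4 x 4 iteration space. A minimal Python sketch, assuming the nine matrix entries are read row by row as (3,2,1), (0,1,0), (1,0,0); `transform` is a hypothetical helper name.

```python
from itertools import product

T = ((3, 2, 1), (0, 1, 0), (1, 0, 0))   # assumed row-by-row reading


def transform(point):
    """Map an original iteration (i1, i2, i3) to (i1', i2', i3')."""
    return tuple(sum(t * x for t, x in zip(row, point)) for row in T)


space = list(product(range(1, 5), repeat=3))
images = [transform(p) for p in space]

planes = sorted({q[0] for q in images})   # distinct values of i1'
loop_time = len(planes)                   # sequential i1', DOALL i2', i3'
speedup = len(space) / loop_time

print(planes[0], planes[-1], loop_time, round(speedup, 2))
print([transform(p) for p in [(2, 1, 1), (1, 2, 2), (1, 1, 4)]])
```

Under that reading the range of i1’, the loop time of 19 and the speedup of 3.37 all agree with the slide, and the three labelled points land on the plane i1’ = 9.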

Conclusion and Future work

• Allowing the exact visualization of real program loops
• Assistance with detecting parallel loops
• Estimation of maximal speedup using dataflow execution
• Assistance with finding suitable loop transformations
• Future work:
    • Seamless integration into PPT (parallel programming environment)

Thanks for your attention!

Any questions?
