1
Parallel Programming using the Iteration Space Visualizer
Yijun Yu and Erik H. D'Hollander
University of Ghent, Belgium
http://www.elis.rug.ac.be/paris/ppt
2
Introduction
• Overview of the approach
  • interactive vs automatic
• Loop dependence
• Iteration Space Dependence Graph (ISDG)
• Instrumentation and construction of the ISDG
• Visualization of …
  • Dependence
  • Transformations
• Applications and Results
• Conclusion and Future work
3
Overview of the approach
[Diagram: from Program to Code Generation]
• Components: Iteration Space Visualizer, Parallel Compiler
• Modes: Automatic, Interactive
• Steps shown: Instrument the program, Construct the ISDG, Dataflow Analysis, Dependence Analysis, Visualize dependence, Visualize transformation, Program Transformation
• Questions raised on the diagram: exact? why?
5
Loop Dependence
• Nested loops are the focus of parallel programming
• A data dependence occurs when there are multiple accesses to the same memory location and at least one of them is a WRITE
• A data dependence is classified as flow (first WRITE then READ), anti (first READ then WRITE) or output (WRITE after WRITE)
• A loop dependence is the ordering between data-dependent loop iterations
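The three classes can be told apart from the order of the two accesses alone. The following is a minimal Python sketch, assuming each access is recorded as 'R' or 'W' in execution order. The function name `classify_dependence` is hypothetical and is not part of the Iteration Space Visualizer.

```python
def classify_dependence(first, second):
    """Classify two accesses to the same memory location, given in execution order.

    Each access is 'R' (read) or 'W' (write); this encoding is an assumption.
    Returns 'flow', 'anti' or 'output', or None when both accesses are reads,
    because two reads impose no ordering.
    """
    kinds = {
        ("W", "R"): "flow",    # first WRITE then READ
        ("R", "W"): "anti",    # first READ then WRITE
        ("W", "W"): "output",  # WRITE after WRITE
    }
    return kinds.get((first, second))
```

For example, `classify_dependence("R", "W")` classifies a read followed by a write as an anti-dependence.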
6
The Iteration Space Dependence Graph (ISDG)
• The object to be visualized is … ISDG = Iteration Space + Loop Dependence
• An iteration I = (i1, ..., im) is a point in the m-dimensional iteration space, which is mapped to 3D space
• Dependent iterations I and J are linked by an arrow I → J
7
An example of ISDG
      do i=1,n
        do j=1,n
          do k=1,2
            if (k.eq.1) then
              a(i,j,k)=(a(i-1,j,k)+a(i+1,j,k))/2
            else
              a(i,j,k)=(a(i,j-1,k)+a(i,j+1,k))/2
            endif
          enddo
        enddo
      enddo
[Figure: ISDG of the example loop; axes i, j, k; iteration points labelled (1,1,1) to (4,4,1) in the plane k=1 and (1,1,2) to (4,4,2) in the plane k=2]
8
Instrumentation and the ISDG construction
• Program instrumentation
  • Loop iteration: id + indices
  • Array reference: id + name + Read | Write + subscripts
• ISDG construction
  1. Create the iteration points from the indices
  2. Set up a reference list for every accessed location
  3. Mark flow-, anti- and output-dependence arrows
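The three construction steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the trace layout (iteration id, indices, list of accesses) and the names `build_isdg` and `demo_trace` are hypothetical, and the demo trace was written by hand for the loop `a(i) = a(i-1) + a(i+1)`, i = 1..3.

```python
from collections import defaultdict

def build_isdg(trace):
    """Build iteration points and dependence arrows from an execution trace.

    trace: list of (iteration_id, indices, accesses), where accesses is a list
    of (array_name, mode, subscripts) with mode 'r' or 'w' (assumed layout).
    Returns (points, edges); edges holds (source_id, target_id, kind) triples.
    """
    points = {}
    references = defaultdict(list)
    for iteration_id, indices, accesses in trace:
        points[iteration_id] = tuple(indices)             # step 1: iteration points
        for name, mode, subscripts in accesses:           # step 2: reference lists
            references[(name, tuple(subscripts))].append((iteration_id, mode))

    edges = set()
    for accesses in references.values():                  # step 3: mark the arrows
        last_write, reads_since_write = None, []
        for iteration_id, mode in accesses:
            if mode == "r":
                if last_write is not None and last_write != iteration_id:
                    edges.add((last_write, iteration_id, "flow"))
                reads_since_write.append(iteration_id)
            else:
                for reader in reads_since_write:
                    if reader != iteration_id:
                        edges.add((reader, iteration_id, "anti"))
                if last_write is not None and last_write != iteration_id:
                    edges.add((last_write, iteration_id, "output"))
                last_write, reads_since_write = iteration_id, []
    return points, edges

# Hand-written trace of: do i=1,3: a(i) = a(i-1) + a(i+1)
demo_trace = [
    (i, (i,), [("a", "r", (i - 1,)), ("a", "r", (i + 1,)), ("a", "w", (i,))])
    for i in range(1, 4)
]
demo_points, demo_edges = build_isdg(demo_trace)
```

Dependences inside a single iteration are skipped on purpose, since the ISDG only links distinct iteration points.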
10
Dependence Visualization
• Loop visualization
  • 3D view-port of the iteration space
  • Graphical operations
• Detecting and enhancing parallelism
  • Automatic parallelization
  • Maximal parallelism detection
  • Parallelization by plane execution
11
Loop Visualization
• Visualization of the ISDG
  • Points + Arrows + Colors + Labels + Axes
• 3D view-port of the iteration space
  • =3D, >3D and <3D
  • projection (condensed points and arrows)
  • expansion (dummy index dimension)
• ISDG operations
  • Graphical operations: rotate, move and animate
  • Query dialogs: selection, variable zooming and dependence type filtering, etc.
12
Automatic Parallelization
• Sequential execution
  • Traverse the iteration space in lexicographical order and count the iterations: Tseq
• Parallel execution
  • Traverse the iterations of a marked loop in parallel and count the steps: Tpar
• Report the speedup Spara = Tseq / Tpar
• Automatic parallelization
  • Test whether the dependence ordering is kept for all combinations of loop parallelizations: DOALL i1,i2,i3? + DOALL i1,i2? + DOALL i1,i3? + DOALL i2,i3? + DOALL i1? + DOALL i2? + DOALL i3?
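The exhaustive test over all DOALL combinations can be sketched in Python. The sketch assumes that iterations sharing the same sequential (unmarked) indices form one parallel step and that steps run in lexicographic order of those indices; a combination is accepted only if every dependence arrow goes to a strictly later step. All function names are hypothetical. The arrows were derived by hand from the earlier example loop with the `k.eq.1` branch, taking n = 3.

```python
from itertools import combinations

def sequential_key(point, parallel_loops):
    """Indices of the loops that stay sequential (positions not marked DOALL)."""
    return tuple(v for pos, v in enumerate(point) if pos not in parallel_loops)

def doall_is_valid(edges, parallel_loops):
    """Every dependence source must run in a strictly earlier step than its target."""
    return all(
        sequential_key(src, parallel_loops) < sequential_key(dst, parallel_loops)
        for src, dst in edges
    )

def valid_parallelizations(points, edges, depth):
    """Map each valid DOALL combination (loop positions) to Spara = Tseq / Tpar."""
    t_seq = len(points)
    speedups = {}
    for size in range(depth, 0, -1):
        for loops in combinations(range(depth), size):
            marked = set(loops)
            if doall_is_valid(edges, marked):
                t_par = len({sequential_key(p, marked) for p in points})
                speedups[loops] = t_seq / t_par
    return speedups

# Example loop (i, j, k) with n = 3: arrows along i when k=1, along j when k=2.
n = 3
points = [(i, j, k) for i in range(1, n + 1) for j in range(1, n + 1) for k in (1, 2)]
edges = [((i, j, 1), (i + 1, j, 1)) for i in range(1, n) for j in range(1, n + 1)]
edges += [((i, j, 2), (i, j + 1, 2)) for i in range(1, n + 1) for j in range(1, n)]
speedups = valid_parallelizations(points, edges, depth=3)
```

In this small case only the innermost loop k (position 2) passes the test, which is why the other techniques on the following slides are useful.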
13
Maximal Parallelism Detection
• Data-flow order
  • An iteration is executed as soon as its data are ready, i.e. after all the iterations it depends on have been carried out
  • Iterations with the same delay are executed at the same time, i.e. in parallel
  • Dependent iterations are executed sequentially. Count the steps: Tdf
• Minimal execution time = maximal parallelism
  • Maximal speedup Smax = Tseq / Tdf
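A minimal Python sketch of the data-flow count, assuming the ISDG is given as a list of points and a list of (source, target) arrows, and that every arrow points from a lexicographically earlier iteration to a later one (as it does for arrows recorded from a sequential run). The name `dataflow_steps` is hypothetical. The example arrows were derived by hand from the earlier loop with the `k.eq.1` branch, taking n = 3.

```python
from collections import defaultdict

def dataflow_steps(points, edges):
    """Earliest step of each iteration: one more than its latest predecessor."""
    predecessors = defaultdict(list)
    for src, dst in edges:
        predecessors[dst].append(src)
    step = {}
    # Sorting tuples gives lexicographic (sequential) order, so every
    # predecessor already has a step when its successor is visited.
    for point in sorted(points):
        step[point] = 1 + max((step[p] for p in predecessors[point]), default=0)
    return step

n = 3
points = [(i, j, k) for i in range(1, n + 1) for j in range(1, n + 1) for k in (1, 2)]
edges = [((i, j, 1), (i + 1, j, 1)) for i in range(1, n) for j in range(1, n + 1)]
edges += [((i, j, 2), (i, j + 1, 2)) for i in range(1, n + 1) for j in range(1, n)]

steps = dataflow_steps(points, edges)
t_seq = len(points)            # sequential time: one step per iteration
t_df = max(steps.values())     # data-flow time: length of the longest chain
s_max = t_seq / t_df           # maximal speedup
```

Here the longest dependence chain has three iterations, so the 18 iterations need three data-flow steps.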
14
Plane Parallelization
• Define a cutting plane Ax+By+Cz=D
  • by clicking at three points
  • by giving the parameters A, B, C, D
• Plane execution
  • Traverse the planes d0 ≤ Ax+By+Cz < d0+Td along the normal vector (A,B,C)
• Plane parallelization
  • Matching the dataflow execution may enhance the speedup Splane = Tseq / Td
  • Verified by cross-plane dependence checking or 3D→2D projection checking
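Cross-plane dependence checking can be sketched in Python as follows. The sketch assumes integer coefficients, takes Td as the number of integer plane offsets between the smallest and largest value of Ax+By+Cz, and accepts a plane only if every arrow leads to a strictly later plane. The name `plane_schedule` is hypothetical, and the arrows were derived by hand from the earlier loop with the `k.eq.1` branch, taking n = 3.

```python
def plane_schedule(points, edges, normal):
    """Return (valid, t_d) for execution by planes with the given normal vector.

    valid: True when every dependence arrow crosses to a strictly later plane.
    t_d:   number of plane steps, d0 <= A*x + B*y + C*z < d0 + t_d.
    """
    def offset(point):
        return sum(c * x for c, x in zip(normal, point))

    valid = all(offset(src) < offset(dst) for src, dst in edges)
    offsets = [offset(p) for p in points]
    t_d = max(offsets) - min(offsets) + 1
    return valid, t_d

n = 3
points = [(i, j, k) for i in range(1, n + 1) for j in range(1, n + 1) for k in (1, 2)]
edges = [((i, j, 1), (i + 1, j, 1)) for i in range(1, n) for j in range(1, n + 1)]
edges += [((i, j, 2), (i, j + 1, 2)) for i in range(1, n + 1) for j in range(1, n)]

valid, t_d = plane_schedule(points, edges, normal=(1, 1, 0))
s_plane = len(points) / t_d    # Splane = Tseq / Td
```

With the normal (1,1,0) every arrow advances by exactly one plane, whereas the normal (1,0,0) is rejected because the k=2 arrows stay inside a plane.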
15
Dependence Visualization procedural summary
[Flowchart]
• Steps: Start; Maximal parallelism detection (Sdf); Automatic parallelization (Spara); Plane parallelization (Splane); Program transformation; Prune false dependences; End
• Decisions (Yes/No): Spara = Sdf? and Splane > Spara?
16
Program Transformations
• When Sdf > Spara, loop transformations may enhance the parallelism of the target loop …
• Unimodular Loop Transformations
  • Why? 3D → 3D, 1-to-1, etc.
• Loop Projections and Expansions
  • Loop Projection: >3D → 3D
  • Loop Expansion: <3D → 3D
17
Unimodular Transformations
[Figure: a 3×3 transformation matrix with unknown entries (?) is given the normal vector (A,B,C) as its first row; the remaining entries (!) are then filled in subject to two constraints]
• Unimodular
• Legality
• Look for a suitable transformation
  • Interactive way
  • Automatic way
    • Possible when the array index expressions are linear and all the distance vectors lie in a plane
    • Extract the largest base vectors of the dependence distances and construct the transformation (pseudo distance matrix approach)
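The two constraints can be checked mechanically. The Python sketch below assumes a 3×3 integer matrix and tests unimodularity (determinant ±1) and legality (every transformed distance vector stays lexicographically positive). The matrix is the one given for Lim's example (I' = I - J + K, J' = I, K' = J); the two distance vectors were derived by hand from the expanded loop of that example and are an assumption, as are all function names.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_unimodular(m):
    """Integer entries and determinant +1 or -1: a 1-to-1 map of integer points."""
    return all(isinstance(x, int) for row in m for x in row) and abs(det3(m)) == 1

def transform(m, vector):
    """Matrix-vector product m * vector."""
    return tuple(sum(row[k] * vector[k] for k in range(3)) for row in m)

def is_legal(m, distances):
    """Legal when each transformed distance is lexicographically positive."""
    return all(transform(m, d) > (0, 0, 0) for d in distances)

T = [[1, -1, 1],
     [1, 0, 0],
     [0, 1, 0]]
distances = [(1, 0, -1), (0, 1, 1)]   # hand-derived, see the lead-in

transformed = [transform(T, d) for d in distances]
outer_loop_parallel = all(t[0] == 0 for t in transformed)
```

Both transformed distances have a zero first component, so no dependence is carried by the new outermost loop.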
18
Loop Expansion
• Non-perfectly vs perfectly nested loop
• Statement vs iteration-level parallelism
• Statement reordering, affine remapping
• Loop expansion
  • Use an additional dimension to index the statements in the loop body
  • Unimodular loop transformations are still applicable at the statement level
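Loop expansion can be illustrated with a short Python sketch: the statements of a loop body are put in a list and an extra innermost index selects one statement per iteration. The two statements are those of the loop used later in Lim's example; the helper names, the dictionary-based state and the initial array values are assumptions made only for this illustration.

```python
import copy

def stmt_a(state, l1, l2):
    state["a"][l1][l2] = state["a"][l1][l2] + state["b"][l1 - 1][l2]

def stmt_b(state, l1, l2):
    state["b"][l1][l2] = state["a"][l1][l2 - 1] * state["b"][l1][l2]

STATEMENTS = [stmt_a, stmt_b]

def run_original(n, state):
    """Two-deep loop whose body holds several statements."""
    for l1 in range(1, n + 1):
        for l2 in range(1, n + 1):
            for statement in STATEMENTS:
                statement(state, l1, l2)

def run_expanded(n, state):
    """Three-deep loop: the extra index l3 selects one statement per iteration."""
    points = []
    for l1 in range(1, n + 1):
        for l2 in range(1, n + 1):
            for l3 in range(len(STATEMENTS)):
                STATEMENTS[l3](state, l1, l2)
                points.append((l1, l2, l3))
    return points

n = 4
initial = {
    "a": [[i + j + 1 for j in range(n + 1)] for i in range(n + 1)],
    "b": [[i - j + 2 for j in range(n + 1)] for i in range(n + 1)],
}
original_state, expanded_state = copy.deepcopy(initial), copy.deepcopy(initial)
run_original(n, original_state)
expanded_points = run_expanded(n, expanded_state)
```

The expanded loop computes the same arrays while exposing each statement instance as its own point, here 4 × 4 × 2 = 32 points.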
20
Application and Results
• Gauss-Jordan: linear system solver
• Lim’s example: statement-level parallelism
• Cholesky kernel: loop projection
• CFD application: unimodular transformation
21
Gauss-Jordan elimination
C The original program
      do i=1,n
        do j=1,n
          if (i.ne.j) then
            f=a(j,i)/a(i,i)
C$doisv
            do k=i+1,n+1
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo

C The instrumented program
      id=0
      do i = 1,n
        do j = 1,n
          if (i.ne.j) then
            write(11,*) id+1," r ","a",2,j,i
            write(11,*) id+1," r ","a",2,i,i
            write(11,*) id+1," w ","f"," 1 0 "
            f=a(j,i)/a(i,i)
C Upper bound n+1, as in the original loop
            do k = i+1,n+1
              id=id+1
              write(11,*) id,i,j,k
              write(11,*) id," r ","a",2,j,k
              write(11,*) id," r ","f"," 1 0 "
              write(11,*) id," r ","a",2,i,k
              write(11,*) id," w ","a",2,j,k
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo
22
Gauss-Jordan elimination
Plane: I = 1
DOALL J, K valid
Seq. time: 30
Dataflow: 4, Speedup: 7.5
Loop time: 4, Speedup: 7.5
[Figure: ISDG of the Gauss-Jordan loop; axes I, J, K; iteration points labelled from (1,2,2) to (4,3,5)]
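The reported sequential and dataflow times can be reproduced with a small Python sketch. It assumes n = 4 (the largest first index among the point labels), tracks only the accesses to the array a, and treats the scalar f as private to each (i, j) pair, so f creates no arrows; these are assumptions of the sketch, as are the function names.

```python
from collections import defaultdict

def gauss_jordan_trace(n):
    """Yield (indices, read locations, written locations) per innermost iteration."""
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if i != j:
                for k in range(i + 1, n + 2):
                    # a(j,i) and a(i,i) feed f; a(j,k) and a(i,k) feed the update
                    yield (i, j, k), [(j, i), (i, i), (j, k), (i, k)], [(j, k)]

def sequential_and_dataflow_time(trace):
    """Count the iterations (Tseq) and the data-flow steps (Tdf) of a trace."""
    write_step = defaultdict(int)   # location -> step of its latest write (0: none)
    read_step = defaultdict(int)    # location -> latest step reading it since then
    t_seq = t_df = 0
    for _, reads, writes in trace:
        ready = 0
        for loc in reads:           # flow: wait for the latest write
            ready = max(ready, write_step[loc])
        for loc in writes:          # output and anti: wait for earlier write and reads
            ready = max(ready, write_step[loc], read_step[loc])
        step = ready + 1
        for loc in reads:
            read_step[loc] = max(read_step[loc], step)
        for loc in writes:
            write_step[loc], read_step[loc] = step, 0
        t_seq += 1
        t_df = max(t_df, step)
    return t_seq, t_df

t_seq, t_df = sequential_and_dataflow_time(gauss_jordan_trace(4))
speedup = t_seq / t_df
```

Under these assumptions the sketch gives 30 iterations and 4 data-flow steps, matching the figures on the slide: all arrows run between different values of I, which is consistent with DOALL J, K being valid.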
23
Lim’s Example
C The original program
      do l1=1,n
        do l2=1,n
          a(l1,l2)=a(l1,l2)+b(l1-1,l2)
          b(l1,l2)=a(l1,l2-1)*b(l1,l2)
        enddo
      enddo

C After loop expansion
      do l1=1,n
        do l2=1,n
c$doisv
          do l3=0,1
            if(l3.eq.0) a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if(l3.eq.1) b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddo
24
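Lim’s example: unimodular transformation
Before the transformation (axes l1, l2, l3):
  Plane: L1-L2+L3=0
  DOALL L3 valid
  Seq. time: 32, Dataflow: 7, Speedup: 4.57
  Loop time: 16, Speedup: 2.00
Transformation matrix:
  $T = \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$
After the transformation (axes i1, i2, i3):
  Plane: i1 = 0
  DOALL i1 valid
  Seq. time: 32, Dataflow: 7, Speedup: 4.57
  Loop time: 7, Speedup: 4.57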
25
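Lim’s example: Code generation
C The unimodular transformed code
      doall i1 = 1-n, n
        do i2 = max(i1,1), min(n,i1+n)
          do i3 = max(-i1+i2,1), min(-i1+i2+1,n)
            l1 = i2
            l2 = i3
            l3 = i1 - i2 + i3
C l3 takes the values 0 and 1, as in the expanded loop
            if (l3.eq.0) a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if (l3.eq.1) b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddoall
Loop bounds, by Fourier-Motzkin elimination on
  $T = \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$
Index expressions, by inversion:
  $T^{-1} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & -1 & 1 \end{pmatrix}$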
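The generated bounds and the inverse mapping can be checked by enumeration. The Python sketch below assumes n = 4 and an expanded index l3 ranging over 0 and 1. It walks the transformed nest, maps each (i1, i2, i3) back with l1 = i2, l2 = i3, l3 = i1 - i2 + i3, and compares the result with the original iteration space. Function names are hypothetical.

```python
def matmul3(x, y):
    """Product of two 3x3 matrices."""
    return [[sum(x[r][k] * y[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def original_points(n):
    """Iteration space of the expanded loop (l1, l2, l3)."""
    return {(l1, l2, l3)
            for l1 in range(1, n + 1) for l2 in range(1, n + 1) for l3 in (0, 1)}

def transformed_points(n):
    """Walk the transformed nest and map every (i1, i2, i3) back to (l1, l2, l3)."""
    visited = []
    for i1 in range(1 - n, n + 1):
        for i2 in range(max(i1, 1), min(n, i1 + n) + 1):
            for i3 in range(max(-i1 + i2, 1), min(-i1 + i2 + 1, n) + 1):
                visited.append((i2, i3, i1 - i2 + i3))
    return visited

T = [[1, -1, 1], [1, 0, 0], [0, 1, 0]]
T_INV = [[0, 1, 0], [0, 0, 1], [1, -1, 1]]
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

n = 4
visited = transformed_points(n)
same_space = set(visited) == original_points(n) and len(visited) == len(set(visited))
```

Each original iteration is visited exactly once, so the transformed nest executes the same 32 statement instances in a different order.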
26
Lim’s example: Code generation
symbolic n;
IS1:={[i,j,k]:1<=i,j<=n && k=0};
IS2:={[i,j,k]:1<=i,j<=n && k=1};
T1:={[i,j,k]->[i-j+k,i,j]};
T2:={[i,j,k]->[i-j+k,i,j]};
codegen 0 T1:IS1,T2:IS2;
Transformation matrix:
  $T = \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$
  I' = I - J + K
  J' = I
  K' = J
27
Lim’s example: Code generation
Transformation matrix:
  $T = \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$
C the optimized code by Omega calculator
      doall p = 1-n, n
        if (p.ge.1) b(p,1) = a(p,0) * b(p,1)
        do l1 = max(p+1,1), min(p+n-1,n)
          a(l1,l1-p) = a(l1,l1-p)+b(l1-1,l1-p)
C The second statement assigns to b, as in the original loop
          b(l1,l1-p+1) = a(l1,l1-p)*b(l1,l1-p+1)
        enddo
        if (p.le.0) a(p+n,n) = a(p+n,n)+b(p+n-1,n)
      enddoall
28
Cholesky kernel (I,K,J,L)
C THE ORIGINAL KERNEL
      DO 6 I = 0, NRHS
      DO 7 K = 0, N
        DO 8 L = 0, NMAT
 8        B(I,L,K) = B(I,L,K) * A(L,0,K)
        DO 7 J = 1, MIN (M, N-K)
        DO 7 L = 0, NMAT
 7        B(I,L,K+J) = B(I,L,K+J) - A(L,-J,K+J) * B(I,L,K)
      DO 6 K = N, 0, -1
        DO 9 L = 0, NMAT
 9        B(I,L,K) = B(I,L,K) * A(L,0,K)
        DO 6 J = 1, MIN (M, K)
        DO 6 L = 0, NMAT
 6        B(I,L,K-J) = B(I,L,K-J) - A(L,-J,K) * B(I,L,K)

C After loop fusion
      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
        IF (K.LE.N) THEN
          I0 = MIN(M,N-K)
        ELSE
          I0 = MIN(M,2*N-K+1)
        ENDIF
        DO 1 J = 0,I0
C$DOISV
        DO 1 L = 0,NMAT
          IF (K.LE.N) THEN
            IF (J.EQ.0) THEN
 8            B(I,L,K)=B(I,L,K)*A(L,0,K)
            ELSE
 7            B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
            ENDIF
          ELSE
            IF (J.EQ.0) THEN
 9            B(I,L,K)=B(I,L,K)*A(L,0,K)
            ELSE
 6            B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
            ENDIF
          ENDIF
 1    CONTINUE
29
Cholesky Kernel
Loop Projections of the (I,K,J,L) iteration space
[Figures: one view with axes I, K, J; one view with axes I, K, L]
Plane: L=0
30
Cholesky kernel (I,K,J,L)
      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
        IF (K.LE.N) THEN
          I0 = MIN(M,N-K)
        ELSE
          I0 = MIN(M,2*N-K+1)
        ENDIF
        DO 1 J = 0,I0
C$DOISV
        DOALL 1 L = 0,NMAT
          IF (K.LE.N) THEN
            IF (J.EQ.0) THEN
 8            B(I,L,K)=B(I,L,K)*A(L,0,K)
            ELSE
 7            B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
            ENDIF
          ELSE
            IF (J.EQ.0) THEN
 9            B(I,L,K)=B(I,L,K)*A(L,0,K)
            ELSE
 6            B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
            ENDIF
          ENDIF
 1    CONTINUE
(L,I,K,J)
31
CFD application
• Computational Fluid Dynamics (CFD)
  • Navier-Stokes equations
  • Successive Over-Relaxation (SOR)
• Kernel 3D loop: difficult to analyze
  • 172 array references/iteration
  • 33 if-branches/iteration
• Unimodular transformation found!
32
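CFD Application
Before the transformation (axes i1, i2, i3):
  Range: i1 = 1, 4; i2 = 1, 4; i3 = 1, 4
  Plane: 3 i1 + 2 i2 + i3 = 9, with labelled points (2,1,1), (1,2,2), (1,1,4)
  Seq. time: 64, Dataflow: 19, Speedup: 3.37
  Loop time: 64, Speedup: 1.00
Transformation matrix:
  $T = \begin{pmatrix} 3 & 2 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}$
After the transformation (axes I1', I2', I3'):
  Range: I1' = 6, 24; I2' = 1, 4; I3' = 1, 4
  Plane: i1' = 9, with labelled points (9,1,1), (9,2,1), (9,1,2)
  DOALL i2', i3'
  Seq. time
  Dataflow: 19, Speedup: 3.37
  Loop time: 19, Speedup: 3.37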
33
Conclusion and Future work
• Allowing the exact visualization of real program loops
• Assistance with detecting parallel loops
• Estimation of the maximal speedup using dataflow execution
• Assistance with finding suitable loop transformations
• Future work:
  • Seamless integration into PPT (parallel programming environment)
34
Thanks for your attention!
Any questions?