Loop Tiling for Iterative Stencil Computations

Marta Jiménez

What is an Iterative Stencil Computation?

• ISCs are often performed for PDE, GM, IP
– swim, tomcatv, mgrid (from the SPEC95 benchmark)
– Jacobi

DO K = 1, NITER   /* time-step loop */
  do J = ...
    do I = ...
      {A(I,J), A(I+1,J),…}
    enddo
  enddo
  {wrapped-around computations}
ENDDO

[Figure: Matrix A]

Loop Tiling

• Loop Tiling
– divides the iteration space (IS) into regular tiles to make the working set fit in the memory level being exploited
– can be applied hierarchically (Multilevel Tiling)

• Current algorithms for Loop Tiling are limited to loops that:
– are “perfectly” nested
– are fully permutable
– define a rectangular IS

• However, in iterative stencil computations, loops are:
– NOT perfectly nested
– NOT fully permutable
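As a minimal illustration of the basic transformation, here is a Python sketch that is not taken from the talk. It tiles a loop nest that is perfectly nested, fully permutable and rectangular, which is the case current algorithms handle. The function names and the `tile` parameter are hypothetical.

```python
def transpose_untiled(a):
    """Reference: a plain doubly nested loop over a rectangular iteration space."""
    n, m = len(a), len(a[0])
    out = [[0] * n for _ in range(m)]
    for j in range(n):
        for i in range(m):
            out[i][j] = a[j][i]
    return out


def transpose_tiled(a, tile):
    """Same computation; each loop is split into a tile loop (jj, ii)
    and an element loop (j, i) that stays inside one tile."""
    n, m = len(a), len(a[0])
    out = [[0] * n for _ in range(m)]
    for jj in range(0, n, tile):
        for ii in range(0, m, tile):
            for j in range(jj, min(jj + tile, n)):
                for i in range(ii, min(ii + tile, m)):
                    out[i][j] = a[j][i]
    return out
```

Because every iteration is independent here, the tiled order is trivially legal. The rest of the talk deals with the case where it is not.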

Today’s talk

• Show how Loop Tiling can be applied to iterative stencil computations
– based on Song & Li’s paper [PLDI99]

• define a Program Model

• 1 Level of 1D-Tiling (cache)
– program example: SWIM

• 2 levels of Tiling
– 2D-Tiling at the cache level
– 1D-Tiling at the register level (based on Jiménez et al. [ICS98][HPCA98])

• Performance Results
– Loop Tiling on EV5 & EV6

Steps

1- Apply a set of transformations to the original program to achieve the desired program model defined by Song & Li

2- Perform 2D-Tiling for the Cache Level

3- Perform 1D-Tiling for the Register Level

1st Step: achieve desired program model

Program Model:

DO K = 1, NITER   /* time-step loop */
  do J1 = LJ1, UJ1
    do I1 = LI1, UI1
      {A(I,J), A(I+1,J),…}
    enddo
  enddo
  . . .
  do Jm = LJm, UJm
    do Im = LIm, UIm
      {A(I,J), A(I+1,J),…}
    enddo
  enddo
ENDDO

Usually, programs are NOT written directly in this form, so we must apply a set of transformations to reach this program model.

SWIM original code

   initializations
90 NCYCLE = NCYCLE + 1
   CALL CALC1
   CALL CALC2
   IF (NCYCLE >= ITMAX) STOP
   IF (NCYCLE <= 1) THEN
     CALL CALC3Z
   ELSE
     CALL CALC3
   ENDIF
   GO TO 90

Transformations

– Inline subroutines
– Convert GO TO into DO-loop
– Peel iterations of the time-step loop to eliminate IF-statements guarded by NCYCLE
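The GO TO conversion and the peeling can be sketched in Python as follows. This is an illustration, not the talk's code: the subroutine calls are replaced by trace entries, and the function names are hypothetical. It assumes ITMAX >= 2, so that the first time step (which calls CALC3Z) and the last one (which stops after CALC2) can both be peeled.

```python
def swim_original(itmax):
    """Control flow of the original driver; the GO TO 90 becomes a while loop."""
    trace = []
    ncycle = 0
    while True:
        ncycle += 1
        trace += [("CALC1", ncycle), ("CALC2", ncycle)]
        if ncycle >= itmax:
            break  # STOP
        if ncycle <= 1:
            trace.append(("CALC3Z", ncycle))
        else:
            trace.append(("CALC3", ncycle))
    return trace


def swim_transformed(itmax):
    """DO-loop version with the first and last time steps peeled.
    Assumes itmax >= 2."""
    trace = [("CALC1", 1), ("CALC2", 1), ("CALC3Z", 1)]  # peeled K = 1
    for k in range(2, itmax):                             # DO K = 2, ITMAX-1
        trace += [("CALC1", k), ("CALC2", k), ("CALC3", k)]
    trace += [("CALC1", itmax), ("CALC2", itmax)]         # peeled K = ITMAX
    return trace
```

After peeling, the remaining loop body contains no IF on NCYCLE, and its bounds match the `DO K = 2, ITMAX-1` loops used in the rest of the talk.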

Wrapped-around Computations

      SUBROUTINE CALCX
        do J = 1,N
          do I = 1,M
            ...
          enddo
        enddo
c wrapped-around computations
        do J = 1, N
          ...
        enddo
        do I = 1, M
          ...
        enddo
        ...

      DO K = 2, ITMAX-1
        do J = 1,N
          do I = 1,M
            ...
          enddo
        enddo
c wrapped-around comp
        do J = 1, N
          ...
        enddo
        do I = 1, M
          ...
        enddo
        ...
        do J = 1,N
          do I = 1,M
            ...
          enddo
        enddo
        ...
      ENDDO

[Figure: iteration spaces of CALC1, CALC2 and CALC3 in the (I, J) plane]

Projection along direction I

      DO K = 2, ITMAX-1
        do J = 1,N
          ...
        enddo
c wrapped-around comp
        do J = 1, N
          ...
        enddo
        do J = 1,N
          ...
        enddo
c wrapped-around comp
        do J = 1, N
          ...
        enddo
        ...
      ENDDO

[Figure: iteration space projected onto J]

Another way to deal with the wrapped-around computations is to perform code sinking

DO K = 2, ITMAX-1
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
ENDDO

[Figure: iteration space along J]
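Code sinking can be illustrated with a small Python sketch. It assumes that "code sinking" here means moving a statement that sits between loop nests into a neighbouring loop under a guard. The array layout (a halo cell at index 0) and the wrap-around statement `b[0] = b[n]` are hypothetical stand-ins, not SWIM's actual boundary code.

```python
def with_separate_wraparound(a):
    """Loop over J followed by a separate wrapped-around statement."""
    n = len(a) - 1            # index 0 is a halo cell
    b = [0.0] * (n + 1)
    for j in range(1, n + 1):
        b[j] = 2.0 * a[j]
    b[0] = b[n]               # wrapped-around computation
    return b


def with_code_sinking(a):
    """Same computation with the wrapped-around statement sunk into the
    J loop, guarded so it runs only once its input has been produced."""
    n = len(a) - 1
    b = [0.0] * (n + 1)
    for j in range(1, n + 1):
        b[j] = 2.0 * a[j]
        if j == n:            # guard introduced by code sinking
            b[0] = b[n]
    return b
```

The guarded form leaves only loops inside the time-step loop, at the cost of a test in the loop body.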

1st Step: achieved program model

Flow dependencies & iteration space for SWIM (Projection along direction I)

[Figure: flow dependencies among CALC1, CALC2 and CALC3 for J = 1 to N, at steps K=2 and K=3 of the K-loop (time)]


1D-Tiling

[Figure: tiles along J (1 to N) at time steps K=2, K=3 and K=4, annotated with SLOPE and OFFSET-i]

• Dependencies are violated
• Tiling parameters: SLOPE, OFFSET-i
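The need for a slope can be seen on a much simpler problem than SWIM. The Python sketch below is an illustration under stated assumptions, not Song & Li's algorithm: it uses a 3-point Jacobi stencil with fixed boundaries and two alternating buffers. The `slope` parameter plays the role of SLOPE in spirit only, and OFFSET-i is not modelled. Each tile is shifted left by `slope` points per time step, so everything a tile reads has already been computed.

```python
def jacobi_reference(a0, steps):
    """Untiled 3-point Jacobi; boundary cells 0 and n-1 stay fixed."""
    bufs = [list(a0), list(a0)]
    n = len(a0)
    for t in range(steps):
        src, dst = bufs[t % 2], bufs[(t + 1) % 2]
        for j in range(1, n - 1):
            dst[j] = (src[j - 1] + src[j] + src[j + 1]) / 3.0
    return bufs[steps % 2]


def jacobi_skewed_tiles(a0, steps, tile, slope=1):
    """1D tiling along J with the time loop inside the tile loop.
    At time step t, the tile with origin jj covers
    [jj - slope*t, jj + tile - 1 - slope*t], clipped to the interior."""
    bufs = [list(a0), list(a0)]
    n = len(a0)
    # Tile origins run past the right edge so shifted tiles still reach it.
    for jj in range(1, n - 1 + slope * (steps - 1), tile):
        for t in range(steps):
            src, dst = bufs[t % 2], bufs[(t + 1) % 2]
            lo = max(1, jj - slope * t)
            hi = min(n - 2, jj + tile - 1 - slope * t)
            for j in range(lo, hi + 1):
                dst[j] = (src[j - 1] + src[j] + src[j + 1]) / 3.0
    return bufs[steps % 2]
```

With `slope=1` the tiled version reproduces the reference. With `slope=0` (rectangular tiles) a tile reads neighbours that the next tile has not yet updated, so the result differs.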

2D-Tiling

[Figure: 2D tiles in the (I, J) plane, with I from 1 to M and J from 1 to N, at successive steps of K (time-step loop)]

Tiling parameters: SLOPE, OFFSET-i for each tiled dimension (J and I)

Computed using the JI-loop distance subgraph

JI-loop Distance Subgraph

Each node represents a JI-loop nest

Each edge represents a dependence (distance vector)

[Figure: JI-loop distance subgraph with nodes JI1-loop, JI2-loop and JI3-loop; edges are flow dependencies, anti-dependencies and output dependencies. Distance vectors shown on the edges: [1,-1,0], [1,0,-1], [1,-1,-1], [1,0,0], [0,0,0]]

SWIM: Projection along direction I


Backward dependencies with large distances make Tiling unprofitable

– apply Circular Loop Skewing to shorten backward dependencies

DO K = 2, ITMAX-1
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
ENDDO

[Figure: iteration space along J (1 to N) at steps K=2 and K=3 of the K-loop (time)]

Circular Loop Skewing

Shortens backward dependencies by changing the iteration order

CLS parameters: BETA-i, DELTA (computed using the JI-loop distance subgraph)

[Figure: iteration order along J (1 to N) at K=2 and K=3, annotated with BETA-i and DELTA]

Circular Loop Skewing

DO K = 2, ITMAX-1
  do JX = 1+BETA1+DELTA*(K-2), N+BETA1+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
  do JX = 1+BETA2+DELTA*(K-2), N+BETA2+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
  do JX = 1+BETA3+DELTA*(K-2), N+BETA3+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
ENDDO

[Figure: iteration order along J (1 to N) at K=2 and K=3, annotated with BETA-i and DELTA]
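The index arithmetic of the skewed loops can be checked with a short Python sketch. It reads DELTA(K-2) in the loop bounds as the product of DELTA and (K-2). The helper name `cls_order` is hypothetical, and the parameter values in the usage note are chosen for illustration only.

```python
def cls_order(n, beta, delta, k, k0=2):
    """J values visited, in order, by one circularly skewed J loop at time step k.

    Mirrors:  do JX = 1+BETA+DELTA*(K-2), N+BETA+DELTA*(K-2)
                J = MOD(JX-1, N) + 1
    k0 is the first time step of the K loop (2 in the talk's code).
    """
    shift = beta + delta * (k - k0)
    # For positive operands Python's % agrees with Fortran's MOD.
    return [(jx - 1) % n + 1 for jx in range(1 + shift, n + shift + 1)]
```

For example, `cls_order(8, 1, 2, 3)` starts at J = 4 and wraps around through 8 back to 3. Every J in 1..N is still visited exactly once per time step; only the starting point rotates.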

2nd Step: 2D-Tiling for cache level

DO JJ = ...
  DO II = ...
    DO K = ...
      if (first tile) then
        do JX = ...
          offsets iter.
        enddo
      endif
      do JX = ...
        Iter. inside tile
      enddo
      do JX = ...
        Iter. inside tile
      enddo
      do JX = ...
        Iter. inside tile
      enddo
    ENDDO
  ENDDO
ENDDO

SWIM: projection along direction I
CLS parameters: DELTA=2, BETA1=0, BETA2=1, BETA3=2
Tiling parameters: SLOPE=2, OFFSET1=1, OFFSET2=OFFSET3=0

[Figure: tiles 1, 2 and 3 along J (1 to N) at time steps K=2, K=3 and K=4]


3rd Step: 1D-Tiling for register level

DO JJ = ...
  DO II = ...
    DO K = ...
      ...
      do JX = LJ, UJ
        J = MOD (JX-1, N)+1
        do IX = LI, UI
          I = MOD (IX-1, M)+1
          [loop body: {I,J}]
        enddo
      enddo
      ...
ENDDO

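The MOD operation introduced by CLS prevents us from fully unrolling the loop

First apply Index Set Splitting to loop J

[Figure: (I, J) iteration space with I from 1 to M and J wrapping around from N-2, N-1, N to 1, 2; the region to be unrolled is marked]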

Index Set Splitting

ISS splits a loop into two new loops that iterate over non-intersecting portions of the iteration space

DO JJ = ...
  DO II = ...
    DO K = ...
      ...
      do JX = LJ, min(N,UJ)
        J = JX
        do IX = ...
        enddo
      enddo
      do JX = max(N+1,LJ), UJ
        J = JX-N
        do IX = ...
        enddo
      enddo
      ...
ENDDO

[Figure: (I, J) iteration space with I from 1 to M and J wrapping around from N-2, N-1, N to 1, 2; ISS splits it at the wrap-around point]
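The effect of the split on the J index can be checked with a small Python sketch. It is an illustration with hypothetical function names, and it assumes 1 <= LJ <= UJ <= 2N, so that the skewed index wraps around at most once.

```python
def j_with_mod(lj, uj, n):
    """J values from the original loop: J = MOD(JX-1, N) + 1."""
    return [(jx - 1) % n + 1 for jx in range(lj, uj + 1)]


def j_with_iss(lj, uj, n):
    """J values after Index Set Splitting; neither loop needs MOD."""
    js = []
    for jx in range(lj, min(n, uj) + 1):       # do JX = LJ, min(N,UJ)
        js.append(jx)                          #   J = JX
    for jx in range(max(n + 1, lj), uj + 1):   # do JX = max(N+1,LJ), UJ
        js.append(jx - n)                      #   J = JX-N
    return js
```

In each of the two loops J is an affine function of JX, so J+1 and J+2 can be formed directly when the loop is unrolled.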

3rd Step: 1D-Tiling for register level

DO JJ = ...
  DO II = ...
    DO K = ...
      ...
      do JX = LJ, min(N,UJ)-3+1, 3
        J = JX
        do IX = ...
          [loop body: {J}]
          [loop body: {J+1}]
          [loop body: {J+2}]
        enddo
      enddo
      do JX = JX, min(N,UJ)
        J = JX
        do IX = ...
          [loop body: {J}]
        enddo
      enddo
      ...
ENDDO

[Figure: (I, J) iteration space with I from 1 to M and J wrapping around from N-2, N-1, N to 1, 2, after ISS]
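The unrolled J loop and its clean-up loop can be sketched in Python as follows. This is an illustration with hypothetical function names: the loop body is replaced by recording the (I, J) pair it would process, and the unroll factor is fixed at 3 as in the code above.

```python
def plain_nest(lj, uj, li, ui):
    """Original order: one J per outer iteration."""
    return [(i, j) for j in range(lj, uj + 1) for i in range(li, ui + 1)]


def unrolled_nest(lj, uj, li, ui):
    """J loop unrolled by 3 with the copies fused into the I loop,
    followed by a clean-up loop for the leftover J iterations."""
    out = []
    n_full = max(0, (uj - lj + 1) // 3)            # number of complete groups of 3
    for j in range(lj, lj + 3 * n_full, 3):        # do JX = LJ, UJ-3+1, 3
        for i in range(li, ui + 1):
            out.append((i, j))                     # [loop body: {J}]
            out.append((i, j + 1))                 # [loop body: {J+1}]
            out.append((i, j + 2))                 # [loop body: {J+2}]
    for j in range(lj + 3 * n_full, uj + 1):       # clean-up loop
        for i in range(li, ui + 1):
            out.append((i, j))                     # [loop body: {J}]
    return out
```

Both versions execute the same set of iterations. The unrolled one visits three neighbouring J values for each I, which is what lets values shared between them stay in registers.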

Code Transformations Summary

1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
– Inline subroutines
– Convert GO TO into DO-loop
– Peel iterations of the time-step loop to eliminate IF-statements

2- Perform 2D-Tiling for the Cache Level
– Construct JI-loop distance subgraph
– Compute DELTA and BETAs and apply CLS to shorten backward dependencies
– Update JI-loop distance subgraph
– Compute OFFSETs and SLOPE and tile the IS

3- Perform 1D-Tiling for the Register Level
– Index Set Splitting
– Tiling in a straightforward manner

Performance Results (SWIM)

• Architecture: EV56 (500MHz, L1: 8KB, L2: 96KB), EV6 (500MHz, L1: 64KB, L2: 4MB)

• Compiler Invocation:
– f77 -O5 -arch ev56 (EV5)
– kf77 -O5 -arch ev6 -notransform_loop -unroll 1 (EV6)

• Programs:
– 1D-Tiling for the Cache Level: loop J, TS = 4 (EV5), TS = 8 (EV6)
– 2D-Tiling for the Cache Level: TS IxJ = 32x16 (EV5), TS IxJ = 40x12 (EV6)
– 1D-Tiling for the register level: loop J, TS = 4 (EV5 & EV6)

[Figure: bar chart of speedup (axis 0.5 to 2.5) on EV5 and EV6 for ORI, ORI + RT, 1D, 1D + RT, 2D and 2D + RT]

Execution time:

       ORI     ORI + RT   1D      1D + RT   2D      2D + RT
EV5    1519s   1533s      1023s   999s      1009s   677s
EV6    439s    658s       294s    371s      578s    296s
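The speedups plotted in the chart can be recomputed from the execution times. The Python sketch below makes two assumptions that the slide does not state explicitly: that the row starting at 1519s belongs to EV5 and the row starting at 439s to EV6, and that speedup is measured against ORI on the same machine.

```python
VARIANTS = ["ORI", "ORI + RT", "1D", "1D + RT", "2D", "2D + RT"]

# Execution times in seconds, in the order of VARIANTS.
# Row-to-machine assignment is an assumption (see the text above).
TIMES_S = {
    "EV5": [1519, 1533, 1023, 999, 1009, 677],
    "EV6": [439, 658, 294, 371, 578, 296],
}


def speedups(times):
    """Speedup of each variant over the first entry (ORI)."""
    base = times[0]
    return [base / t for t in times]


if __name__ == "__main__":
    for machine, times in TIMES_S.items():
        row = ", ".join(f"{v}: {s:.2f}" for v, s in zip(VARIANTS, speedups(times)))
        print(f"{machine}: {row}")
```

Under these assumptions the largest speedup is 2D + RT on EV5, and several variants on EV6 fall below 1.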

Performance Results EV5 (SWIM)

• Architecture: EV56 (500MHz, L1: 8KB, L2: 96KB)

• Compiler invocations:
– base: kf77 -O5 -arch ev56
– no_prefetch: kf77 -O5 -arch ev56 -switch nolu_prefetch_fetch …..

[Figure: bar chart of speedup over ORI (base), axis 0.5 to 2.5, for base and no_prefetch; configurations ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]

Performance Results EV6 (SWIM)

• Architecture: EV6 (500MHz, L1: 64KB, L2: 4MB)

• Compiler invocations:
– base: f77 -O5 -arch ev6
– no_prefetch: f77 -O5 -arch ev6 -switch nolu_prefetch_fetch …..

[Figure: bar chart of speedup over ORI (base), axis 0 to 2.5, for base and no_prefetch; configurations ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]

Code for Result Verification

NEW in SPEC2000!!

      DO K = 2, ITMAX-1
        ...
        do J = 1,N
          ...
        enddo
c result verification
        IF (MOD(K,MPRINT).eq.0) THEN
          do I =
            do J =
              UCHECK = UCHECK + {UNEW(I,J)}
            enddo
            UNEW (I,I) = . . .
          enddo
          PRINTS
        ENDIF
        do J = 1,N
          ...
        enddo
      ENDDO

Apply strip-mining to loop K (only useful if MPRINT is large)
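One way to picture the strip-mining of loop K is the Python sketch below. It is an illustration under simplifying assumptions, not the talk's code: each time step is reduced to a single "step" event, verification is placed after the step rather than between loop nests, and the function names are hypothetical. Strips are cut just before each multiple of MPRINT, so the time steps inside a strip contain no verification code.

```python
def original_order(itmax, mprint):
    """DO K = 2, ITMAX-1 with the verification IF inside the loop."""
    events = []
    for k in range(2, itmax):
        events.append(("step", k))
        if k % mprint == 0:
            events.append(("verify", k))
    return events


def strip_mined_order(itmax, mprint):
    """K loop cut into strips that end just before each verifying time step."""
    events = []
    k = 2
    while k <= itmax - 1:
        next_check = ((k + mprint - 1) // mprint) * mprint  # smallest multiple >= k
        strip_end = min(next_check - 1, itmax - 1)
        for kk in range(k, strip_end + 1):   # strip with no verification inside
            events.append(("step", kk))
        if next_check <= itmax - 1:          # the verifying time step, on its own
            events.append(("step", next_check))
            events.append(("verify", next_check))
        k = next_check + 1
    return events
```

Each strip holds at most MPRINT-1 consecutive time steps, which is why the transformation only pays off when MPRINT is large.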
