
    © 2004 IBM

Charles Grassl, IBM

    August, 2004

    pSeries Optimization

    2 © 2004 IBM Corporation

    Agenda

Optimization
– Strategies and techniques
– Bandwidth Exploitation
– Pipelining


    3 © 2004 IBM Corporation

    POWER4 Performance Expectations

Floating point:
– Hundreds to thousands of Mflop/s
– Limitations:
  – System bandwidth
  – Program model
  – Floating point operations: adds and multiplies, divides
  – Memory access, copying

Bandwidth:
– 0.100 – 5 Gbyte/s
– Limitations:
  – Strided access is slow
– Enhancers:
  – Multiple streams

    4 © 2004 IBM Corporation

    Performance Expectations Example: program POP

Structured Fortran 90
Data movement
Low computational intensity
Low FMA percentage

Instructions per cycle                0.9
HW floating point inst. per cycle     0.3
Floating point instruction rate       372 Mflop/s
FMA percentage                        53%
Computational intensity               0.87


    5 © 2004 IBM Corporation

Program POP: Computation Time Distribution

[Bar chart: % of time per subroutine; y axis 0–12%]

    6 © 2004 IBM Corporation

    Performance Expectations Example: program SPPM

Fortran 77
Optimized
High computational intensity
Vector intrinsics
Low FMA percentage

Instructions per cycle                1.2
HW floating point inst. per cycle     0.7
Floating point instruction rate       979 Mflop/s
FMA percentage                        55%
Computational intensity               1.8

    7 © 2004 IBM Corporation

Program SPPM: Computation Time Distribution

[Bar chart: % of time per subroutine; y axis 0–60%]

    8 © 2004 IBM Corporation

    Bandwidth Exploitation

Computational intensity
– Increase the number of flops per memory reference

Blocking
– Reuse cache

Load streams
– Expose up to 8 memory access patterns

    9 © 2004 IBM Corporation

    Computational Intensity Examples

Loop                            Comp. Int.   Comment
A(i) = B(i) + C(i)                 0.33      SUM
A(i) = A(i) + s*B(i)               0.67      AXPY
A(i) = r*B(i) + s                  1         Scale
A(i) = r*B(i) + s*C(i)             1         Triad
A(i) = r + B(i)*(s + t*B(i))       2         Polynomial

    10 © 2004 IBM Corporation

    Computational Intensity Tests

[Bar chart: Mflop/s (0–1000) vs. computational intensity 0.33, 0.5, 0.67, 1, 1.33, 1.5, 2, 2.33, 3]

Data from 1.3 GHz p690

    11 © 2004 IBM Corporation

    Computational Intensity

Nominal bandwidth is 2.5 Gbyte/s for single-line loops
– ≈300 Mword/s (2.5 Gbyte/s ÷ 8 bytes per word)
– 300 Mflop/s for computational intensity of 1
– 600 Mflop/s for computational intensity of 2
– ...

Performance is (often) limited by bandwidth
– NOT by functional unit performance

    12 © 2004 IBM Corporation

    Computational Intensity Strategy: Loop Unrolling

Outer loop strategy:
– Increase computational intensity
– Minimize loads and stores

Find a variable that is constant with respect to the outer loop
– Unroll so that this variable is loaded once but used multiple times

    13 © 2004 IBM Corporation

    Outer Loop Unrolling Example

DO I = 1, N
   DO J = 1, N
      s = s + X(J)*A(J,I)
   END DO
END DO

2 flops / 2 loads
Comp. Int.: 1

DO I = 1, N, 4
   DO J = 1, N
      s = s + X(J)*A(J,I+0) + X(J)*A(J,I+1) &
            + X(J)*A(J,I+2) + X(J)*A(J,I+3)
   END DO
END DO

8 flops / 5 loads
Comp. Int.: 1.6
(remainder code for N not a multiple of 4 omitted)

    14 © 2004 IBM Corporation

    Outer Loop Unroll Test

[Bar chart: Mflop/s (0–600) vs. unroll factor 1, 2, 4, 8, 12; -O3 vs. -O3 -qhot]

    15 © 2004 IBM Corporation

    Outer Loop Unroll Analysis

Strategy:
– Expose 8 prefetch streams
– Unroll up to 8 times

Compiler:
– Unrolls 4 times
– Near-optimal performance
– Combines inner and outer loop unrolling

    16 © 2004 IBM Corporation

    Inner Loop Unrolling Strategies

Inner loop strategy:
– Reduce data dependencies
– Eliminate intermediate loads and stores
– Expose functional units
– Expose registers

Examples (see the sketch below):
– Linear recurrences
– Simple loops
– Single operation
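A minimal sketch of the idea (a DAXPY-style loop; the names and the unroll factor are illustrative, not from the original slides): unrolling the inner loop by 4 creates four independent update chains per iteration, and the remainder loop keeps the code correct for any N.

      program inner_unroll
        implicit none
        integer, parameter :: n = 1003
        real(kind=8) :: X(n), Y(n), s
        integer :: i, m
        X = 1.0d0
        Y = 0.0d0
        s = 3.0d0
        ! Remainder loop first, then the body unrolled by 4:
        ! four independent FMAs per pass expose both floating
        ! point units and more registers.
        m = mod(n, 4)
        do i = 1, m
           Y(i) = Y(i) + s*X(i)
        end do
        do i = m+1, n, 4
           Y(i)   = Y(i)   + s*X(i)
           Y(i+1) = Y(i+1) + s*X(i+1)
           Y(i+2) = Y(i+2) + s*X(i+2)
           Y(i+3) = Y(i+3) + s*X(i+3)
        end do
        print *, Y(n)
      end program inner_unroll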


    17 © 2004 IBM Corporation

    Software Pipelining

Exploit registers
– Number of registers in POWER4 is critical
– Some assistance from rename registers

Compiler does unrolling at -O3 and higher
– User unrolling can conflict with the compiler

    18 © 2004 IBM Corporation

    Software Pipelining

do i = 1, n
   sum = sum + X(i)
end do

do i = 1, n-3, 4
   sum1 = sum1 + X(i)
   sum2 = sum2 + X(i+1)
   sum3 = sum3 + X(i+2)
   sum4 = sum4 + X(i+3)
end do
sum = sum1 + sum2 + sum3 + sum4
(remainder iterations for n not a multiple of 4 omitted)

Explicit unrolling: expose more variables for register use

    19 © 2004 IBM Corporation

    Software Pipelining Example

[Bar chart: Mflop/s (0–900) vs. unroll factor 1, 2, 4, 6, 8, 12; compiled with -O2]

    20 © 2004 IBM Corporation

    Software Pipelining Example

[Bar chart: Mflop/s (0–1000) vs. unroll factor 1, 2, 4, 6, 8, 12; compiled with -O3]

    21 © 2004 IBM Corporation

    Software Pipelining Example

[Bar chart: Mflop/s (0–1000) vs. unroll factor 1, 2, 4, 6, 8, 12; -O2 vs. -O3]

    22 © 2004 IBM Corporation

    Software Pipelining Example

Conclusion:
– Avoid explicit user unrolling at -O3
– Allow the compiler to perform unrolling

[Bar chart: Mflop/s (0–1000) vs. unroll factor 1, 2, 4, 6, 8, 12]

    23 © 2004 IBM Corporation

    Computational Intensity: Effect of Cache

Large caches alleviate the memory bandwidth bottleneck
– Store parts of the computation stream in cache
– Reduce shared memory contention

hpmcount group 59 reports "loads and stores" from the processor
– It does not count loads satisfied from cache

Need an extension to the concept of computational intensity

    24 © 2004 IBM Corporation

    Computational Intensity: Effect of Cache

Example:

...
do j = 1, m
   call DAXPY(.., A(1,k), .., A(1,j), ..)
end do
...

subroutine daxpy(...)
do i = 1, n
   X(i) = X(i) + t*Y(i)
end do

One of the two columns, A(:,k), is reused from cache on every pass of the j loop, so the loop body effectively costs 2 flops per 2 memory references.
Computational intensity: 1. (A self-contained sketch follows.)
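A hedged, self-contained sketch of the pattern (my_daxpy and its argument order are illustrative, not the BLAS or the slide's interface; the order is chosen so that A(:,k) is the reused read-only operand, which is what makes the 2-references count work):

      program daxpy_cache
        implicit none
        integer, parameter :: n = 1000, m = 1000, k = 1
        real(kind=8) :: A(n,m), t
        integer :: j
        t = 2.0d0
        A = 1.0d0
        ! A(:,k) is read on every pass but stays cache resident;
        ! only the streamed column A(:,j) (load + store) generates
        ! memory traffic: 2 flops per 2 references.
        do j = 2, m
           call my_daxpy(n, t, A(1,j), A(1,k))
        end do
        print *, A(1,2)
      contains
        subroutine my_daxpy(n, t, X, Y)
          integer, intent(in) :: n
          real(kind=8), intent(in) :: t, Y(n)
          real(kind=8), intent(inout) :: X(n)
          integer :: i
          do i = 1, n
             X(i) = X(i) + t*Y(i)
          end do
        end subroutine my_daxpy
      end program daxpy_cache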


    25 © 2004 IBM Corporation

    Loop of DAXPY

[Line chart: Mop/s (0–2500) vs. vector length (0–2500); series: bandwidth, computation, bandwidth w/o cache, computation w/o cache]

Results from 1.5 GHz POWER4+

    26 © 2004 IBM Corporation

    Memory Access Strides

Strided memory access:
– Fewer words used per cache line
– Reduced cache line utilization
– Reduced bandwidth

Memory is accessed by cache line
– 16 REAL*8 (double) words per 128-byte line
(see the sketch below)
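A minimal sketch (names illustrative): a stride-16 reduction over REAL*8 data touches a new 128-byte cache line on every load, using only 1 of the 16 words each line delivers.

      program stride_sum
        implicit none
        integer, parameter :: n = 1000000, s = 16
        real(kind=8) :: A(n), total
        integer :: i
        A = 1.0d0
        total = 0.0d0
        ! Stride-s access: only 16/s of each 128-byte line is used
        do i = 1, n, s
           total = total + A(i)
        end do
        print *, total
      end program stride_sum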


    27 © 2004 IBM Corporation

    Stride Test Bandwidth

[Bar chart: Mbyte/s (0–2500) vs. stride -16, -14, -12, -10, -8, -6, -4, -2, -1, 1, 2, 4, 6, 8, 10, 12, 14, 16]

1.3 GHz POWER4

    28 © 2004 IBM Corporation

Strides and TLB

Strided memory access:
– Fewer memory references per memory page
– Increased TLB misses

TLB signature obtained from:
– hpmcount -g 59
– Memory references per TLB miss

do i = 1, m
   do j = 1, n
      A(i,j) = A(i,j) + ...   ! inner loop strides through A by the leading dimension
   end do
end do

    29 © 2004 IBM Corporation

    Large Stride Test

[Bar chart: Mbyte/s (0–450) vs. stride 8, 32, 128, 512, 2048; series: cache, TLB]

    30 © 2004 IBM Corporation

    Cache Blocking

Common technique in linear algebra
– Similar to unrolling
– Utilize cache lines

Typical linear algebra blocking factor NB: 96–256


    31 © 2004 IBM Corporation

    Blocking

[Diagram: blocked access pattern]

    32 © 2004 IBM Corporation

    Blocking

do i = 1, n
   do j = 1, m
      B(j,i) = A(i,j)
   end do
end do

do i1 = 1, n, nb
   i2 = min(i1+nb-1, n)
   do j1 = 1, m, nb
      j2 = min(j1+nb-1, m)
      do i = i1, i2
         do j = j1, j2
            B(j,i) = A(i,j)
         end do
      end do
   end do
end do

    33 © 2004 IBM Corporation

    Blocking Example: Transpose

[Bar chart: Mbyte/s (0–1800) vs. blocking factor 1, 32, 64, 128, 192, and dgetmo]

    34 © 2004 IBM Corporation

    Prefetch Strategies

Merge loops
– Combine conforming loops
– Compiler can do much of this

Folding
– Useful for very long loops
– Compiler can do much of this

    35 © 2004 IBM Corporation

    Merge Loops

Overlap cache line fetches
Expose memory prefetching (see the sketch below)

[Original C example truncated]
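The original C example was cut off in extraction; here is a minimal Fortran sketch of the same idea (array names are illustrative). Two conforming loops each make a separate pass over memory; the merged loop fetches all four arrays' cache lines in one pass:

      program merge_loops
        implicit none
        integer, parameter :: n = 1000000
        real(kind=8) :: A(n), B(n), C(n), D(n), s
        integer :: i
        A = 1.0d0; B = 0.0d0; C = 1.0d0; D = 0.0d0
        s = 2.0d0
        ! Before: two conforming loops, two passes over memory
        do i = 1, n
           B(i) = B(i) + s*A(i)
        end do
        do i = 1, n
           D(i) = D(i) + s*C(i)
        end do
        ! After: one merged loop; the cache line fetches of A, B,
        ! C and D overlap, exposing four prefetch streams at once
        do i = 1, n
           B(i) = B(i) + s*A(i)
           D(i) = D(i) + s*C(i)
        end do
        print *, B(1), D(1)
      end program merge_loops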


    37 © 2004 IBM Corporation

    Fold Loops

Increase the number of streams at the expense of loop length

do i = 1, n
   sum = sum + A(i)
end do

One stream per loop

do i = 1, n/4
   sum = sum + A(i)         &
             + A(i + 1*n/4) &
             + A(i + 2*n/4) &
             + A(i + 3*n/4)
end do

Four streams

    38 © 2004 IBM Corporation

    Folding Example

[Bar chart: Mflop/s (0–4000) vs. fold factor 0, 2, 4, 6, 8, 12]

1.3 GHz POWER4; single RHS; size 10,000,000

    39 © 2004 IBM Corporation

    Folding Example

[Bar chart: Mflop/s (0–9000) vs. loop length 1000, 13894, 2E+05, 3E+06; one series per fold factor 0, 2, 4, 6, 8]

1.3 GHz POWER4

    40 © 2004 IBM Corporation

    Folding Example

Beneficial for long loops
– Works at ~100,000 loads

Do not fold past a factor of 8

[Bar chart: Mflop/s (0–9000) vs. loop length 1000, 13894, 193069, 2682695]

    41 © 2004 IBM Corporation

    Stride Folding

Fold the loop to increase the number of outstanding cache line loads:

do i = 1, n, m
   sum = sum + A(i)
end do

do i = 1, n/4, m
   sum = sum + A(i)         &
             + A(i + 1*n/4) &
             + A(i + 2*n/4) &
             + A(i + 3*n/4)
end do

    42 © 2004 IBM Corporation

    Stride Folding Example

[Bar chart: Mflop/s (0–350) vs. fold factor 0, 2, 4, 6, 8, 12, 16]

1.3 GHz POWER4; single RHS; size 5,000,000; stride 32

    43 © 2004 IBM Corporation

    Stride Folding Example

Overlap outstanding loads
– Not limited by the number of stream buffers
– More than 8 RHS

[Bar chart: Mflop/s, 0–350]

    44 © 2004 IBM Corporation

    Managing Pipelines

Deep pipeline functional units:
– FMA
– Divide
– Square root

    45 © 2004 IBM Corporation

    Divide and Square Root

POWER4 special functions:
– Divide
– Sqrt

Use the FMA functional units
– 2 simultaneous divide or sqrt (or rsqrt)
– NOT pipelined

Instruction   Single   Double  (cycles)
FMA              6        6
Divide          33       33
Sqrt            40       40

    46 © 2004 IBM Corporation

    Hardware DIV, SQRT, RSQRT

[Bar chart: Mop/s (0–120) for hardware Divide, SQRT, RSQRT]

1.3 GHz p690

    47 © 2004 IBM Corporation

    Hardware and Pipelined DIV, SQRT, RSQRT

[Bar chart: Mop/s (0–120) for Divide, pipelined Divide, SQRT, pipelined SQRT, RSQRT, pipelined RSQRT]

1.3 GHz p690

    48 © 2004 IBM Corporation

    Vectorization Analysis

Dependencies

Compiler overhead:
– Generates (malloc) local temporary arrays
– Extra memory traffic

Moderate vector lengths required:

             REC   SQRT   RSQRT
Cross over    20     80      25
N 1/2         45     25      30

    49 © 2004 IBM Corporation

    Function Timings

              Hardware or scalar       Pipelined
Function      Clocks   Rate            Clocks   Rate
Reciprocal      32      81               20      130
SQRT            36      69               31       84
RSQRT           66      39               29       90
EXP            200      13               32       83
LOG            288       9               34       77

(Rates in Mop/s.)

    50 © 2004 IBM Corporation

    Function Rates

[Bar chart: Mop/s (0–140) for Reciprocal, SQRT, RSQRT, EXP, LOG; scalar vs. vector]

    51 © 2004 IBM Corporation

    SQRT

POWER4 has hardware SQRT available
The default -qarch=com uses a software library
Use: -qarch=pwr4 (see the compile sketch below)
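A hedged example (the file name, program body, and the -qtune companion flag are illustrative):

      ! Compile with: xlf90 -O3 -qarch=pwr4 -qtune=pwr4 hwsqrt.f90
      program hwsqrt
        implicit none
        real(kind=8) :: x(5)
        integer :: i
        x = (/ (real(i,8), i = 1, 5) /)
        ! With -qarch=pwr4 the compiler can use the hardware
        ! fsqrt instruction instead of a software library call.
        print *, sqrt(x)
      end program hwsqrt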

[Bar chart: SQRT Mop/s (0–90) for -qarch=com, -qarch=pwr4, and -qhot]

    52 © 2004 IBM Corporation

    Divide

IEEE divide specifies an actual divide
– By default the compiler does not replace a divide with multiply by the reciprocal
Optimize with -O3

do i = 1, n
   B(i) = A(i)/s
end do

rs = 1/s
do i = 1, n
   B(i) = A(i)*rs
end do

    53 © 2004 IBM Corporation

    Divide

[Bar chart: Mop/s (0–140) for DIV at -O2, -O2 -qunroll, -O3]

1.1 GHz POWER4

    54 © 2004 IBM Corporation

    Vector Intrinsics

-qhot generates "vector" calls to vector intrinsic functions
Monitor with -qreport=hotlist

do i = 1, n
   B(i) = func(A(i))
end do

becomes

call __vfunc(B, A, n)

    55 © 2004 IBM Corporation

    Example of Pipelined Functions

do i = -nbdy+3, n+nbdy-1
   prl = qrprl(i)
   pll = qrmrl(i)
   pavg = vtmp1(i)
   wllfac(i) = .5*gammp1*pavg + gamma*pll
   wrlfac(i) = .5*gammp1*pavg + gamma*prl
   hrholl = rho(1,i-1)
   hrhorl = rho(1,i)
   wll(i) = 1/sqrt(hrholl*wllfac(i))
   wrl(i) = 1/sqrt(hrhorl*wrlfac(i))
end do

    56 © 2004 IBM Corporation

    Example of Pipelined Functions

allocate(t1(n+2*nbdy-3))
allocate(t2(n+2*nbdy-3))
do i = -nbdy+3, n+nbdy-1
   prl = qrprl(i)
   ...
   t1(i) = hrholl*wllfac(i)
   t2(i) = hrhorl*wrlfac(i)
end do
call __vrsqrt(wll, t1, n+2*nbdy-3)
call __vrsqrt(wrl, t2, n+2*nbdy-3)

    57 © 2004 IBM Corporation

    Vector Intrinsics

[Bar chart: Mop/s (0–100) for REC, SQRT, RSQRT, EXP, LOG; -O3 vs. -O3 -qhot]

1.3 GHz POWER4

    58 © 2004 IBM Corporation

    32-bit versus 64-bit Addressing

64-bit address mode enables use of 64-bit integer arithmetic
Integer arithmetic, especially integer(kind=8) in Fortran or long in C, is much faster with -q64 (see the sketch below)
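A minimal sketch (names, values, and the compile line are illustrative):

      ! Compile with: xlf90 -q64 -O3 imul64.f90
      program imul64
        implicit none
        integer(kind=8) :: a, b, c
        a = 123456789_8
        b = 987654_8
        c = a*b   ! one native 64-bit multiply under -q64
        print *, c
      end program imul64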


    59 © 2004 IBM Corporation

    Integer Arithmetic

[Bar chart: Mop/s (0–1600) for I4*I4, I4*I8, I8*I8; -q32 vs. -q64]

    60 © 2004 IBM Corporation

    32-bit Floating Point Arithmetic

– Faster
– Less bandwidth required
– Arithmetic operations are the same speed as 64-bit
– More efficient use of cache
(see the sketch below)
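A minimal sketch (names illustrative): the same loop moves half the bytes with REAL*4 data, while each arithmetic operation runs at the same speed as its REAL*8 counterpart.

      program fp32
        implicit none
        integer, parameter :: n = 1000000
        real(kind=4) :: A(n), B(n)
        integer :: i
        B = 1.0e0
        ! 8 bytes moved per iteration here vs. 16 with REAL*8
        do i = 1, n
           A(i) = 2.0e0*B(i) + 1.0e0
        end do
        print *, A(1)
      end program fp32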