© 2005 IBM

Louisiana State University
Baton Rouge, Louisiana

Charles Grassl
IBM

June, 2005

pSeries Program Optimization

Agenda

• Programming Concerns
• Tactics

Understanding Performance

• Fused floating multiply add functional units
  • Latency
  • Balance
• Memory Access
  • Stride
  • Latency
  • Bandwidth

FMA Functional Unit

• Multiply and Add
• Fused
• 6 clock period latency

D = A+B*C
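
The 6 clock period latency matters most when each multiply-add depends on the one before it. As a minimal sketch in C (the function names are illustrative and not from the slides), the first routine forms a single dependent chain of D = A + B*C operations, while the second keeps four independent partial sums so that several multiply-adds can be in the pipeline at once. Whether the compiler emits fused instructions, and how many chains are needed to cover the latency, depends on the compiler and processor. The second routine also changes the order of the additions, so rounding can differ slightly.

```c
#include <assert.h>
#include <stddef.h>

/* One dependent chain: every multiply-add waits for the previous sum. */
double dot_single_chain(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s = s + x[i] * y[i];            /* D = A + B*C */
    return s;
}

/* Four independent chains: up to four multiply-adds can overlap. */
double dot_four_chains(const double *x, const double *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)                  /* cleanup: leftover elements */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}
```

Both routines return the same value whenever the products and sums are exactly representable, for example with small integer inputs.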

Performance Expectations

• Hundreds of Mflop/s to 2-4 Gflop/s
  • Limitations:
    • System bandwidth
    • Program model
• Floating point operations
  • Adds and Multiplies
  • Divides
• Memory access
  • Copying
    • 100 Mbyte/s - 10 Gbyte/s
  • Strided access is slow
  • Multiple streams

Performance Expectations Example: Program POP

• Structured Fortran 90
• Data moves
• Low Computational Intensity
• Low FMA Percentage

Metric                                        Value
Computational intensity                       0.869
FMA percentage                                53 %
Floating point instructions + FMA rate        372 Mflip/s
Floating point instructions + FMAs            10141 M
HW floating point instructions per cycle      0.3
Instructions per cycle                        0.9
Instructions per load/store                   2.9

Program POP: Computation Time Distribution

[Bar chart: % Time (axis 0 to 12) per routine. Routine labels as extracted: __state_mod_MOD_st, __vmix_rich_MOD_vm, tracer_update, clinic, __hmix_del2_MOD_hd, __vertical_mix_MOD_i, __vertical_mix_MOD_i, __diagnostics_MOD_d]

Performance Expectations Example: Program SPPM

• Loose Fortran 77
• Optimized
• High Computational Intensity
• Vector intrinsics
• Low FMA Percentage

Metric                                        Value
Computational intensity                       1.8
FMA percentage                                55 %
Floating point instructions + FMA rate        979 Mflip/s
Floating point instructions + FMAs            191752
HW floating point instructions per cycle      0.7
Instructions per cycle                        1.2
Instructions per load/store                   2.9

Program SPPM: Computation Time Distribution

[Bar chart: % Time (axis 0 to 60) per routine: sppm, difuze, __vsrec_GP, interf, dintrf, __vsrsqrt_630, hydyz, hydzy]

Programming Concerns

• Superscalar design
  • Concurrency:
    • Branches
    • Loop control
    • Procedure calls
    • Nested “if” statements
  • Program “control” is very efficient
• Cache based microprocessor
  • Critical resource: memory bandwidth
• Tactics:
  • Increase Computational Intensity
  • Exploit "prefetch"

Strategies for Optimization

• Exploit multiple functional units:
• Expose load/store "streams"
• Optimized libraries
• Pipelined operations

Tactics for Optimization

• Cache reuse
  • Blocking
• Unit stride
  • Use entire loaded cache lines
• Limit range of indirect addressing
  • Sort indirect addresses
• Increase computational intensity of the loops
  • Ratio of flops to bytes
• Load streams

Computational Intensity

• Ratio of floating point operations to memory references (loads and stores)
• Higher is better
• Example:

    for (i = 0; i < n; i++)
        A[i] = A[i] + B[i]*s + C[i];

  • Loads and stores: 3+1
  • Floating point operations: 3
  • Computational intensity: 3/4
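
The count above can be written down as a small helper. This is a sketch only: the `loop_counts` struct and both function names are hypothetical, and the counts are entered by hand from reading the loop body, not measured.

```c
#include <assert.h>

/* Hand-counted operations for one iteration of a loop body. */
struct loop_counts {
    int loads;
    int stores;
    int flops;
};

/* Computational intensity = flops / (loads + stores). */
double computational_intensity(struct loop_counts c)
{
    assert(c.loads + c.stores > 0);
    return (double)c.flops / (double)(c.loads + c.stores);
}

/* The example loop: 3 loads (A, B, C), 1 store (A), 3 flops per iteration. */
void update(double *A, const double *B, const double *C, double s, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        A[i] = A[i] + B[i] * s + C[i];
}
```

With counts of 3 loads, 1 store, and 3 flops the helper returns 0.75, matching the 3/4 on the slide.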

Loop Unrolling Strategy: Increase Computational Intensity

• Find variable which is constant with respect to outer loop
  • Unroll such that this variable is loaded once but used multiple times
• Unroll outer loop
  • Minimizes load/stores

Outer Loop Unroll

    DO I = 1, N
      DO J = 1, N
        s = s + X(J)*A(J,I)
      END DO
    END DO

• 2 flops / 2 loads
• Comp. Int.: 1

Unroll:

    DO I = 1, N, 4
      DO J = 1, N
        s = s + X(J)*A(J,I+0) + X(J)*A(J,I+1) &
              + X(J)*A(J,I+2) + X(J)*A(J,I+3)
      END DO
    END DO

• 8 flops / 5 loads
• Comp. Int.: 1.6
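
The unrolled Fortran loop assumes N is a multiple of 4. A hedged C sketch of the same idea, with a cleanup loop for leftover iterations, might look as follows. The function names and the flat row-major layout are assumptions made for illustration; in C the contiguous index is the last one, so the roles of the two subscripts are swapped relative to the Fortran.

```c
#include <assert.h>
#include <stddef.h>

/* a holds n rows of n doubles; row i starts at a + i*n (assumed layout). */

/* Reference loop nest: 2 loads (x[j], a[i][j]) per 2 flops. */
double sum_xa(const double *a, const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += x[j] * a[i * n + j];
    return s;
}

/* Outer loop unrolled by 4: x[j] is loaded once and used four times. */
double sum_xa_unroll4(const double *a, const double *x, size_t n)
{
    assert(n == 0 || (a != NULL && x != NULL));
    double s = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const double *r0 = a + (i + 0) * n;
        const double *r1 = a + (i + 1) * n;
        const double *r2 = a + (i + 2) * n;
        const double *r3 = a + (i + 3) * n;
        for (size_t j = 0; j < n; j++) {
            double xj = x[j];
            s += xj * r0[j] + xj * r1[j] + xj * r2[j] + xj * r3[j];
        }
    }
    for (; i < n; i++)                  /* cleanup: leftover rows */
        for (size_t j = 0; j < n; j++)
            s += x[j] * a[i * n + j];
    return s;
}
```

The unrolled version adds terms in a different order, so results can differ in the last bits for general floating point data.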

Outer Loop Unroll Test

[Bar chart: Mflop/s (axis 0 to 500) for outer loop unroll depths 1, 2, 4, 8, and 12, compiled with -O3 and with -O3 -qhot. Shown twice in the deck, grouped first by compiler option and then by unroll depth. 1.3 GHz POWER4]

Loop Unroll Analysis

• Strategy:
  • Unroll up to 8 times
    • Exploit 8 prefetch streams
• Compiler:
  • Unrolls up to 4 times
    • “Near” optimal performance
  • Combines inner and outer loop unrolling

Loop Unrolling Strategies

• Inner loop strategy:
  • Reduce data dependency
  • Eliminate intermediate loads and stores
  • Expose functional units
  • Expose registers
  • Examples:
    • Linear recurrences
    • Simple loops
      • Single operation

Inner Loop Unroll: Dependencies

• Eliminate data dependence (half)
• Eliminate intermediate loads and stores
  • Compiler will do some of this at -O3 and higher

    do i=2,n-1
      a(i+1) = a(i)*s1 + a(i-1)*s2
    end do

    do i=2,n-2,2
      a(i+1) = a(i  )*s1 + a(i-1)*s2
      a(i+2) = a(i+1)*s1 + a(i  )*s2
    end do
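
A C sketch of the same recurrence, under the assumption of 0-based arrays, shows how the two most recent values can stay in scalars so that they are not reloaded from memory in the next statement. The function names are illustrative, and the single cleanup step handles an odd trip count, which the Fortran version above does not.

```c
#include <assert.h>
#include <stddef.h>

/* Reference recurrence: a[i+1] = a[i]*s1 + a[i-1]*s2 for i = 1 .. n-2. */
void recurrence(double *a, size_t n, double s1, double s2)
{
    for (size_t i = 1; i + 1 < n; i++)
        a[i + 1] = a[i] * s1 + a[i - 1] * s2;
}

/* Unrolled by 2: prev and cur carry a[i-1] and a[i] in registers. */
void recurrence_unroll2(double *a, size_t n, double s1, double s2)
{
    assert(n == 0 || a != NULL);
    if (n < 3)
        return;
    double prev = a[0], cur = a[1];
    size_t i = 1;
    for (; i + 2 < n; i += 2) {
        double next1 = cur * s1 + prev * s2;    /* a[i+1] */
        double next2 = next1 * s1 + cur * s2;   /* a[i+2] */
        a[i + 1] = next1;
        a[i + 2] = next2;
        prev = next1;
        cur = next2;
    }
    if (i + 1 < n)                              /* cleanup: one element left */
        a[i + 1] = cur * s1 + prev * s2;
}
```

With s1 = s2 = 1 and a[0] = a[1] = 1, both routines fill the array with Fibonacci numbers, which gives a quick check that the unrolled form computes the same values.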

Inner Loop Unroll: Dependencies

[Bar chart: Mflop/s (axis 0 to 350) for manual unroll depths 1, 2, and 4, compiled with -qnounroll and with -qunroll. Shown twice in the deck, grouped first by compiler option and then by unroll depth]

Inner Loop Unroll: Dependencies

• Compiler unrolling helps
  • Does not help manually unrolled loops
    • Conflicts
• Turn off compiler unrolling if manually unrolled

Inner Loop Unroll: Registers and Functional Units

• Expose functional units
• Expose registers
  • Compiler will do some of this at -O3 and higher

    do j=1,n
      do i=1,m
        sum = sum + X(i)*A(i,j)
      end do
    end do

    do j=1,n
      do i=1,m,2
        sum = sum + X(i)  *A(i,j)   &
                  + X(i+1)*A(i+1,j)
      end do
    end do

Inner Loop Unroll: Registers and Functional Units

[Bar chart: Mflop/s (axis 0 to 500) for manual unroll depths 1, 2, 4, and 8, compiled with -qnounroll and with -qunroll. Shown twice in the deck, grouped first by compiler option and then by unroll depth. 1.45 GHz POWER4]

Inner Loop Unroll: Registers and Functional Units

• Compiler does an adequate job of unrolling

• Do not manually unroll inner loop

Outer Loop Unrolling Strategies

• Expose prefetch streams
  • Up to 8 streams

    do j=1,n
      do i=1,m
        sum = sum + X(i)*A(i,j)
      end do
    end do

    do j=1,n,2
      do i=1,m
        sum = sum + X(i)*A(i,j)   &
                  + X(i)*A(i,j+1)
      end do
    end do

Outer Loop Unroll: Streams

[Bar chart: Mflop/s (axis 0 to 800) for manual outer loop unroll depths 1, 2, 4, and 8, compiled with -qnounroll and with -qunroll. Shown twice in the deck, grouped first by compiler option and then by unroll depth]

Outer Loop Unroll: Streams

• Compiler does a “fair” job of outer loop unrolling

• Manual unrolling can outperform compiler unrolling

Strides: Cache Lines

• Strided memory accesses use partial cache lines
  • Reduced efficiency

Strided Memory Access

    for (i = 0; i < n; i += 2)
        sum += a[i];

[Diagram: Memory]

Strides

• Cache line size is 128 bytes
• Double precision: 16 words
• Single precision: 32 words

Bandwidth Reduction

Stride    Single    Double
1         1x        1x
2         1/2       1/2
4         1/4       1/4
8         1/8       1/8
16        1/16      1/16
32        1/32      1/16
64        1/32      1/16
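
The table follows a simple rule: the useful fraction of each loaded line is 1/stride until the stride reaches the number of words per line, after which every access touches a new line and the fraction stays at 1/(words per line). A small C sketch of that rule, with hypothetical function names and the 128-byte line size taken from the slide:

```c
#include <assert.h>

enum { CACHE_LINE_BYTES = 128 };    /* line size stated on the slide */

/* Number of elements of the given size that share one cache line. */
int elems_per_line(int elem_bytes)
{
    assert(elem_bytes > 0 && CACHE_LINE_BYTES % elem_bytes == 0);
    return CACHE_LINE_BYTES / elem_bytes;
}

/* Returns d such that the useful bandwidth fraction is 1/d. */
int bandwidth_divisor(int stride, int elem_bytes)
{
    assert(stride != 0);
    int s = stride < 0 ? -stride : stride;
    int per_line = elems_per_line(elem_bytes);
    return s < per_line ? s : per_line;
}
```

For example, `bandwidth_divisor(32, 8)` gives 16 for double precision and `bandwidth_divisor(32, 4)` gives 32 for single precision, matching the stride 32 row. The rule reproduces the power-of-two strides in the table; other strides are only approximated.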

Stride Test

[Chart: Mbyte/s (axis 0 to 3000) versus stride; stride axis labels -16, -10, -6, -2, 1, 4, 8, 12. POWER4 1.3 GHz]

Correcting Strides

• Interleave code
• Example:
  • Real and Imaginary part arithmetic

    do i=1,n
      … AIMAG(Z(i))
    end do
    do i=1,n
      … REAL(Z(i))
    end do

    do i=1,n
      … AIMAG(Z(i))
      …
      … REAL(Z(i))
    end do

Correcting Strides

• Interchange loops

    do i=1,n
      do j=1,m
        sum=sum+A(i,j)
      end do
    end do

Interchange:

    do j=1,m
      do i=1,n
        sum=sum+A(i,j)
      end do
    end do
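
In C the situation is mirrored, because C arrays are row-major and Fortran arrays are column-major: the unit-stride loop is the one over the last subscript. A minimal sketch, assuming a flat row-major array and illustrative function names:

```c
#include <assert.h>
#include <stddef.h>

/* a is rows x cols, row-major: element (i, j) is a[i*cols + j]. */

/* Inner loop over i: consecutive accesses are cols elements apart. */
double sum_column_order(const double *a, size_t rows, size_t cols)
{
    assert(rows == 0 || cols == 0 || a != NULL);
    double sum = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += a[i * cols + j];
    return sum;
}

/* Interchanged: inner loop over j, so accesses are unit stride. */
double sum_row_order(const double *a, size_t rows, size_t cols)
{
    assert(rows == 0 || cols == 0 || a != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j];
    return sum;
}
```

Both functions visit the same elements; only the order, and therefore the memory stride, differs.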

Correcting Strides: Loop Interchange

[Bar chart: Mflop/s (axis 0 to 160) for -O3, -O3 -qhot, and Interchanged. 1.45 GHz POWER4]

Effect of Large Strides (TLB Misses)

[Chart: Mbyte/s (axis 0 to 450) versus Large Stride; stride axis labels 8, 16, 32, 68, 132, 260, 516, 1028, 2000]

Working Set Size

• Reduce spanning size of random memory accesses

• More MPI tasks usually help

Random Memory Access

[Chart: G update/s (axis 0 to 70) versus Work Set Size (Mbyte): 1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

Blocking

• Common technique in Linear Algebra (LA)
  • Similar to unrolling
  • Utilize cache lines
  • Linear Algebra NB:
    • Typically 96-256

Blocking

[Diagram: Blocking]

Blocking Example: Transpose

• Especially useful for bad strides

    do i = 1,n
      do j = 1,m
        B(j,i) = A(i,j)
      end do
    end do

Blocking:

    do j1 = 1,m,nb
      j2 = min(j1+nb-1,m)
      do i1 = 1,n,nb
        i2 = min(i1+nb-1,n)
        do i = i1, i2
          do j = j1, j2
            B(j,i) = A(i,j)
          end do
        end do
      end do
    end do
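
A C sketch of the blocked transpose, assuming flat row-major arrays and a hypothetical function name, handles matrices whose dimensions are not multiples of the block size by clipping the last block in each direction:

```c
#include <assert.h>
#include <stddef.h>

/* a is n x m (row-major), b is m x n (row-major); sets b[j][i] = a[i][j].
   nb is the block size; the last block in each direction is clipped. */
void transpose_blocked(double *b, const double *a,
                       size_t n, size_t m, size_t nb)
{
    assert(nb > 0);
    assert(n == 0 || m == 0 || (a != NULL && b != NULL));
    for (size_t i1 = 0; i1 < n; i1 += nb) {
        size_t i2 = (i1 + nb < n) ? i1 + nb : n;
        for (size_t j1 = 0; j1 < m; j1 += nb) {
            size_t j2 = (j1 + nb < m) ? j1 + nb : m;
            /* Transpose one nb x nb tile while it is cache resident. */
            for (size_t i = i1; i < i2; i++)
                for (size_t j = j1; j < j2; j++)
                    b[j * n + i] = a[i * m + j];
        }
    }
}
```

A suitable `nb` is machine dependent and is best found by measurement.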

Blocking Example: Transpose

[Chart: Mbyte/s (axis 0 to 1800) versus Blocking Size: 1, 32, 64, 128, 192, and dgetmi]

Hardware Prefetch

• Detects adjacent cache line references
• Forward and backward
• Up to eight concurrent streams
• Prefetches up to two lines ahead per stream
• Twelve prefetch filter queues prevent rolling
• No prefetch on store misses
  • (when a store instruction causes a cache line miss)

Prefetch: Stride Pattern Recognition

• Upon a cache miss:
  • Biased guess is made as to the direction of that stream
  • Guess is based upon where in the cache line the address associated with that miss occurred
    • If it is in the first 3/4, then the direction is guessed as ascending
    • If in the last 1/4, the direction is guessed descending
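
The direction rule above can be modeled in a few lines of C. This is only an illustration of the rule as stated on the slide, not a description of the actual hardware logic; the function name, the enum, and the `line_bytes` parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum direction { DESCENDING = -1, ASCENDING = 1 };

/* Guess the stream direction from the offset of the missing address
   within its cache line: first 3/4 ascending, last 1/4 descending. */
enum direction guess_direction(uintptr_t miss_addr, uintptr_t line_bytes)
{
    assert(line_bytes > 0 && line_bytes % 4 == 0);
    uintptr_t offset = miss_addr % line_bytes;
    return offset < 3 * (line_bytes / 4) ? ASCENDING : DESCENDING;
}
```

With 128-byte lines, a miss at offset 95 is guessed as ascending and a miss at offset 96 as descending.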

Memory Bandwidth

[Chart: Mbyte/s versus number of Right Hand Sides (1, 2, 4, 6, 8, 10, 12) on 1.3 GHz POWER4. Shown three times in the deck: Small Pages (axis 0 to 3000), Large Pages (axis 0 to 5000), and both together (axis 0 to 5000)]

Memory Bandwidth

• Bandwidth is proportional to number of streams
  • Streams are roughly the number of right hand side arrays
  • Up to eight streams

Exploiting Prefetch

• Merge Loops
• Strategy
  • Combine loops to get up to 8 right hand sides

    for (j = 1; j <= n; j++)
        A[j] = A[j-1] + B[j];

    for (j = 1; j <= n; j++)
        D[j] = D[j+1] + C[j]*s;

    for (j = 1; j <= n; j++) {
        A[j] = A[j-1] + B[j];
        D[j] = D[j+1] + C[j]*s;
    }

Loop Merge Example

[Bar chart: Mflop/s (axis 0 to 400) for Original, -O3 -qhot, and Merged. 1.3 GHz POWER4]

Folding

• Fold loop to increase number of streams
• Strategy
  • Fold loops to get up to 4 times

    do i = 1,n
      sum = sum + A(i)
    end do

    do i = 1,n/4
      sum = sum + A(i      ) &
                + A(i+1*n/4) &
                + A(i+2*n/4) &
                + A(i+3*n/4)
    end do
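
The folded Fortran loop assumes n is divisible by 4. A C sketch with a cleanup loop, under a hypothetical function name, makes the four concurrent streams explicit as four pointers into the same array:

```c
#include <assert.h>
#include <stddef.h>

/* Sum a[0..n-1] by walking four quarters of the array at once,
   which presents four sequential load streams to the prefetcher. */
double sum_folded4(const double *a, size_t n)
{
    assert(n == 0 || a != NULL);
    size_t q = n / 4;
    const double *p0 = a;
    const double *p1 = a + q;
    const double *p2 = a + 2 * q;
    const double *p3 = a + 3 * q;
    double sum = 0.0;
    for (size_t i = 0; i < q; i++)
        sum += p0[i] + p1[i] + p2[i] + p3[i];
    for (size_t i = 4 * q; i < n; i++)  /* cleanup: n not divisible by 4 */
        sum += a[i];
    return sum;
}
```

Folding changes the order of the additions, so the result can differ from the simple loop in the last bits.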

Folding Example

[Chart: Mflop/s (axis 0 to 4000) versus Folds: 0, 2, 4, 6, 8, 12. 1.3 GHz POWER4]

Effect of Precision

• Available floating point formats:
  • real (kind=4)
  • real (kind=8)
  • real (kind=16)
• Advantage of smaller data types:
  • Require less bandwidth
  • More effective cache use

    REAL*8 A,pi,e
    …
    do i=1,n
      A(i) = pi*A(i) + e
    end do

    REAL*4 A,pi,e
    …
    do i=1,n
      A(i) = pi*A(i) + e
    end do

Effect of Precision

[Bar chart: Mflop/s (axis 0 to 1600) for REAL*4, REAL*8, and REAL*16, with series Small (10000) and Large (90M). 1.3 GHz POWER4]

Divide and Sqrt

• POWER4 special functions:
  • Divide
  • Sqrt
• Use FMA functional unit
  • 2 simultaneous divide or sqrt (or rsqrt)
  • NOT pipelined

Instruction    Single (Cycles)    Double (Cycles)
Fma            6                  6
Fdiv           32                 32
Fsqrt          38                 38
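
Because divide and square root are not pipelined, a rough upper bound on their rate follows directly from the cycle counts. The sketch below assumes each unit is busy for the full latency of every operation and that two units work at once, as the slide states; the function name is hypothetical.

```c
#include <assert.h>

/* Peak rate, in millions of operations per second, of a non-pipelined
   operation: each unit completes one operation every cycles_per_op cycles. */
double peak_mops(double clock_mhz, int units, int cycles_per_op)
{
    assert(clock_mhz > 0.0 && units > 0 && cycles_per_op > 0);
    return clock_mhz * units / cycles_per_op;
}
```

At 1300 MHz with two units, `peak_mops(1300.0, 2, 32)` gives 81.25 million divides per second, and the 38-cycle square root comes out near 68 million per second.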

Hardware DIV, SQRT, RSQRT

[Bar chart: Mflop/s for Divide, SQRT, and RSQRT instructions on 1.3 GHz POWER4. Shown three times in the deck: Hardware (axis 0 to 80), Software Pipelined (axis 0 to 120), and both together (axis 0 to 120)]

Intrinsic Function Vectorization

    do i = -nbdy+3,n+nbdy-1
      prl = qrprl(i)
      pll = qrmrl(i)
      pavg = vtmp1(i)
      wllfac(i) = 5*gammp1*pavg + gamma*pll
      wrlfac(i) = 5*gammp1*pavg + gamma*prl
      hrholl = rho(1,i-1)
      hrhorl = rho(1,i)
      wll(i) = 1/sqrt(hrholl * wllfac(i))
      wrl(i) = 1/sqrt(hrhorl * wrlfac(i))
    end do

Intrinsic Function Vectorization

    allocate(t1(-nbdy+3:n+nbdy-1))
    allocate(t2(-nbdy+3:n+nbdy-1))
    do i = -nbdy+3,n+nbdy-1
      prl = qrprl(i)
      ...
      t1(i) = hrholl * wllfac(i)
      t2(i) = hrhorl * wrlfac(i)
    end do
    call __vrsqrt(t1,wll,n+2*nbdy-3)
    call __vrsqrt(t2,wrl,n+2*nbdy-3)
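
The shape of this transformation can be sketched in C: the loop is split so that the function arguments are first written to a temporary array, and a single vector call then processes the whole array. Everything named here is hypothetical. In particular `vec_rec_sketch` is a plain-C stand-in that computes reciprocals rather than reciprocal square roots, chosen only to keep the sketch free of library dependencies; it is not the routine the compiler calls.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a vector library routine: out[i] = 1 / in[i]. */
static void vec_rec_sketch(double *out, const double *in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = 1.0 / in[i];
}

/* Scalar form: the reciprocal is evaluated inside the main loop. */
void scale_scalar(double *w, const double *rho, const double *fac, size_t n)
{
    for (size_t i = 0; i < n; i++)
        w[i] = 1.0 / (rho[i] * fac[i]);
}

/* Vectorized form: fill a temporary, then make one vector call.
   Returns 0 on success, -1 if the temporary cannot be allocated. */
int scale_vectorized(double *w, const double *rho, const double *fac, size_t n)
{
    if (n == 0)
        return 0;
    assert(w != NULL && rho != NULL && fac != NULL);
    double *t = malloc(n * sizeof *t);  /* temporary: extra memory traffic */
    if (t == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        t[i] = rho[i] * fac[i];
    vec_rec_sketch(w, t, n);
    free(t);
    return 0;
}
```

The allocation and the extra pass over the temporary are overhead that the vector routine has to win back, which is why short loops may not benefit.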

Vectorization Analysis

• Dependencies
• Compiler overhead:
  • Generate (malloc) local temporary arrays
  • Extra memory traffic
• Moderate vector lengths required

                    REC    SQRT    RSQRT
N½                  45     25      30
Crossover Length    20     80      25