
    © 2004 IBM

Charles Grassl, IBM

    August, 2004

    pSeries Optimization

    2 © 2004 IBM Corporation

    Agenda

Optimization
– Strategies and techniques
– Bandwidth Exploitation
– Pipelining


    3 © 2004 IBM Corporation

    POWER4 Performance Expectations

Floating point:
– Hundreds to thousands of Mflop/s
– Limitations:
  – System bandwidth
  – Program model
  – Floating point operations: adds and multiplies, divides
  – Memory access, copying

Bandwidth:
– 0.100 – 5 Gbyte/s
– Limitations:
  – Strided access is slow
– Enhancers:
  – Multiple streams

    4 © 2004 IBM Corporation

    Performance Expectations Example: program POP

Structured Fortran 90
Data movement
Low computational intensity
Low FMA percentage

Instructions per cycle                0.9
HW floating point inst. per cycle     0.3
Floating point instruction rate       372 Mflop/s
FMA percentage                        53%
Computational intensity               0.87


    5 © 2004 IBM Corporation

Program POP: Computation Time Distribution

[Bar chart: % of time per subroutine; y axis 0–12%]

    6 © 2004 IBM Corporation

    Performance Expectations Example: program SPPM

Fortran 77
Optimized
High computational intensity
Vector intrinsics
Low FMA percentage

Instructions per cycle                1.2
HW floating point inst. per cycle     0.7
Floating point instruction rate       979 Mflop/s
FMA percentage                        55%
Computational intensity               1.8

    7 © 2004 IBM Corporation

Program SPPM: Computation Time Distribution

[Bar chart: % of time per subroutine; y axis 0–60%]

    8 © 2004 IBM Corporation

    Bandwidth Exploitation

Computational intensity
– Increase the number of flops per memory reference

Blocking
– Reuse cache

Load streams
– Expose up to 8 memory access patterns

    9 © 2004 IBM Corporation

    Computational Intensity Examples

Loop                            Comp. Int.   Comment
A(i) = B(i) + C(i)                 0.33      SUM
A(i) = A(i) + s*B(i)               0.67      AXPY
A(i) = r*B(i) + s                  1         Scale
A(i) = r*B(i) + s*C(i)             1         Triad
A(i) = r + B(i)*(s + t*B(i))       2         Polynomial

    10 © 2004 IBM Corporation

    Computational Intensity Tests

[Bar chart: Mflop/s (0–1000) vs. computational intensity 0.33, 0.5, 0.67, 1, 1.33, 1.5, 2, 2.33, 3]

Data from 1.3 GHz p690

    11 © 2004 IBM Corporation

    Computational Intensity

Nominal bandwidth is 2.5 Gbyte/s for single-line loops
– ≈300 Mword/s (2.5 Gbyte/s ÷ 8 bytes per word)
– 300 Mflop/s for computational intensity of 1
– 600 Mflop/s for computational intensity of 2
– ...

Performance is (often) limited by bandwidth
– NOT by functional unit performance

    12 © 2004 IBM Corporation

    Computational Intensity Strategy: Loop Unrolling

Outer loop strategy:
– Increase computational intensity
– Minimize loads and stores

Find a variable that is constant with respect to the outer loop
– Unroll so that this variable is loaded once but used multiple times

    13 © 2004 IBM Corporation

    Outer Loop Unrolling Example

DO I = 1, N
   DO J = 1, N
      s = s + X(J)*A(J,I)
   END DO
END DO

2 flops / 2 loads
Comp. Int.: 1

DO I = 1, N, 4
   DO J = 1, N
      s = s + X(J)*A(J,I+0) + X(J)*A(J,I+1) &
            + X(J)*A(J,I+2) + X(J)*A(J,I+3)
   END DO
END DO

8 flops / 5 loads
Comp. Int.: 1.6
(remainder code for N not a multiple of 4 omitted)

    14 © 2004 IBM Corporation

    Outer Loop Unroll Test

[Bar chart: Mflop/s (0–600) vs. unroll factor 1, 2, 4, 8, 12; -O3 vs. -O3 -qhot]

    15 © 2004 IBM Corporation

    Outer Loop Unroll Analysis

Strategy:
– Expose 8 prefetch streams
– Unroll up to 8 times

Compiler:
– Unrolls 4 times
– Near-optimal performance
– Combines inner and outer loop unrolling

    16 © 2004 IBM Corporation

    Inner Loop Unrolling Strategies

Inner loop strategy:
– Reduce data dependencies
– Eliminate intermediate loads and stores
– Expose functional units
– Expose registers

Examples (see the sketch below):
– Linear recurrences
– Simple loops
– Single operation
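A minimal sketch of the idea (a DAXPY-style loop; the names and the unroll factor are illustrative, not from the original slides): unrolling the inner loop by 4 creates four independent update chains per iteration, and the remainder loop keeps the code correct for any N.

      program inner_unroll
        implicit none
        integer, parameter :: n = 1003
        real(kind=8) :: X(n), Y(n), s
        integer :: i, m
        X = 1.0d0
        Y = 0.0d0
        s = 3.0d0
        ! Remainder loop first, then the body unrolled by 4:
        ! four independent FMAs per pass expose both floating
        ! point units and more registers.
        m = mod(n, 4)
        do i = 1, m
           Y(i) = Y(i) + s*X(i)
        end do
        do i = m+1, n, 4
           Y(i)   = Y(i)   + s*X(i)
           Y(i+1) = Y(i+1) + s*X(i+1)
           Y(i+2) = Y(i+2) + s*X(i+2)
           Y(i+3) = Y(i+3) + s*X(i+3)
        end do
        print *, Y(n)
      end program inner_unroll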


    17 © 2004 IBM Corporation

    Software Pipelining

Exploit registers
– Number of registers in POWER4 is critical
– Some assistance from rename registers

Compiler does unrolling at -O3 and higher
– User unrolling can conflict with the compiler

    18 © 2004 IBM Corporation

    Software Pipelining

do i = 1, n
   sum = sum + X(i)
end do

do i = 1, n-3, 4
   sum1 = sum1 + X(i)
   sum2 = sum2 + X(i+1)
   sum3 = sum3 + X(i+2)
   sum4 = sum4 + X(i+3)
end do
sum = sum1 + sum2 + sum3 + sum4
(remainder iterations for n not a multiple of 4 omitted)

Explicit unrolling: expose more variables for register use

    19 © 2004 IBM Corporation

    Software Pipelining Example

[Bar chart: Mflop/s (0–900) vs. unroll factor 1, 2, 4, 6, 8, 12; compiled with -O2]

    20 © 2004 IBM Corporation

    Software Pipelining Example

[Bar chart: Mflop/s (0–1000) vs. unroll factor 1, 2, 4, 6, 8, 12; compiled with -O3]

    21 © 2004 IBM Corporation

    Software Pipelining Example

[Bar chart: Mflop/s (0–1000) vs. unroll factor 1, 2, 4, 6, 8, 12; -O2 vs. -O3]

    22 © 2004 IBM Corporation

    Software Pipelining Example

Conclusion:
– Avoid explicit user unrolling at -O3
– Allow the compiler to perform unrolling

[Bar chart: Mflop/s (0–1000) vs. unroll factor 1, 2, 4, 6, 8, 12]

    23 © 2004 IBM Corporation

    Computational Intensity: Effect of Cache

Large caches alleviate the memory bandwidth bottleneck
– Store parts of the computation stream in cache
– Reduce shared memory contention

hpmcount group 59 reports "loads and stores" from the processor
– It does not count loads satisfied from cache

Need an extension to the concept of computational intensity

    24 © 2004 IBM Corporation

    Computational Intensity: Effect of Cache

Example:

...
do j = 1, m
   call DAXPY(.., A(1,k), .., A(1,j), ..)
end do
...

subroutine daxpy(...)
do i = 1, n
   X(i) = X(i) + t*Y(i)
end do

One of the two columns, A(:,k), is reused from cache on every pass of the j loop, so the loop body effectively costs 2 flops per 2 memory references.
Computational intensity: 1. (A self-contained sketch follows.)
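A hedged, self-contained sketch of the pattern (my_daxpy and its argument order are illustrative, not the BLAS or the slide's interface; the order is chosen so that A(:,k) is the reused read-only operand, which is what makes the 2-references count work):

      program daxpy_cache
        implicit none
        integer, parameter :: n = 1000, m = 1000, k = 1
        real(kind=8) :: A(n,m), t
        integer :: j
        t = 2.0d0
        A = 1.0d0
        ! A(:,k) is read on every pass but stays cache resident;
        ! only the streamed column A(:,j) (load + store) generates
        ! memory traffic: 2 flops per 2 references.
        do j = 2, m
           call my_daxpy(n, t, A(1,j), A(1,k))
        end do
        print *, A(1,2)
      contains
        subroutine my_daxpy(n, t, X, Y)
          integer, intent(in) :: n
          real(kind=8), intent(in) :: t, Y(n)
          real(kind=8), intent(inout) :: X(n)
          integer :: i
          do i = 1, n
             X(i) = X(i) + t*Y(i)
          end do
        end subroutine my_daxpy
      end program daxpy_cache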


    25 © 2004 IBM Corporation

    Loop of DAXPY

[Line chart: Mop/s (0–2500) vs. vector length (0–2500); series: bandwidth, computation, bandwidth w/o cache, computation w/o cache]

Results from 1.5 GHz POWER4+

    26 © 2004 IBM Corporation

    Memory Access Strides

Strided memory access:
– Fewer words used per cache line
– Reduced cache line utilization
– Reduced bandwidth

Memory is accessed by cache line
– 16 REAL*8 (double) words per 128-byte line
(see the sketch below)
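A minimal sketch (names illustrative): a stride-16 reduction over REAL*8 data touches a new 128-byte cache line on every load, using only 1 of the 16 words each line delivers.

      program stride_sum
        implicit none
        integer, parameter :: n = 1000000, s = 16
        real(kind=8) :: A(n), total
        integer :: i
        A = 1.0d0
        total = 0.0d0
        ! Stride-s access: only 16/s of each 128-byte line is used
        do i = 1, n, s
           total = total + A(i)
        end do
        print *, total
      end program stride_sum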


    27 © 2004 IBM Corporation

    Stride Test Bandwidth

[Bar chart: Mbyte/s (0–2500) vs. stride -16, -14, -12, -10, -8, -6, -4, -2, -1, 1, 2, 4, 6, 8, 10, 12, 14, 16]

1.3 GHz POWER4

    28 © 2004 IBM Corporation

Strides and TLB

Strided memory access:
– Fewer memory references per memory page
– Increased TLB misses

TLB signature obtained from:
– hpmcount -g 59
– Memory references per TLB miss

do i = 1, m
   do j = 1, n
      A(i,j) = A(i,j) + ...   ! inner loop strides through A by the leading dimension
   end do
end do

    29 © 2004 IBM Corporation

    Large Stride Test

[Bar chart: Mbyte/s (0–450) vs. stride 8, 32, 128, 512, 2048; series: cache, TLB]

    30 © 2004 IBM Corporation

    Cache Blocking

Common technique in linear algebra
– Similar to unrolling
– Utilize cache lines

Typical linear algebra blocking factor NB: 96–256


    31 © 2004 IBM Corporation

    Blocking

[Diagram: blocked access pattern]

    32 © 2004 IBM Corporation

    Blocking

do i = 1, n
   do j = 1, m
      B(j,i) = A(i,j)
   end do
end do

do i1 = 1, n, nb
   i2 = min(i1+nb-1, n)
   do j1 = 1, m, nb
      j2 = min(j1+nb-1, m)
      do i = i1, i2
         do j = j1, j2
            B(j,i) = A(i,j)
         end do
      end do
   end do
end do

    33 © 2004 IBM Corporation

    Blocking Example: Transpose

[Bar chart: Mbyte/s (0–1800) vs. blocking factor 1, 32, 64, 128, 192, and dgetmo]

    34 © 2004 IBM Corporation

    Prefetch Strategies

Merge loops
– Combine conforming loops
– Compiler can do much of this

Folding
– Useful for very long loops
– Compiler can do much of this

    35 © 2004 IBM Corporation

    Merge Loops

Overlap cache line fetches
Expose memory prefetching (see the sketch below)

[Original C example truncated]
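The original C example was cut off in extraction; here is a minimal Fortran sketch of the same idea (array names are illustrative). Two conforming loops each make a separate pass over memory; the merged loop fetches all four arrays' cache lines in one pass:

      program merge_loops
        implicit none
        integer, parameter :: n = 1000000
        real(kind=8) :: A(n), B(n), C(n), D(n), s
        integer :: i
        A = 1.0d0; B = 0.0d0; C = 1.0d0; D = 0.0d0
        s = 2.0d0
        ! Before: two conforming loops, two passes over memory
        do i = 1, n
           B(i) = B(i) + s*A(i)
        end do
        do i = 1, n
           D(i) = D(i) + s*C(i)
        end do
        ! After: one merged loop; the cache line fetches of A, B,
        ! C and D overlap, exposing four prefetch streams at once
        do i = 1, n
           B(i) = B(i) + s*A(i)
           D(i) = D(i) + s*C(i)
        end do
        print *, B(1), D(1)
      end program merge_loops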


    37 © 2004 IBM Corporation

    Fold Loops

Increase the number of streams at the expense of loop length

do i = 1, n
   sum = sum + A(i)
end do

One stream per loop

do i = 1, n/4
   sum = sum + A(i)         &
             + A(i + 1*n/4) &
             + A(i + 2*n/4) &
             + A(i + 3*n/4)
end do

Four streams

    38 © 2004 IBM Corporation

    Folding Example

[Bar chart: Mflop/s (0–4000) vs. fold factor 0, 2, 4, 6, 8, 12]

1.3 GHz POWER4; single RHS; size 10,000,000

    39 © 2004 IBM Corporation

    Folding Example

[Bar chart: Mflop/s (0–9000) vs. loop length 1000, 13894, 2E+05, 3E+06; one series per fold factor 0, 2, 4, 6, 8]

1.3 GHz POWER4

    40 © 2004 IBM Corporation

    Folding Example

Beneficial for long loops
– Works at ~100,000 loads

Do not fold past a factor of 8

[Bar chart: Mflop/s (0–9000) vs. loop length 1000, 13894, 193069, 2682695]

    41 © 2004 IBM Corporation

    Stride Folding

Fold the loop to increase the number of outstanding cache line loads:

do i = 1, n, m
   sum = sum + A(i)
end do

do i = 1, n/4, m
   sum = sum + A(i)         &
             + A(i + 1*n/4) &
             + A(i + 2*n/4) &
             + A(i + 3*n/4)
end do

    42 © 2004 IBM Corporation

    Stride Folding Example

[Bar chart: Mflop/s (0–350) vs. fold factor 0, 2, 4, 6, 8, 12, 16]

1.3 GHz POWER4; single RHS; size 5,000,000; stride 32

    43 © 2004 IBM Corporation

    Stride Folding Example

Overlap outstanding loads
– Not limited by the number of stream buffers
– More than 8 RHS

[Bar chart: Mflop/s, 0–350]

    44 © 2004 IBM Corporation

    Managing Pipelines

Deep pipeline functional units:
– FMA
– Divide
– Square root

    45 © 2004 IBM Corporation

    Divide and Square Root

POWER4 special functions:
– Divide
– Sqrt

Use the FMA functional units
– 2 simultaneous divide or sqrt (or rsqrt)
– NOT pipelined

Instruction   Single   Double  (cycles)
FMA              6        6
Divide          33       33
Sqrt            40       40

    46 © 2004 IBM Corporation

    Hardware DIV, SQRT, RSQRT

[Bar chart: Mop/s (0–120) for hardware Divide, SQRT, RSQRT]

1.3 GHz p690

    47 © 2004 IBM Corporation

    Hardware and Pipelined DIV, SQRT, RSQRT

[Bar chart: Mop/s (0–120) for Divide, pipelined Divide, SQRT, pipelined SQRT, RSQRT, pipelined RSQRT]

1.3 GHz p690

    48 © 2004 IBM Corporation

    Vectorization Analysis

Dependencies

Compiler overhead:
– Generates (malloc) local temporary arrays
– Extra memory traffic

Moderate vector lengths required:

             REC   SQRT   RSQRT
Cross over    20     80      25
N 1/2         45     25      30

    49 © 2004 IBM Corporation

    Function Timings

              Hardware or scalar       Pipelined
Function      Clocks   Rate            Clocks   Rate
Reciprocal      32      81               20      130
SQRT            36      69               31       84
RSQRT           66      39               29       90
EXP            200      13               32       83
LOG            288       9               34       77

(Rates in Mop/s.)

    50 © 2004 IBM Corporation

    Function Rates

[Bar chart: Mop/s (0–140) for Reciprocal, SQRT, RSQRT, EXP, LOG; scalar vs. vector]

    51 © 2004 IBM Corporation

    SQRT

POWER4 has hardware SQRT available
The default -qarch=com uses a software library
Use: -qarch=pwr4 (see the compile sketch below)
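A hedged example (the file name, program body, and the -qtune companion flag are illustrative):

      ! Compile with: xlf90 -O3 -qarch=pwr4 -qtune=pwr4 hwsqrt.f90
      program hwsqrt
        implicit none
        real(kind=8) :: x(5)
        integer :: i
        x = (/ (real(i,8), i = 1, 5) /)
        ! With -qarch=pwr4 the compiler can use the hardware
        ! fsqrt instruction instead of a software library call.
        print *, sqrt(x)
      end program hwsqrt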

[Bar chart: SQRT Mop/s (0–90) for -qarch=com, -qarch=pwr4, and -qhot]

    52 © 2004 IBM Corporation

    Divide

IEEE divide specifies an actual divide
– By default the compiler does not replace a divide with multiply by the reciprocal
Optimize with -O3

do i = 1, n
   B(i) = A(i)/s
end do

rs = 1/s
do i = 1, n
   B(i) = A(i)*rs
end do

    53 © 2004 IBM Corporation

    Divide

[Bar chart: Mop/s (0–140) for DIV at -O2, -O2 -qunroll, -O3]

1.1 GHz POWER4

    54 © 2004 IBM Corporation

    Vector Intrinsics

-qhot generates "vector" calls to vector intrinsic functions
Monitor with -qreport=hotlist

do i = 1, n
   B(i) = func(A(i))
end do

becomes

call __vfunc(B, A, n)

    55 © 2004 IBM Corporation

    Example of Pipelined Functions

do i = -nbdy+3, n+nbdy-1
   prl = qrprl(i)
   pll = qrmrl(i)
   pavg = vtmp1(i)
   wllfac(i) = .5*gammp1*pavg + gamma*pll
   wrlfac(i) = .5*gammp1*pavg + gamma*prl
   hrholl = rho(1,i-1)
   hrhorl = rho(1,i)
   wll(i) = 1/sqrt(hrholl*wllfac(i))
   wrl(i) = 1/sqrt(hrhorl*wrlfac(i))
end do

    56 © 2004 IBM Corporation

    Example of Pipelined Functions

allocate(t1(n+2*nbdy-3))
allocate(t2(n+2*nbdy-3))
do i = -nbdy+3, n+nbdy-1
   prl = qrprl(i)
   ...
   t1(i) = hrholl*wllfac(i)
   t2(i) = hrhorl*wrlfac(i)
end do
call __vrsqrt(wll, t1, n+2*nbdy-3)
call __vrsqrt(wrl, t2, n+2*nbdy-3)

    57 © 2004 IBM Corporation

    Vector Intrinsics

[Bar chart: Mop/s (0–100) for REC, SQRT, RSQRT, EXP, LOG; -O3 vs. -O3 -qhot]

1.3 GHz POWER4

    58 © 2004 IBM Corporation

    32-bit versus 64-bit Addressing

64-bit address mode enables use of 64-bit integer arithmetic
Integer arithmetic, especially integer(kind=8) in Fortran or long in C, is much faster with -q64 (see the sketch below)
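A minimal sketch (names, values, and the compile line are illustrative):

      ! Compile with: xlf90 -q64 -O3 imul64.f90
      program imul64
        implicit none
        integer(kind=8) :: a, b, c
        a = 123456789_8
        b = 987654_8
        c = a*b   ! one native 64-bit multiply under -q64
        print *, c
      end program imul64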


    59 © 2004 IBM Corporation

    Integer Arithmetic

[Bar chart: Mop/s (0–1600) for I4*I4, I4*I8, I8*I8; -q32 vs. -q64]

    60 © 2004 IBM Corporation

    32-bit Floating Point Arithmetic

– Faster
– Less bandwidth required
– Arithmetic operations are the same speed as 64-bit
– More efficient use of cache
(see the sketch below)
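A minimal sketch (names illustrative): the same loop moves half the bytes with REAL*4 data, while each arithmetic operation runs at the same speed as its REAL*8 counterpart.

      program fp32
        implicit none
        integer, parameter :: n = 1000000
        real(kind=4) :: A(n), B(n)
        integer :: i
        B = 1.0e0
        ! 8 bytes moved per iteration here vs. 16 with REAL*8
        do i = 1, n
           A(i) = 2.0e0*B(i) + 1.0e0
        end do
        print *, A(1)
      end program fp32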