Performance Optimization
Getting your programs to run faster
CS 691

Why optimize

• Better turn-around on jobs
• Run more programs/scenarios
• Release resources to other applications
• You want the job to finish before you retire

Ways to get more performance

• Run on bigger, faster hardware: clock speed, more memory, …
• Tweak your algorithm
• Optimize your code

Loop Unrolling

• Converts passes of a loop into in-line streams of code
• Useful when loops do calculations on data in arrays
• Unrolling can take advantage of pipelined processing units in processors
• Compiler may preload operands into CPU registers

Loop Unrolling – disadvantages

• May be limited by the number of floating-point registers:
  Pentium III: 8
  Pentium 4: 8
  Itanium: 128

Loop Unrolling – simple example

Loop

do i=1,n
   a(i) = b(i) + x*c(i)
enddo

Unrolled Loop (as written, assumes n is a multiple of 4)

do i=1,n,4
   a(i)   = b(i)   + x*c(i)
   a(i+1) = b(i+1) + x*c(i+1)
   a(i+2) = b(i+2) + x*c(i+2)
   a(i+3) = b(i+3) + x*c(i+3)
enddo

Loop Unrolling – simple example

Performance, rolled
• P3 550 MHz: 13 Mflops
• Itanium: 30 Mflops

Performance, unrolled
• P3 550 MHz: 30 Mflops
• Itanium: 107 Mflops

*from: LCI and NCSA

Loop Unrolling

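Rolled:

int a[100];
int i;

for (i = 0; i < 100; i++) {
    a[i] = a[i] * 2;
}

Unrolled by 5 (100 is a multiple of 5, so no leftover elements remain):

int a[100];
int i;

for (i = 0; i < 100; i += 5) {
    a[i]   = a[i]   * 2;
    a[i+1] = a[i+1] * 2;
    a[i+2] = a[i+2] * 2;
    a[i+3] = a[i+3] * 2;
    a[i+4] = a[i+4] * 2;
}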
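The unrolled loop above works because the array length is a multiple of the unroll factor. When the length is not known in advance, a common approach is to add a short cleanup loop for the leftover elements. The following is a minimal sketch of that idea, not part of the original slides; the function name `scale_unrolled` and the unroll factor of 4 are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Doubles every element of a[0..n-1], unrolled by 4.
 * scale_unrolled is a hypothetical name used only for this sketch. */
void scale_unrolled(int *a, size_t n)
{
    size_t i = 0;

    /* Main body: handle 4 elements per pass while at least 4 remain. */
    for (; i + 4 <= n; i += 4) {
        a[i]     *= 2;
        a[i + 1] *= 2;
        a[i + 2] *= 2;
        a[i + 3] *= 2;
    }

    /* Cleanup: the 0 to 3 elements left over when n is not a multiple of 4. */
    for (; i < n; i++) {
        a[i] *= 2;
    }
}
```

Calling `scale_unrolled(a, 10)` runs the main body twice (elements 0 to 7) and the cleanup loop twice (elements 8 and 9).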

Loop unrolling

Rolled:

int a[10][10];
int i, j;

for (i = 0; i < 10; i++) {
    for (j = 0; j < 10; j++) {
        a[i][j] = a[i][j] * 2;
    }
}

Inner loop fully unrolled:

int a[10][10];
int i;

for (i = 0; i < 10; i++) {
    a[i][0] = a[i][0] * 2;  a[i][1] = a[i][1] * 2;
    a[i][2] = a[i][2] * 2;  a[i][3] = a[i][3] * 2;
    a[i][4] = a[i][4] * 2;  a[i][5] = a[i][5] * 2;
    a[i][6] = a[i][6] * 2;  a[i][7] = a[i][7] * 2;
    a[i][8] = a[i][8] * 2;  a[i][9] = a[i][9] * 2;
}

Loop unrolling – Dot Product

Rolled:

float a[100];
float b[100];
float z = 0;   /* the sum must start at zero */
int i;

for (i = 0; i < 100; i++) {
    z = z + a[i] * b[i];
}

Unrolled by 2:

float a[100];
float b[100];
float z = 0;
int i;

for (i = 0; i < 100; i += 2) {
    z = z + a[i]   * b[i];
    z = z + a[i+1] * b[i+1];
}
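In the unrolled dot product above, both statements still update the same variable z, so the second addition has to wait for the first. One common variant, shown here as a hedged sketch rather than as the slides' method, keeps a separate partial sum per unrolled statement so the two multiply-adds are independent. The function name `dot_unrolled` is hypothetical. Note that this changes the order of the floating-point additions, so the result may differ from the rolled loop in the last few bits.

```c
#include <stddef.h>

/* Dot product of a and b, unrolled by 2 with two independent partial sums.
 * dot_unrolled is a hypothetical name used only for this sketch. */
float dot_unrolled(const float *a, const float *b, size_t n)
{
    float z0 = 0.0f;   /* sum of the even-indexed products */
    float z1 = 0.0f;   /* sum of the odd-indexed products  */
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        z0 += a[i]     * b[i];
        z1 += a[i + 1] * b[i + 1];
    }

    /* Cleanup: one element is left over when n is odd. */
    if (i < n) {
        z0 += a[i] * b[i];
    }

    return z0 + z1;
}
```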

Unrolling Loops

The compiler can unroll loops for you automatically

Unrolling Loops – compiler options

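GNU Compilers
• -funroll-loops
• -funroll-all-loops (not recommended)

PGI Compilers
• -Munroll
• -Munroll=c:N (completely unroll loops with a constant trip count of N or less)
• -Munroll=n:M (unroll other loops M times)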

Unrolling Loops – Compiler Options

Intel Compilers
• -unrollM (unroll up to M times)
• -unroll

Taking Memory in Order

Optimizing the use of cache: row major order vs. column major order

• Row major (last index varies fastest in memory):
  a(1,1), a(1,2), a(1,3), a(2,1), a(2,2), …
• Column major (first index varies fastest in memory):
  a(1,1), a(2,1), a(3,1), a(1,2), a(2,2), …


Remember that C and Fortran store arrays in the opposite manner:
• C: row major
• Fortran: column major


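Fortran example

C-style loop order (the inner loop steps through the second index, a poor match for Fortran's column-major storage):

do i=1,m
   do j=1,n
      a(i,j)=b(i,j)+c(i)
   end do
end do

•loop time: 23.42
•loop runs at 4.48 Mflops

Fortran-style loop order (the inner loop steps through the first index, so memory is visited in storage order):

do j=1,n
   do i=1,m
      a(i,j)=b(i,j)+c(i)
   end do
end do

•loop time: 2.80
•loop runs at 37.48 Mflops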
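Since C stores arrays in row-major order, the cache-friendly loop order in C is the reverse of the Fortran one. The sketch below shows the same update written both ways in C. The function names and the 256 x 256 array size are assumptions made for illustration, and no timings are claimed for it.

```c
#define ROWS 256
#define COLS 256

/* Inner loop walks the last index: consecutive accesses are adjacent in
 * memory, because C stores arrays in row-major order. */
void add_in_storage_order(double a[ROWS][COLS], double b[ROWS][COLS],
                          const double c[ROWS])
{
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            a[i][j] = b[i][j] + c[i];
        }
    }
}

/* Same arithmetic, but the inner loop walks the first index, so each
 * access lands COLS elements away from the previous one. */
void add_strided(double a[ROWS][COLS], double b[ROWS][COLS],
                 const double c[ROWS])
{
    for (int j = 0; j < COLS; j++) {
        for (int i = 0; i < ROWS; i++) {
            a[i][j] = b[i][j] + c[i];
        }
    }
}
```

Both functions compute identical results; only the order in which memory is touched differs, which is what the cache is sensitive to.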

Floating Point Division

• FP division is very expensive in terms of processor time
• 20-60 clock cycles to compute
• Usually not pipelined
• FP division is required by IEEE “rules”

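Floating point division – use reciprocal

Division inside the loop:

float a[100];
int i;

for (i = 0; i < 100; i++) {
    a[i] = a[i] / 2;
}

Multiplication by the reciprocal:

float a[100];
float denom;
int i;

denom = 1.0f / 2;   /* 1/2 would be integer division and give 0 */

for (i = 0; i < 100; i++) {
    a[i] = a[i] * denom;
}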
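The same idea applies when the divisor is only known at run time: divide once outside the loop, then multiply inside it. The following is a minimal sketch under that assumption; the function name `divide_all` is hypothetical. For divisors that are not powers of two, multiplying by the reciprocal can differ from a true division in the last bit, which is why a compiler following the IEEE rules will not make this change on its own.

```c
#include <stddef.h>

/* Divides every element of a[0..n-1] by d using one division in total.
 * divide_all is a hypothetical name used only for this sketch. */
void divide_all(float *a, size_t n, float d)
{
    const float recip = 1.0f / d;   /* the only division, hoisted out of the loop */

    for (size_t i = 0; i < n; i++) {
        a[i] *= recip;              /* a multiply replaces a per-element divide */
    }
}
```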

Compiler options for IEEE compatibility
• PGI Compilers: -Knoieee
• Intel Compilers: -mp
• GNU Compilers: can’t do



• Compilers can’t optimize if the divisor is not scalar
• Breaks IEEE “rules”
• May impact portability

Function Inlining

• Build functions/subroutines in as inline parts of the program's code…
• … rather than as separate functions/subroutines
• Minimizes function calls (and management of…)
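As a hedged illustration of the kind of function that benefits, the sketch below marks a tiny helper as `static inline` in C. The names `square` and `sum_of_squares` are hypothetical, and `inline` is only a hint: the compiler decides whether the call is actually replaced by the function body.

```c
/* square is a hypothetical helper used only for this sketch. Its body is
 * far smaller than the cost of a call, which makes it a good candidate. */
static inline double square(double x)
{
    return x * x;
}

/* Sums the squares of v[0..n-1]. If square() is inlined, the loop body
 * becomes a single multiply-add with no call overhead. */
double sum_of_squares(const double *v, int n)
{
    double s = 0.0;

    for (int i = 0; i < n; i++) {
        s += square(v[i]);
    }
    return s;
}
```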

Function Inlining

Compile with:
• -Minline
  compiler tries to inline what it can (functions that meet the compiler's criteria)
• -Minline=except:func
  excludes func from inlining
• -Minline=func
  inlines only func

Function Inlining

… Compile with:
• -Minline=myfile.lib
  inlines functions from the inline library file
• -Minline=levels:n
  inlines functions up to n levels of calls; the default is usually 1

MPI Tuning

• Minimize messages
• Pointers/counts
• MPI derived datatypes
• MPI_Pack/MPI_Unpack
• Using shared memory for message passing:
  #PBS -l nodes=6:ppn=1 … but …
  #PBS -l nodes=3:ppn=2 … is better.

Compiler optimizations

• -O0: no optimization
• -O1: local optimization, register allocation
• -O2: local/limited global optimization
• -O3: aggressive global optimization
• -Munroll: loop unrolling
• -Mvect: vectorization
• -Minline: function inlining

gcc Compiler Optimizations

See: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
