Performance Optimization
Getting your programs to run faster
CS 691
Why optimize
Better turn-around on jobs
Run more programs/scenarios
Release resources to other applications
You want the job to finish before you retire
Ways to get more performance
Run on bigger, faster hardware: clock speed, more memory, …
Tweak your algorithm
Optimize your code
Loop Unrolling
Converting passes of a loop into in-line streams of code
Useful when loops do calculations on data in arrays
Unrolling can take advantage of pipelined processing units in processors
Compiler may preload operands into CPU registers
Loop Unrolling – disadvantages
May be limited by the number of floating-point registers
Pentium III: 8
Pentium 4: 8
Itanium: 128
Loop Unrolling – simple example
Loop
do i=1,n
a(i) = b(i) +x*c(i)
enddo
Unrolled Loop (assumes n is a multiple of 4)
do i=1,n,4
a(i) = b(i) +x*c(i)
a(i+1) = b(i+1) +x*c(i+1)
a(i+2) = b(i+2) +x*c(i+2)
a(i+3) = b(i+3) +x*c(i+3)
enddo
Loop Unrolling – simple example
Performance – Rolled
P3 550 MHz – 13 Mflops
Itanium – 30 Mflops
Performance – Unrolled
P3 550 MHz – 30 Mflops
Itanium – 107 Mflops
*from: LCI and NCSA
Loop Unrolling
int a[100];
for (i=0;i<100;i++){
a[i] = a[i] * 2;
}
int a[100];
for (i=0;i<100;i+=5){
a[i] = a[i] * 2;
a[i+1]=a[i+1]*2;
a[i+2]=a[i+2]*2;
a[i+3]=a[i+3]*2;
a[i+4]=a[i+4]*2;
}
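The unrolled loop above works because 100 is a multiple of 5. A minimal sketch of the general case, assuming an unroll factor of 4 and an arbitrary length n, adds a cleanup loop for the leftover elements. The function name double_unrolled is hypothetical and not from the slides.

```c
#include <stddef.h>

/* Doubles every element of a[0..n-1], unrolled by a factor of 4.
   Hypothetical helper for illustration only. */
void double_unrolled(int *a, size_t n)
{
    size_t i = 0;
    size_t limit = n - (n % 4);   /* largest multiple of 4 that is <= n */

    /* Main unrolled body: four elements per pass. */
    for (; i < limit; i += 4) {
        a[i]     = a[i]     * 2;
        a[i + 1] = a[i + 1] * 2;
        a[i + 2] = a[i + 2] * 2;
        a[i + 3] = a[i + 3] * 2;
    }

    /* Cleanup loop: the 0 to 3 elements left over. */
    for (; i < n; i++) {
        a[i] = a[i] * 2;
    }
}
```

Calling double_unrolled(a, 100) on the array above gives the same result as the rolled loop; the cleanup loop only does work when n is not a multiple of 4.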
Loop unrolling
int a[10][10];
for (i=0;i<10;i++){
for (j=0;j<10;j++) {
a[i][j] = a[i][j] *2;
} }
int a[10][10];
for (i=0;i<10;i++){
a[i][0]=a[i][0]*2; a[i][1]=a[i][1]*2;
a[i][2]=a[i][2]*2; a[i][3]=a[i][3]*2;
a[i][4]=a[i][4]*2; a[i][5]=a[i][5]*2;
a[i][6]=a[i][6]*2; a[i][7]=a[i][7]*2;
a[i][8]=a[i][8]*2; a[i][9]=a[i][9]*2;
}
Loop unrolling – Dot Product
float a[100];
float b[100];
float z = 0;
for (i=0;i<100;i++){
z = z + a[i] * b[i];
}
float a[100];
float b[100];
float z = 0;
for (i=0;i<100;i+=2){
z = z + a[i] * b[i];
z = z + a[i+1] * b[i+1];
}
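A variation on the unrolled dot product, sketched here as an assumption rather than something the slides show, keeps two separate partial sums so the two multiply-adds in each pass do not depend on each other. The name dot_unrolled2 is hypothetical. Splitting the sum changes the order of the floating-point additions, so the result can differ in the last bits from the rolled loop.

```c
#include <stddef.h>

/* Dot product of a[0..n-1] and b[0..n-1], unrolled by 2 with two
   independent accumulators. Hypothetical helper for illustration only. */
float dot_unrolled2(const float *a, const float *b, size_t n)
{
    float z0 = 0.0f;   /* partial sum over even indices */
    float z1 = 0.0f;   /* partial sum over odd indices */
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        z0 += a[i]     * b[i];
        z1 += a[i + 1] * b[i + 1];
    }

    /* Cleanup: one element remains when n is odd. */
    if (i < n) {
        z0 += a[i] * b[i];
    }

    return z0 + z1;
}
```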
Unrolling Loops
You can do it automatically
Unrolling Loops – compiler options
GNU Compilers
-funroll-loops
-funroll-all-loops (not recommended)
PGI Compilers
-Munroll
-Munroll=c:N
-Munroll=n:M
Unrolling Loops – Compiler Options
Intel Compilers
-unrollM (up to M times)
-unroll
Taking Memory in Order
Optimizing the use of cache
Row major order vs column major order
row major: a(1,1), a(1,2), a(1,3), a(2,1), a(2,2), …
column major: a(1,1), a(2,1), a(3,1), a(1,2), a(2,2), …
Taking Memory in Order
Remember C and Fortran store arrays in the opposite manner
C – row major
Fortran – column major
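The two layouts can be written as offset formulas. The sketch below assumes zero-based indices i (row) and j (column) into an nrows by ncols array stored in one flat block; the function names are hypothetical.

```c
#include <stddef.h>

/* Row major (C): elements of one row are adjacent in memory. */
size_t offset_row_major(size_t i, size_t j, size_t nrows, size_t ncols)
{
    (void)nrows;              /* not needed for the row-major formula */
    return i * ncols + j;
}

/* Column major (Fortran): elements of one column are adjacent in memory. */
size_t offset_col_major(size_t i, size_t j, size_t nrows, size_t ncols)
{
    (void)ncols;              /* not needed for the column-major formula */
    return j * nrows + i;
}
```

Stepping j by one moves one element in row major order but nrows elements in column major order, which is why the fast-running loop index should match the language's layout.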
Taking Memory in Order
[Figure: array memory layout in C and in Fortran]
Taking Memory in Order
do i=1,m
do j=1,n
a(i,j)=b(i,j)+c(i)
end do
end do
do j=1,n
do i=1,m
a(i,j)=b(i,j)+c(i)
end do
end do
First loop (inner loop over j):
•loop time: 23.42
•loop runs at 4.48 Mflops
Second loop (inner loop over i):
•loop time: 2.80
•loop runs at 37.48 Mflops
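In C the preferred order is the reverse of the Fortran case, because C is row major. The sketch below shows the same computation with both loop orders, assuming small fixed sizes M and N chosen only for illustration; the function names are hypothetical. Both produce identical results, and only the memory access pattern differs.

```c
enum { M = 4, N = 5 };   /* illustrative sizes, not from the slides */

/* Inner loop walks the last index: consecutive memory in C. */
void add_inner_j(float a[M][N], float b[M][N], float c[M])
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            a[i][j] = b[i][j] + c[i];
        }
    }
}

/* Inner loop walks the first index: each access jumps N floats ahead. */
void add_inner_i(float a[M][N], float b[M][N], float c[M])
{
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < M; i++) {
            a[i][j] = b[i][j] + c[i];
        }
    }
}
```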
Floating Point Division
FP division is very expensive in terms of processor time
20-60 clock cycles to compute
Usually not pipelined
FP division required by IEEE “rules”
Floating point division – use reciprocal
float a[100];
for (i=0;i<100;i++){
a[i]=a[i]/2;
}
float a[100];
float denom;
denom = 1.0f/2.0f;
for (i=0;i<100;i++){
a[i]=a[i]*denom;
}
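A minimal sketch of the reciprocal technique for a divisor known only at run time, assuming the divisor is nonzero. The function name divide_by_reciprocal is hypothetical. Two cautions apply: in C, writing the reciprocal with integer literals, as in 1/2, is integer division and evaluates to 0, so floating-point literals are needed; and multiplying by a rounded reciprocal can differ from true division in the last bit unless the reciprocal is exactly representable (for example, a power of two).

```c
#include <stddef.h>

/* Divides every element of a[0..n-1] by divisor using one division
   outside the loop and one multiply per element.
   Hypothetical helper; assumes divisor != 0. */
void divide_by_reciprocal(float *a, size_t n, float divisor)
{
    float recip = 1.0f / divisor;   /* the only division performed */

    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] * recip;
    }
}
```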
Compiler options for IEEE Compatibility
PGI Compilers
-Knoieee
Intel Compilers
-mp
GNU Compilers
can’t do
Floating Point Division
Compilers can’t optimize if divisor is not scalar
Breaks IEEE “rules”
May impact portability
Function Inlining
Build functions/subroutines in as inline parts of the program’s code…
… rather than as separate functions/subroutines
Minimizes function calls (and management of…)
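Inlining can also be requested in the source. The sketch below uses the standard C99 `static inline` qualifier on a small helper called from a loop; the names scaled_sum and fill are hypothetical. The qualifier is a hint, and the compiler is free to ignore it or to inline functions that lack it.

```c
/* Small helper that is a good inlining candidate: the call overhead
   would otherwise be comparable to the work it does. */
static inline float scaled_sum(float b, float x, float c)
{
    return b + x * c;
}

/* a[i] = b[i] + x*c[i] for i in 0..n-1, written with the helper. */
void fill(float *a, const float *b, const float *c, float x, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = scaled_sum(b[i], x, c[i]);
    }
}
```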
Function Inlining
Compile with – -Minline
compiler tries to inline what it can (meet compiler criteria)
-Minline=except:func excludes func from inlining
-Minline=func inline only func
Function Inlining
…Compile with –
-Minline=myfile.lib
inlines functions from inline library file
-Minline=levels:n
inlines functions up to n levels of calls
usually default = 1
MPI Tuning
Minimize messages
Pointers/counts
MPI derived datatypes
MPI_Pack/MPI_Unpack
Using shared memory for message passing
#PBS -l nodes=6:ppn=1 … but…
#PBS -l nodes=3:ppn=2 … is better.
Compiler optimizations
-O0 – no optimization
-O1 – local optimization, register allocation
-O2 – local/limited global optimization
-O3 – aggressive global optimization
-Munroll – loop unrolling
-Mvect – vectorization
-Minline – function inlining
gcc Compiler Optimizations
See: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html