University of Amsterdam
Computer Systems – optimizing program performance Arnoud Visser 1
Computer Systems
Optimizing program performance
Performance can make the difference
Recommendations by Patrick van der Smagt (1991) for neural-net implementations:
• Use pointers instead of array indices
• Use doubles instead of floats
• Optimize inner loops
Performance gain
• A factor of 10 can easily be gained
• We now know how programs are executed:
  – Load/use hazards (20% of load instructions → 1 bubble)
  – Mispredicted branches (40% of jmp instructions → 2 bubbles)
  – Return from procedure calls (100% of ret instructions → 3 bubbles)
→ Directions for optimizing procedures and loops! The gain has to be measured.
Amdahl's Law
• If a part of the system initially consumed a fraction α of the execution time, speeding up this part of the code by a factor k gives an overall speedup S that is much less than k:

$$S = \frac{1}{(1 - \alpha) + \alpha / k}$$
When we speed up a part of a program, the effect on the overall performance is limited by the significance of that part
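The formula can be checked numerically with a small helper. This is a minimal sketch; the function name `amdahl_speedup` and the sample values in the note below are illustrative assumptions, not part of the slides.

```c
#include <assert.h>

/* Overall speedup S when a fraction alpha (between 0 and 1) of the
 * execution time is sped up by a factor k (> 0).
 * amdahl_speedup is a hypothetical name chosen for this sketch. */
double amdahl_speedup(double alpha, double k)
{
    assert(alpha >= 0.0 && alpha <= 1.0 && k > 0.0);
    return 1.0 / ((1.0 - alpha) + alpha / k);
}
```

For example, with α = 0.6 and k = 3 the overall speedup is 1 / (0.4 + 0.2) ≈ 1.67, and no matter how large k becomes, S stays below 1 / (1 − α) = 2.5.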
Recipe for optimizing
• Use a profiler to find the most-used procedure
• Optimize the inner loop of that procedure

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];
Optimizing Compilers
• Provide efficient mapping to machine
  – register allocation
  – code selection and ordering
  – eliminating minor inefficiencies
• Have difficulty with “optimization blockers”
  – potential memory aliasing
  – potential procedure side-effects
Manual solution
• Code movement

Before:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

After (n*i moved out of the inner loop):
for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

Most compilers do a good job with array code and simple loop structures.
Compiler's solution
– As long as no optimization blockers are present, compilers can’t be beaten

Source:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

Generated assembly:
    imull %ebx,%eax             # i*n
    movl 8(%ebp),%edi           # a
    leal (%edi,%eax,4),%edx     # p = a+i*n (scaled by 4)
# Inner Loop
.L40:
    movl 12(%ebp),%edi          # b
    movl (%edi,%ecx,4),%eax     # b+j (scaled by 4)
    movl %eax,(%edx)            # *p = b[j]
    addl $4,%edx                # p++ (scaled by 4)
    incl %ecx                   # j++
    jl .L40                     # loop if j<n

Equivalent C:
for (i = 0; i < n; i++) {
    int ni = n*i;
    int *p = a+ni;
    for (j = 0; j < n; j++)
        *p++ = b[j];
}
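The three forms of this loop (plain indexing, hand-hoisted n*i, and the pointer form the compiler arrives at) should fill the array identically. A minimal sketch that wraps each form in a function so they can be compared; the function names and the parameter layout are assumptions made for illustration.

```c
/* Indexing form: n*i + j is recomputed for every element. */
void set_rows_index(int *a, const int *b, int n)
{
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];
}

/* Manual code motion: n*i is computed once per row. */
void set_rows_hoisted(int *a, const int *b, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        int ni = n*i;
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }
}

/* Pointer form: a running pointer replaces the index arithmetic. */
void set_rows_pointer(int *a, const int *b, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        int *p = a + n*i;
        for (j = 0; j < n; j++)
            *p++ = b[j];
    }
}
```

All three copy b into each of the n rows of the n×n array a; only the amount of address arithmetic written out in the source differs. Whether one is actually faster on a given compiler has to be measured.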
Memory Aliasing
• twiddle(&x, &x), i.e. both arguments point to the same int x:
  – twiddle1: x becomes 4x
  – twiddle2: x becomes 3x

void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}
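The difference only shows up when the two pointers alias. The sketch below repeats the two functions and adds two small drivers; `run_aliased` and `run_distinct` are hypothetical helpers introduced here for illustration.

```c
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}

/* Hypothetical helper: both arguments point to the same variable. */
int run_aliased(void (*twiddle)(int *, int *), int x)
{
    twiddle(&x, &x);
    return x;
}

/* Hypothetical helper: the arguments point to separate variables. */
int run_distinct(void (*twiddle)(int *, int *), int x, int y)
{
    twiddle(&x, &y);
    return x;
}
```

With distinct variables both functions compute x + 2y. With aliased arguments twiddle1 doubles x twice (4x) while twiddle2 adds 2x once (3x), so the compiler may not replace one with the other.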
Side effects
• With f(x) { return counter++; } and counter initially 0, calling with x = 0 gives:
  – func1(0) = 0 + 1 + 2 + 3 = 6
  – func2(0) = 4 * 0 = 0

int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}

int func2(int x)
{
    return 4 * f(x);
}
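A compilable sketch of this example follows. The global `counter` and the body of `f` are taken from the slide; `reset_counter` and `get_counter` are hypothetical helpers added so the two functions can be compared from the same starting state.

```c
static int counter = 0;   /* global state modified by f */

/* f has a side effect: every call increments counter. */
int f(int x)
{
    (void)x;              /* the argument is not used */
    return counter++;
}

int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);   /* four calls */
}

int func2(int x)
{
    return 4 * f(x);                    /* one call */
}

/* Hypothetical helpers for experimenting with the example. */
void reset_counter(void) { counter = 0; }
int get_counter(void) { return counter; }
```

Starting from counter = 0, func1 returns 6 and leaves counter at 4, while func2 returns 0 and leaves counter at 1. The order in which the four calls in func1 are evaluated is unspecified in C, but the sum is the same in any order.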
Limitations for Compilers
• Operate under a fundamental constraint
  – Must not cause any change in program behavior under any possible condition
  – Often prevents the compiler from making optimizations that would only affect behavior under pathological conditions
• Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  – e.g., data ranges may be more limited than variable types suggest
• Most analysis is performed only within procedures
  – whole-program analysis is too expensive in most cases
• Most analysis is based only on static information
  – compiler has difficulty anticipating run-time inputs
• When in doubt, the compiler must be conservative
Machine-independent versus machine-dependent optimizations
– Machine-independent optimizations (do these regardless of processor / compiler)
  • Code motion (out of the loop)
  • Reducing procedure calls
  • Eliminating unneeded memory usage
  • Sharing common sub-expressions
– Machine-dependent optimizations
  • Pointer code
  • Unrolling
  • Enabling instruction-level parallelism
Optimization Example
• Procedure
  – Compute aggregate OPER of all elements of vector
  – Store result at destination location
• Integer addition, clock cycles per element (CPE)
  – 42.06 (compiled -g), 31.25 (compiled -O2)

void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}
Move Call Out of Loop
• Optimization
  – Move call to vec_length out of inner loop
    • Value does not change from one iteration to next
    • Function calls are expensive
  – CPE: 20.66 (compiled -O2)
    • vec_length() requires 10 clock cycles

void combine2(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

int vec_length(vec_ptr v)
{
    return v->len;
}
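To experiment with combine1 and combine2, the vector abstraction has to be filled in. The sketch below is one possible completion under stated assumptions: `data_t` is taken to be int with `IDENT` 0 and `OPER` + (matching the integer-addition timings), the struct layout is inferred from the `v->len` and `v->data` accesses on the slides, and `new_counting_vec` is a hypothetical constructor.

```c
#include <stdlib.h>

typedef int data_t;   /* assumed element type */
#define IDENT 0       /* assumed: identity element for addition */
#define OPER  +       /* assumed: integer addition */

/* Assumed layout; the slides only show v->len and v->data. */
typedef struct {
    int len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Hypothetical constructor: vector holding 1, 2, ..., len. */
vec_ptr new_counting_vec(int len)
{
    vec_ptr v;
    int i;
    if (len <= 0)
        return NULL;
    v = malloc(sizeof(vec_rec));
    if (v == NULL)
        return NULL;
    v->data = malloc((size_t)len * sizeof(data_t));
    if (v->data == NULL) {
        free(v);
        return NULL;
    }
    v->len = len;
    for (i = 0; i < len; i++)
        v->data[i] = i + 1;
    return v;
}

int vec_length(vec_ptr v)
{
    return v->len;
}

/* Bounds-checked element access: returns 0 on a bad index, 1 on success. */
int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* vec_length is called on every iteration. */
void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

/* vec_length is called once, before the loop. */
void combine2(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}
```

Both versions return the same result; combine2 simply avoids one procedure call per element. The compiler cannot make this change on its own unless it can prove that vec_length has no side effects and that the length does not change inside the loop.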
Bypass data-abstraction
• Optimization
  – Avoid procedure call to retrieve each vector element
    • Get pointer to start of array before loop
    • Within loop just do pointer reference
    • Not as clean in terms of data abstraction
  – CPE: 6.00 (compiled -O2)
    • get_vec_element() requires 14 clock cycles
    • Bounds checking is expensive

void combine3(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < length; i++)
        *dest = *dest OPER data[i];
}

int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}
Eliminate Unneeded Memory Refs
• Optimization
  – Don’t need to store in destination until end
  – Local variable sum held in register
  – Avoids 1 memory read, 1 memory write per iteration
  – CPE: 2.00 (compiled -O2)
• Memory references are expensive!

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = IDENT;
    for (i = 0; i < length; i++)
        sum = sum OPER data[i];
    *dest = sum;
}
Why did the compiler do that?
• Different behavior due to memory aliasing
  – Call combine(v, get_vec_start(v) + 2) with OPER * (IDENT 1) and v = [2,3,5]
  – combine3: [2,3,5] → [2,3,1] → [2,3,2] → [2,3,6] → [2,3,36]
  – combine4: [2,3,5] → [2,3,5] → [2,3,5] → [2,3,5] → [2,3,30]
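The trace above can be reproduced with plain arrays. This is a simplified sketch, not the vector-based code from the slides: the functions take the data pointer and length directly, multiplication with identity 1 is written out instead of using OPER and IDENT, and the names `combine3_array` and `combine4_array` are hypothetical.

```c
/* combine3 style: accumulates directly in *dest, so every
 * iteration reads and writes memory through dest. */
void combine3_array(int *data, int length, int *dest)
{
    int i;
    *dest = 1;
    for (i = 0; i < length; i++)
        *dest = *dest * data[i];
}

/* combine4 style: accumulates in a local variable and
 * stores to *dest only once, after the loop. */
void combine4_array(int *data, int length, int *dest)
{
    int i;
    int prod = 1;
    for (i = 0; i < length; i++)
        prod = prod * data[i];
    *dest = prod;
}
```

When dest points at the last element of the array [2,3,5], the combine3 style overwrites that element before it is read and ends with 36, while the combine4 style ends with 30. With a separate destination both produce 30. Because the two versions differ under aliasing, the compiler is not allowed to turn one into the other.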
Machine Independent
• Code motion
  – Reduce frequency with which computation is performed
    • If it will always produce the same result
    • Especially moving expensive code out of loop
Conclusion
How should I write my programs, given that I have a good, optimizing compiler?
• Don’t: smash code into oblivion
  – Hard to read, maintain, & assure correctness
• Do:
  – Select best algorithm & data representation
  – Write code that’s readable & maintainable
    • Procedures, recursion, without built-in constant limits
    • Even though these factors can slow down code
• Focus on inner loops
  – Detailed optimization means detailed measurement
Assignment
• Practice Problems
  – Practice Problem 5.1: ‘What effect has the call swap(&xp, &xp)?’
  – Practice Problem 5.3: ‘Indicate the number of function calls in 3 fragments’