University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 1 Computer Systems Optimizing program performance



Performance can make the difference

Recommendations by Patrick van der Smagt (1991) for neural net implementations:

• Use pointers instead of array indices
• Use doubles instead of floats
• Optimize inner loops


Performance gain

• A factor of 10 can easily be gained
• We now know how programs are executed:
  – Load / use hazards (20% of load instr. → 1 bubble)
  – Mispredicted branches (40% of jmp instr. → 2 bubbles)
  – Return from procedure calls (100% of ret instr. → 3 bubbles)

→ Directions for optimizing procedures and loops! Gain has to be measured.


Amdahl's Law

• If a part of the system initially consumed a fraction α of the execution time, and this part of the code is sped up by a factor k, the overall speedup S is much less:

$$S = \frac{1}{(1 - \alpha) + \alpha / k}$$

When we speed up a part of a program, the effect on the overall performance is limited by the significance of that part


Recipe for optimizing

• Use a profiler to find the most-used procedure

• Optimize the inner loop of that procedure

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];


Optimizing Compilers

• Provide efficient mapping to machine
  – register allocation
  – code selection and ordering
  – eliminating minor inefficiencies

• Have difficulty with “optimization blockers”
  – potential memory aliasing
  – potential procedure side-effects


Manual solution

• Code movement

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

    for (i = 0; i < n; i++) {
        int ni = n*i;
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }

Most compilers do a good job with array code + simple loop structures


Compiler's solution

– As long as no optimization blockers are present, compilers can’t be beaten

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

        imull %ebx,%eax            # i*n
        movl 8(%ebp),%edi          # a
        leal (%edi,%eax,4),%edx    # p = a+i*n (scaled by 4)
    # Inner Loop
    .L40:
        movl 12(%ebp),%edi         # b
        movl (%edi,%ecx,4),%eax    # b+j (scaled by 4)
        movl %eax,(%edx)           # *p = b[j]
        addl $4,%edx               # p++ (scaled by 4)
        incl %ecx                  # j++
        jl .L40                    # loop if j<n

    for (i = 0; i < n; i++) {
        int ni = n*i;
        int *p = a+ni;
        for (j = 0; j < n; j++)
            *p++ = b[j];
    }


Memory Aliasing

• twiddle(&xp, &xp): both arguments point to the same location
  – twiddle1: result is 4 × xp
  – twiddle2: result is 3 × xp

    void twiddle1(int *xp, int *yp)
    {
        *xp += *yp;
        *xp += *yp;
    }

    void twiddle2(int *xp, int *yp)
    {
        *xp += 2 * *yp;
    }


Side effects

• int f(int x) { return counter++; } → calling func(0) with counter initially 0:
  – func1 = 0+1+2+3 = 6
  – func2 = 4*0 = 0

    int func1(int x)
    {
        return f(x) + f(x) + f(x) + f(x);
    }

    int func2(int x)
    {
        return 4 * f(x);
    }


Limitations for Compilers

• Operate Under Fundamental Constraint
  – Must not cause any change in program behavior under any possible condition
  – Often prevents optimizations that would only affect behavior under pathological conditions
• Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  – e.g., data ranges may be more limited than variable types suggest
• Most analysis is performed only within procedures
  – whole-program analysis is too expensive in most cases
• Most analysis is based only on static information
  – compiler has difficulty anticipating run-time inputs
• When in doubt, the compiler must be conservative


Machine-independent versus Machine-dependent optimizations

– Optimizations you should do regardless of processor / compiler
  • Code Motion (out of the loop)
  • Reducing procedure calls
  • Eliminating unneeded memory usage
  • Sharing common sub-expressions

– Machine-Dependent Optimizations
  • Pointer code
  • Unrolling
  • Enabling instruction level parallelism


Optimization Example

• Procedure
  – Compute aggregate OPER of all elements of vector
  – Store result at destination location

• Integer addition: Clock Cycles / Element (CPE)
  – 42.06 (Compiled -g)
  – 31.25 (Compiled -O2)

    void combine1(vec_ptr v, data_t *dest)
    {
        int i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest = *dest OPER val;
        }
    }


Move Call Out of Loop

• Optimization
  – Move call to vec_length out of inner loop
    • Value does not change from one iteration to next
    • Function calls are expensive
  – CPE: 20.66 (Compiled -O2)
    • vec_length() requires 10 clock cycles

    void combine2(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        *dest = IDENT;
        for (i = 0; i < length; i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest = *dest OPER val;
        }
    }

    int vec_length(vec_ptr v)
    {
        return v->len;
    }


Bypass data-abstraction

• Optimization
  – Avoid procedure call to retrieve each vector element
    • Get pointer to start of array before loop
    • Within loop just do pointer reference
    • Not as clean in terms of data abstraction
  – CPE: 6.00 (Compiled -O2)
    • get_vec_element() requires 14 clock cycles
    • Bounds checking is expensive

    void combine3(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        *dest = IDENT;
        for (i = 0; i < length; i++)
            *dest = *dest OPER data[i];
    }

    /* Bounds-checked element access used by combine1 and combine2 */
    int get_vec_element(vec_ptr v, int index, int *dest)
    {
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }


Eliminate Unneeded Memory Refs

• Optimization
  – Don’t need to store in destination until end
  – Local variable sum held in register
  – Avoids 1 memory read, 1 memory write per iteration
  – CPE: 2.00 (Compiled -O2)

• Memory references are expensive!

    void combine4(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        int sum = IDENT;
        for (i = 0; i < length; i++)
            sum = sum OPER data[i];
        *dest = sum;
    }


Why did the compiler do that?

• Different behavior due to memory aliasing
  – combine(v, get_vec_start(v)+2) with OPER *
  – combine3: [2,3,5] → [2,3,1] → [2,3,2] → [2,3,6] → [2,3,36]
  – combine4: [2,3,5] → [2,3,5] → [2,3,5] → [2,3,5] → [2,3,30]


Machine Independent

• Code Motion
  – Reduce frequency with which a computation is performed
    • If it will always produce the same result
    • Especially moving expensive code out of a loop


Conclusion

How should I write my programs, given that I have a good, optimizing compiler?

• Don’t: Smash Code into Oblivion
  – Hard to read, maintain, & assure correctness
• Do:
  – Select best algorithm & data representation
  – Write code that’s readable & maintainable
    • Procedures, recursion, without built-in constant limits
    • Even though these factors can slow down code
• Focus on Inner Loops
  – Detailed optimization means detailed measurement


Assignment

• Practice Problems
  – Practice Problem 5.1: 'What effect does the call swap(&xp, &xp) have?'
  – Practice Problem 5.3: 'Indicate the number of function calls in 3 fragments'