Page 1: Performance Issues: Algorithm

Performance Issues: Algorithm

Peng-Sheng Chen

Page 2: Performance Issues: Algorithm

Algorithm

A good algorithm solves a problem quickly and efficiently

A poor algorithm, no matter how well implemented, is never as fast as a good one

Algorithm performance can be evaluated by its computational complexity, expressed in Big-O notation

Page 3: Performance Issues: Algorithm

Computational Complexity (1)

Algorithm    Complexity          256 elements       1000 elements  10000 elements
Bubble Sort  n^2                 65,000 operations  1,000,000      100,000,000
QuickSort    n log n (average)   2,048              9,965          133,000
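The gap in the table can be observed by counting comparisons. The sketch below is only illustrative: `bubble_sort_count` and `library_sort_count` are hypothetical helper names, and the C library's `qsort` stands in for QuickSort even though the C standard does not say which algorithm it uses. The bubble sort has no early exit, so it performs n*(n-1)/2 comparisons, roughly half of the n^2 figure in the table.

```c
#include <stdlib.h>

static unsigned long cmp_count;  /* comparisons performed by the current sort */

/* Three-way integer comparison that also counts how often it is called. */
static int cmp_int(const void *p, const void *q)
{
    int a = *(const int *)p, b = *(const int *)q;
    cmp_count++;
    return (a > b) - (a < b);
}

/* Bubble sort without early exit: always n*(n-1)/2 comparisons. */
unsigned long bubble_sort_count(int *v, size_t n)
{
    cmp_count = 0;
    for (size_t i = 0; i + 1 < n; i++)
        for (size_t j = 0; j + 1 < n - i; j++)
            if (cmp_int(&v[j], &v[j + 1]) > 0) {
                int t = v[j];
                v[j] = v[j + 1];
                v[j + 1] = t;
            }
    return cmp_count;
}

/* Library sort: expected to need on the order of n log n comparisons. */
unsigned long library_sort_count(int *v, size_t n)
{
    cmp_count = 0;
    qsort(v, n, sizeof v[0], cmp_int);
    return cmp_count;
}
```

For 256 elements the bubble sort reports 32,640 comparisons; the library sort's count depends on the implementation but should be far smaller.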

Page 4: Performance Issues: Algorithm

Computational Complexity (2)

Computational complexity only accounts for loop iterations, not for all of the factors that affect an algorithm's performance

You need to consider additional factors when comparing algorithms, especially when their computational complexities are similar

Page 5: Performance Issues: Algorithm

Choice of Instructions (1)

The instructions needed to implement an algorithm have a big impact on performance

Ex: integer addition => 1 clock

Ex: integer division => 68 clocks

How to evaluate the speed of an instruction?

Instruction latency

Instruction throughput

Page 6: Performance Issues: Algorithm

Choice of Instructions (2)

Instruction latency

The number of clocks required to complete one instruction after the instruction’s inputs are ready and execution begins

Ex: integer multiplication => 9 clocks

Page 7: Performance Issues: Algorithm

Choice of Instructions (3)

Instruction throughput

The number of clocks that the processor must wait before it can start executing an identical instruction

Instruction throughput is always less than or equal to instruction latency

Instruction pipelining is what makes the two values differ

Ex: integer multiplication throughput => 4 clocks

A new multiply can begin execution every 4 clocks
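The relationship between latency and throughput can be put into a rough cost model. This is a minimal sketch that assumes a single pipelined execution unit and nothing else competing for it; the function names are hypothetical and real processors are more complicated.

```c
/* Clocks to finish n identical instructions that do not depend on each
   other: a new one starts every `throughput` clocks, and the last one
   still needs its full `latency` to complete. */
unsigned independent_clocks(unsigned n, unsigned latency, unsigned throughput)
{
    return n == 0 ? 0 : (n - 1) * throughput + latency;
}

/* Clocks when every instruction needs the previous result, so nothing
   overlaps and each one pays the full latency. */
unsigned dependent_clocks(unsigned n, unsigned latency)
{
    return n * latency;
}
```

With the multiply figures quoted on these slides (9-clock latency, 4-clock throughput), `independent_clocks(3, 9, 4)` gives 17 clocks while `dependent_clocks(3, 9)` gives 27.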

Page 8: Performance Issues: Algorithm

Algorithm Selection

Computational complexity

Take latency and throughput into account

Ex:

Algorithm 1: 10 additions

Algorithm 2: 1 divide

Algorithm 1 will be faster because a divide takes 60 times longer to execute than an addition

Page 9: Performance Issues: Algorithm

Example: Finding the Greatest Common Divisor (GCD)

Basic steps

Factor each number

Find the factors that are common to both numbers

Multiply the common factors together to get the greatest common divisor

Page 10: Performance Issues: Algorithm

Finding the Greatest Common Divisor

Ex: two numbers 40, 48

Basic steps

Factor each number

40 = 2 * 2 * 2 * 5

48 = 2 * 2 * 2 * 2 * 3

Find the common factors

2 * 2 * 2

Multiply the common factors to get the greatest common divisor

GCD = 2 * 2 * 2 = 8

Factoring takes a long time for large numbers
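The factoring approach can be sketched in a few lines. This version is an assumption about how one might code it, not the author's implementation: it folds the three steps into one trial-division loop, and `gcd_by_factoring` is a hypothetical name.

```c
/* GCD by trial division: every factor p that divides both numbers is
   multiplied into the result and divided out of both. Because smaller
   factors are removed first, only prime p can ever divide both values.
   Assumes a > 0 and b > 0. */
int gcd_by_factoring(int a, int b)
{
    int gcd = 1;
    for (int p = 2; p <= a && p <= b; p++) {
        while (a % p == 0 && b % p == 0) {
            gcd *= p;
            a /= p;
            b /= p;
        }
    }
    return gcd;
}
```

For 40 and 48 the loop strips the factor 2 three times and returns 8. Each step costs two modulo operations, which hints at why this route is slow.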

Page 11: Performance Issues: Algorithm

Euclid’s Algorithm for GCD (1)

Basic steps

1. Larger number = larger number - smaller number

2. If the numbers are the same, that value is the greatest common divisor; otherwise go to step 1

Page 12: Performance Issues: Algorithm

Euclid’s Algorithm for GCD (2)

Two numbers 48, 40

Basic steps

48, 40 → 48 - 40 = 8, giving 8, 40

8 != 40, so repeat step 1

8, 40 → 40 - 8 = 32, giving 8, 32

8 != 32, so repeat step 1

8, 32 → 32 - 8 = 24, giving 8, 24

8 != 24, so repeat step 1

8, 24 → 24 - 8 = 16, giving 8, 16

8 != 16, so repeat step 1

8, 16 → 16 - 8 = 8, giving 8, 8

8 = 8, so 8 is the GCD

Page 13: Performance Issues: Algorithm

Euclid’s Algorithm for GCD (3)

    int find_gcf(int a, int b)
    {
        /* assumes both a and b are greater than 0 */
        while (1) {
            if (a > b)
                a = a - b;      /* subtract the smaller from the larger */
            else if (a < b)
                b = b - a;
            else                /* they are equal */
                return a;
        }
    }

For a=48 and b=40: 5 compares, 14 branches, and 5 subtracts => a total of 24 instructions

Page 14: Performance Issues: Algorithm

Euclid’s Algorithm for GCD (4)

Variation of Euclid’s algorithm

    int find_gcd(int a, int b)
    {
        /* assumes both a and b are greater than 0 */
        while (1) {
            a = a % b;
            if (a == 0) return b;
            if (a == 1) return 1;
            b = b % a;
            if (b == 0) return a;
            if (b == 1) return 1;
        }
    }

For a=48 and b=40: 2 divides, 3 compares, 3 branches, 4 moves, and 2 cdq instructions => a total of 14 instructions

cdq => convert double-word to quad-word

Page 15: Performance Issues: Algorithm

A Variation of Euclid’s Algorithm for GCD

Two numbers a = 48, b = 40

Basic steps

a = 48 % 40 = 8

Test a equal to 0?

Test a equal to 1?

b = 40 % 8 = 0

Test b equal to 0? Yes, so a = 8 is the GCD

Is the modulo version faster than the subtraction version?

Page 16: Performance Issues: Algorithm

A Rough Comparison

Repetitive subtraction version

Instruction    Quantity  Latency  Total clocks
Subtractions   5         1        5
Compares       5         1        5
Branches       14        1        14
Other          0         1        0
Totals         24                 24

Modulo version

Instruction    Quantity  Latency  Total clocks
Modulo         2         68       136
Compares       3         1        3
Branches       3         1        3
Other          6         1        6
Totals         14                 148
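The totals in both tables are just quantity times latency, summed over the rows. A minimal sketch of that arithmetic follows; the `struct row` layout and the names are hypothetical, the numbers are copied from the tables, and overlap between instructions is ignored as it is in the tables.

```c
#include <stddef.h>

/* One table row: how many instructions of a kind are executed and the
   assumed latency of each one, in clocks. */
struct row { const char *name; unsigned quantity; unsigned latency; };

/* Sum quantity * latency over all rows. */
unsigned total_clocks(const struct row *rows, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += rows[i].quantity * rows[i].latency;
    return total;
}

static const struct row subtraction_version[] = {
    { "subtract", 5, 1 }, { "compare", 5, 1 }, { "branch", 14, 1 },
};

static const struct row modulo_version[] = {
    { "modulo", 2, 68 }, { "compare", 3, 1 }, { "branch", 3, 1 },
    { "other", 6, 1 },
};
```

Calling `total_clocks` on the two arrays reproduces the 24 and 148 clock totals, and makes it easy to try other latency assumptions.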

Page 17: Performance Issues: Algorithm

Which is Better?

From the previous comparison, the repetitive subtraction version is better for a = 48, b = 40

But the result depends on the inputs

Example: a = 1000, b = 1

Repetitive subtraction => 999 iterations (~5000 cycles)

Modulo => 1 iteration (~74 cycles)
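The iteration counts can be checked with two small counters. These are hypothetical helpers written for illustration; `modulo_iterations` uses the textbook remainder loop rather than the unrolled form shown earlier, which gives the same counts for the two input pairs discussed here.

```c
/* Number of subtractions the repetitive-subtraction loop performs.
   Assumes a > 0 and b > 0. */
unsigned long subtraction_iterations(unsigned a, unsigned b)
{
    unsigned long n = 0;
    while (a != b) {
        if (a > b)
            a -= b;
        else
            b -= a;
        n++;
    }
    return n;
}

/* Number of modulo operations performed until a remainder reaches 0.
   Assumes a > 0 and b > 0. */
unsigned long modulo_iterations(unsigned a, unsigned b)
{
    unsigned long n = 0;
    while (b != 0) {
        unsigned r = a % b;
        a = b;
        b = r;
        n++;
    }
    return n;
}
```

For a = 1000, b = 1 the counters return 999 and 1; for a = 48, b = 40 they return 5 and 2, matching the instruction counts quoted for each version.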

Page 18: Performance Issues: Algorithm

Blended Version

    int find_gcd(int a, int b)
    {
        /* assumes both a and b are greater than 0 */
        while (1) {
            if (a > (b*4)) {            /* a is much larger: use modulo */
                a = a % b;
                if (a == 0) return b;
                if (a == 1) return 1;
            } else if (a >= b) {        /* a is close to b: subtract */
                a = a - b;
                if (a == 0) return b;
                if (a == 1) return 1;
            }
            if (b > (a*4)) {            /* same choice with the roles swapped */
                b = b % a;
                if (b == 0) return a;
                if (b == 1) return 1;
            } else if (b >= a) {
                b = b - a;
                if (b == 0) return a;
                if (b == 1) return 1;
            }
        }
    }

Page 19: Performance Issues: Algorithm

Evaluation

Run for all combinations of values a and b in [1 .. 9999]

3.6 GHz Pentium 4

Repetitive Subtraction Version   Modulo Version   Blended Version
14.56 sec.                       18.55 sec.       12.14 sec.
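A measurement like this can be approximated with a small timing harness. The sketch below is an assumption about how such a run might be set up, not the harness used for the figures above: `time_all_pairs` is a hypothetical name, `gcd_subtract` and `gcd_modulo` are compact stand-ins for the versions shown earlier, and `clock()` measures CPU time with limited resolution.

```c
#include <time.h>

/* Compact stand-ins for two of the versions compared above.
   Both assume a > 0 and b > 0. */
int gcd_subtract(int a, int b)
{
    while (a != b) {
        if (a > b)
            a -= b;
        else
            b -= a;
    }
    return a;
}

int gcd_modulo(int a, int b)
{
    while (b != 0) {
        int r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Run gcd over every pair (a, b) in [1 .. limit] x [1 .. limit] and return
   the CPU time in seconds. The sum of all results is stored through
   `checksum` so the calls cannot be optimized away and so that two
   versions can be checked against each other. */
double time_all_pairs(int (*gcd)(int, int), int limit, unsigned long *checksum)
{
    unsigned long sum = 0;
    clock_t start = clock();
    for (int a = 1; a <= limit; a++)
        for (int b = 1; b <= limit; b++)
            sum += (unsigned long)gcd(a, b);
    *checksum = sum;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_all_pairs(gcd_subtract, 9999, &sum)` covers the same input range as the evaluation; absolute times will differ from the table because they depend on the processor and compiler.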

Page 20: Performance Issues: Algorithm

Data dependencies and Instruction Parallelism (1)

Data dependencies affect the processor’s ability to execute instructions simultaneously

Ex:

a = u * v

b = w * x

c = y * z

[Timing diagram: the three multiplies, starting at clock 0; 5-clock throughput, 15-clock latency]

Page 21: Performance Issues: Algorithm

Data dependencies and Instruction Parallelism (2)

Ex: a = w * x * y * z

[Timing diagram: w * x, y * z, then wx * yz, starting at clock 0; 5-clock throughput, 15-clock latency; wx * yz is held back by a data dependency]

Data dependencies and latencies limit instruction parallelism, which is a key limiting factor for algorithm performance
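The product of four values can be written so that the dependency chain is shorter. This is a sketch with hypothetical function names; an optimizing compiler may reassociate integer multiplications by itself, so the two forms are not guaranteed to compile differently.

```c
/* Serial chain: each multiply needs the previous product, so the three
   multiplies cannot overlap. */
int product_chain(int w, int x, int y, int z)
{
    return ((w * x) * y) * z;
}

/* Balanced tree: w*x and y*z do not depend on each other and can be in
   the pipeline at the same time; only the last multiply waits on both. */
int product_tree(int w, int x, int y, int z)
{
    int wx = w * x;
    int yz = y * z;
    return wx * yz;
}
```

Under the figures used on these slides (5-clock throughput, 15-clock latency) and a simple one-unit model, the chain pays three full latencies, 45 clocks, while the tree finishes in about 35 because y * z starts 5 clocks after w * x.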

Page 22: Performance Issues: Algorithm

Example

    a = 0;
    for (x = 0; x < 1000; x++)
        a += buffer[x];         /* every add waits on the previous one */

    a = b = c = d = 0;
    for (x = 0; x < 250; x++)
    {
        a += buffer[x];         /* four independent running sums */
        b += buffer[x+250];
        c += buffer[x+500];
        d += buffer[x+750];
    }
    a = a + b + c + d;

More arithmetic operations can occur on each clock due to fewer data dependencies
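The same idea can be packaged as a function that works for any length. This is a variant sketched under assumptions rather than the code above: it interleaves neighbouring elements instead of splitting the buffer into quarters, adds a tail loop for lengths that are not a multiple of four, and uses the hypothetical name `sum_four_accumulators`.

```c
#include <stddef.h>

/* Sum n ints with four independent accumulators so that consecutive adds
   do not depend on each other, then add any leftover elements. */
long sum_four_accumulators(const int *buffer, size_t n)
{
    long a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += buffer[i];
        b += buffer[i + 1];
        c += buffer[i + 2];
        d += buffer[i + 3];
    }
    for (; i < n; i++)          /* tail: fewer than four elements remain */
        a += buffer[i];
    return a + b + c + d;
}
```

Whether this beats the single-accumulator loop on a given machine depends on the compiler and processor, so it is worth measuring before adopting it.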

Page 23: Performance Issues: Algorithm

Memory Requirements

Fetching from main memory is among the slowest operations a processor performs

An algorithm that uses less memory is usually faster

Any benefit that an algorithm gains by using extra memory might be lost to the latency of the memory accesses involved

Treat memory accesses as high-latency instructions

Page 24: Performance Issues: Algorithm

Generality of Algorithms

Readily available algorithms often solve general problems

A specific problem can often be solved more efficiently than the more general problem

Example: determine whether a string is empty

strlen function => O(n)

Test whether the first element of the string is ‘\0’ => O(1)
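The two ways of testing for an empty string look like this in C. The function names are hypothetical; both assume `s` points to a valid NUL-terminated string.

```c
#include <stdbool.h>
#include <string.h>

/* General-purpose route: strlen walks the whole string, O(n). */
bool is_empty_strlen(const char *s)
{
    return strlen(s) == 0;
}

/* Tailored route: look only at the first character, O(1). */
bool is_empty_first_char(const char *s)
{
    return s[0] == '\0';
}
```

Both functions return the same answer for every string; only the amount of work differs, and the difference grows with the length of the string.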

Page 25: Performance Issues: Algorithm

Detecting Algorithm Issues

Use the call graph feature in the VTune analyzer

Optimizing the complete algorithm instead of the individual functions will result in more optimization opportunities and higher performance

Check CPI (cycles per instruction) by sampling

Clockticks event

Instructions Retired event
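CPI is the ratio of the two sampled events, Clockticks divided by Instructions Retired. A minimal sketch of the calculation, with a hypothetical function name and no connection to any VTune API:

```c
/* Cycles per instruction from two sampled event counts.
   Returns 0.0 when no instructions were retired, to avoid dividing by zero. */
double cycles_per_instruction(unsigned long long clockticks,
                              unsigned long long instructions_retired)
{
    if (instructions_retired == 0)
        return 0.0;
    return (double)clockticks / (double)instructions_retired;
}
```

For example, 3000 clockticks over 1500 retired instructions gives a CPI of 2.0. A higher CPI for the same code suggests the processor is spending more time stalled on latencies and dependencies.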

Page 26: Performance Issues: Algorithm

Key Points (1)

Selecting the right algorithm is critical to good performance

Computational complexity

Instruction selection

Memory accesses

Avoiding processor issues

Performance issues in algorithms

Instruction latency, instruction throughput, data dependencies, memory accesses

Page 27: Performance Issues: Algorithm

Key Points (2)

Keep data dependencies low enough that the processor can execute four or more operations at the same time

Choose algorithms that allow much of the computation to be done in parallel or with vector instructions

Tailor algorithms to the problem to eliminate inefficiency

Use VTune’s call graph analysis to detect an algorithm’s hotspots

Page 28: Performance Issues: Algorithm

Backup

Page 29: Performance Issues: Algorithm

Approximate Instruction Performance on Pentium 4

Instruction                                             Latency (clocks)  Throughput (clocks)
Addition, subtraction, increment, decrement, logic, …   0.5               0.5
Push, pop, rotate, shift, SIMD memory moves, …          1                 1
128-bit SIMD integer operations, …                      2                 2
Integer multiplication                                  15                4
Integer division, …                                     23                23
SIMD single-precision floating-point divide             32                32
Double-precision (64-bit) floating-point division       38                38

Memory operations may take much longer depending upon the state of the cache