Performance Issues: Algorithm
Peng-Sheng Chen
Algorithm
A good algorithm solves a problem in a fast and efficient manner
A poor algorithm, no matter how well implemented, is never as fast
Algorithm performance can be evaluated by its computational complexity, expressed in Big-O notation
Computational Complexity (1)
Algorithm     Complexity           256 elements        1,000 elements   10,000 elements
Bubble Sort   n^2                  65,000 operations   1,000,000        100,000,000
QuickSort     n log n (average)    2,048               9,965            133,000
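The gap in the table can be observed by counting comparisons directly. The sketch below is only an illustration: `bubble_sort_compares`, `library_sort_compares`, `counting_compare` and `compare_count` are hypothetical names, and the C library's `qsort` stands in for QuickSort even though the C standard does not require it to be a quicksort. An unoptimized bubble sort performs exactly n(n-1)/2 comparisons, about half of n^2, which grows at the same rate as the table's estimate.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical global counter, incremented once per element comparison. */
static unsigned long compare_count = 0;

/* Comparator for qsort that also counts how often it is called. */
static int counting_compare(const void *lhs, const void *rhs)
{
    int a = *(const int *)lhs;
    int b = *(const int *)rhs;
    compare_count++;
    return (a > b) - (a < b);
}

/* Unoptimized bubble sort; returns the number of comparisons, n*(n-1)/2. */
unsigned long bubble_sort_compares(int *v, size_t n)
{
    unsigned long compares = 0;
    for (size_t pass = 0; pass + 1 < n; pass++) {
        for (size_t i = 0; i + 1 < n - pass; i++) {
            compares++;
            if (v[i] > v[i + 1]) {
                int tmp = v[i];
                v[i] = v[i + 1];
                v[i + 1] = tmp;
            }
        }
    }
    return compares;
}

/* Sorts with the C library's qsort and returns the comparisons it made. */
unsigned long library_sort_compares(int *v, size_t n)
{
    compare_count = 0;
    qsort(v, n, sizeof v[0], counting_compare);
    return compare_count;
}
```

Calling both functions on copies of the same 256-element array gives 32,640 comparisons for the bubble sort; the library sort's count depends on the implementation but should be far smaller.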
Computational Complexity (2)
Computational complexity only takes into account
loop iterations, and not all of the factors affecting an
algorithm’s performance
You need to consider additional factors to compare
algorithms, especially when their computational
complexities are similar
Choice of Instructions (1)
The instructions needed to implement an algorithm have a big impact on performance
Ex: integer addition => 1 clock
Ex: integer division => 68 clocks
How to evaluate the speed of an instruction?
Instruction latency
Instruction throughput
Choice of Instructions (2)
Instruction latency
The number of clocks required to complete one
instruction after the instruction’s inputs are ready
and execution begins
Ex: integer multiplication => 9 clocks
Choice of Instructions (3)
Instruction throughput
The number of clocks that the processor is
required to wait before starting the execution of
an identical instruction
Instruction throughput is always less than or
equal to instruction latency
Instruction pipelining causes the difference
Ex: integer multiplication throughput => 4 clocks
A new multiply can begin execution every 4
clocks
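The difference between latency and throughput can be worked through with a small model. This is a simplified sketch, not a description of any real pipeline: it assumes a single execution unit, identical instructions, and the slide's example figures (9-clock latency, 4-clock throughput for integer multiplication). The function names are hypothetical.

```c
/* Clocks until the last of n independent, identical instructions completes,
   assuming a new one can start every `throughput` clocks and each one
   takes `latency` clocks from start to result. */
unsigned pipelined_clocks(unsigned n, unsigned latency, unsigned throughput)
{
    if (n == 0)
        return 0;
    return (n - 1) * throughput + latency;
}

/* Clocks when every instruction must wait for the previous result,
   so nothing overlaps. */
unsigned serialized_clocks(unsigned n, unsigned latency)
{
    return n * latency;
}
```

Under these assumptions, three independent multiplies finish after 2 * 4 + 9 = 17 clocks, while three multiplies that each depend on the previous result take 3 * 9 = 27 clocks.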
Algorithm Selection
Computational complexity
Take latency and throughput into account
Ex:
Algorithm 1
10 additions
Algorithm 2
1 divide
Algorithm 1 will be faster because a divide takes more than
60 times longer to execute than an addition
Example: Finding the Greatest Common Divisor
Basic steps
Factor each number
Find the factors that are common between
both numbers
Multiply the common factors together to get
the greatest common divisor
Finding the Greatest Common Divisor
Ex: two numbers 40, 48
Basic steps
Factor each number
40 = 2 * 2 * 2 * 5
48 = 2 * 2 * 2 * 2 * 3
Find the common factors
2 * 2 * 2
Multiply the common factors to get the greatest common divisor
GCD = 2 * 2 * 2 = 8
Factoring takes a long time, especially for large numbers
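The factoring approach can be sketched in C as follows. This is a condensed variant rather than a literal transcription of the steps: instead of building two complete factor lists and intersecting them, it uses trial division to strip out each factor the two numbers share and multiplies those together. The name `gcd_by_factoring` is hypothetical, and both inputs are assumed to be greater than 0.

```c
/* GCD by shared prime factors, found with trial division.
   Assumes a > 0 and b > 0. */
int gcd_by_factoring(int a, int b)
{
    int gcd = 1;
    for (int p = 2; p <= a && p <= b; p++) {
        /* Each time p divides both numbers, it is a common factor.
           Composite p never gets here: its prime factors were
           already divided out of at least one of the numbers. */
        while (a % p == 0 && b % p == 0) {
            gcd *= p;
            a /= p;
            b /= p;
        }
    }
    return gcd;
}
```

For 40 and 48 the inner loop fires three times with p = 2, giving 2 * 2 * 2 = 8. Every candidate p costs two divisions, which is why this approach is slow.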
Euclid’s Algorithm for GCD (1)
Basic steps
1. Larger number = larger number - smaller number
2. If the numbers are the same, that number is the greatest
common divisor; otherwise go to step 1
Euclid’s Algorithm for GCD (2)
Two numbers 48, 40
Basic steps
48, 40 → 48 - 40 = 8 → (8, 40)
8 != 40, so repeat step 1
8, 40 → 40 - 8 = 32 → (32, 8)
8 != 32, so repeat step 1
8, 32 → 32 - 8 = 24 → (24, 8)
8 != 24, so repeat step 1
8, 24 → 24 - 8 = 16 → (16, 8)
8 != 16, so repeat step 1
8, 16 → 16 - 8 = 8 → (8, 8)
8 = 8, so 8 is the GCD
Euclid’s Algorithm for GCD (3)
int find_gcf(int a, int b)
{
    /* assumes both a and b are greater than 0 */
    while (1) {
        if (a > b)
            a = a - b;
        else if (a < b)
            b = b - a;
        else /* they are equal */
            return a;
    }
}
For a=48, and b=40,
5 compares, 14 branches, and 5 subtracts => a total of 24
instructions
Euclid’s Algorithm for GCD (4)
int find_gcd(int a, int b)
{
    /* assumes both a and b are greater than 0 */
    while (1) {
        a = a % b;
        if (a == 0) return b;
        if (a == 1) return 1;
        b = b % a;
        if (b == 0) return a;
        if (b == 1) return 1;
    }
}
Variation of Euclid’s algorithm
For a=48, and b=40,
2 divides, 3 compares, 3 branches, 4 moves, and 2 cdq
instructions => a total of 14 instructions
cdq => convert double-word to quad-word
A Variation of Euclid’s Algorithm for GCD
Two numbers a = 48, b = 40
Basic steps
a = 48 % 40 = 8
Test a equal to 0?
Test a equal to 1?
b = 40 % 8 = 0
Test b equal to 0?
Is the modulo version faster than the subtraction version?
A Rough Comparison
Repetitive subtraction version
Instruction     Quantity   Latency   Total clocks
Subtractions    5          1         5
Compares        5          1         5
Branches        14         1         14
Other           0          1         0
Totals          24                   24
Modulo version
Instruction     Quantity   Latency   Total clocks
Modulo          2          68        136
Compares        3          1         3
Branches        3          1         3
Other           6          1         6
Totals          14                   148
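The totals in the two tables are just the sum of quantity times latency. A minimal sketch of that arithmetic, using hypothetical names (`instruction_cost`, `total_clocks`) and the figures from the comparison for a = 48, b = 40:

```c
#include <stddef.h>

/* One row of the rough comparison: how many instructions, at what latency. */
struct instruction_cost {
    const char *name;
    unsigned quantity;
    unsigned latency;   /* clocks per instruction */
};

/* Sums quantity * latency over all rows. */
unsigned total_clocks(const struct instruction_cost *rows, size_t count)
{
    unsigned total = 0;
    for (size_t i = 0; i < count; i++)
        total += rows[i].quantity * rows[i].latency;
    return total;
}

/* Rows taken from the repetitive subtraction table. */
static const struct instruction_cost subtraction_version[] = {
    { "subtract", 5, 1 }, { "compare", 5, 1 }, { "branch", 14, 1 },
};

/* Rows taken from the modulo table. */
static const struct instruction_cost modulo_version[] = {
    { "modulo", 2, 68 }, { "compare", 3, 1 },
    { "branch", 3, 1 }, { "other", 6, 1 },
};
```

`total_clocks(subtraction_version, 3)` yields 24 and `total_clocks(modulo_version, 4)` yields 148. The model ignores overlap between instructions, so it is only a rough estimate.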
Which is Better?
From the previous comparison
The repetitive subtraction version is better for a = 48, b = 40
Counterexample: a = 1000, b = 1
Repetitive subtraction => 999 iterations
(~5000 cycles)
Modulo => 1 iteration (~74 cycles)
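How strongly the answer depends on the inputs can be checked by counting the work each version does. The helpers below are a sketch with hypothetical names (`subtraction_steps`, `modulo_steps`); they mirror the two loops but return the number of subtract or modulo operations instead of the GCD, and assume both inputs are greater than 0.

```c
/* Number of subtractions the repetitive-subtraction loop performs.
   Assumes a > 0 and b > 0. */
unsigned long subtraction_steps(int a, int b)
{
    unsigned long steps = 0;
    while (a != b) {
        if (a > b)
            a -= b;
        else
            b -= a;
        steps++;
    }
    return steps;
}

/* Number of modulo operations the modulo loop performs.
   Assumes a > 0 and b > 0. */
unsigned long modulo_steps(int a, int b)
{
    unsigned long steps = 0;
    for (;;) {
        a %= b;
        steps++;
        if (a == 0 || a == 1)
            return steps;
        b %= a;
        steps++;
        if (b == 0 || b == 1)
            return steps;
    }
}
```

For (48, 40) the counts are 5 subtractions against 2 modulo operations; for (1000, 1) they are 999 subtractions against a single modulo operation.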
Blended Version
int find_gcd(int a, int b)
{
    /* assumes both a and b are greater than 0 */
    /* also assumes a*4 and b*4 do not overflow int */
    while (1) {
        if (a > (b*4)) {
            a = a % b;
            if (a == 0) return b;
            if (a == 1) return 1;
        } else if (a >= b) {
            a = a - b;
            if (a == 0) return b;
            if (a == 1) return 1;
        }
        if (b > (a*4)) {
            b = b % a;
            if (b == 0) return a;
            if (b == 1) return 1;
        } else if (b >= a) {
            b = b - a;
            if (b == 0) return a;
            if (b == 1) return 1;
        }
    }
}
Evaluation
Run for all combinations of values a and b in
[1 .. 9999]
3.6-GHz Pentium 4
Repetitive Subtraction Version   Modulo Version   Blended Version
14.56 sec.                       18.55 sec.       12.14 sec.
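A measurement of this kind could be set up roughly as follows. This is a sketch of one possible harness, not the one used for the figures above: `sum_all_pairs`, `time_all_pairs` and the stand-in `gcd_modulo` are hypothetical names, and timing uses the standard `clock()` function, which reports CPU time at limited resolution.

```c
#include <time.h>

/* Stand-in GCD so the sketch compiles on its own; any of the
   versions under test can be passed in its place. */
static int gcd_modulo(int a, int b)
{
    while (b != 0) {
        int r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Calls gcd_fn for every pair in [1 .. limit] and returns the sum of the
   results, so the compiler cannot discard the calls. */
unsigned long long sum_all_pairs(int (*gcd_fn)(int, int), int limit)
{
    unsigned long long sum = 0;
    for (int a = 1; a <= limit; a++)
        for (int b = 1; b <= limit; b++)
            sum += (unsigned long long)gcd_fn(a, b);
    return sum;
}

/* CPU seconds spent in one sweep; the checksum is returned via *checksum. */
double time_all_pairs(int (*gcd_fn)(int, int), int limit,
                      unsigned long long *checksum)
{
    clock_t start = clock();
    *checksum = sum_all_pairs(gcd_fn, limit);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_all_pairs` with `limit` set to 9999 for each version covers the same range of inputs. Comparing the checksums across versions also confirms that they compute the same results.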
Data dependencies and Instruction Parallelism (1)
Data dependencies affect the processor’s ability to execute instructions simultaneously
Ex:
a = u * v
b = w * x
c = y * z
[Figure: pipeline timing diagram starting at clock 0, with a 5-clock throughput and a 15-clock latency]
Data dependencies and Instruction Parallelism (2)
Ex: a = w * x * y * z
w * x
y * z
wx * yz
[Figure: pipeline timing diagram starting at clock 0, with a 5-clock throughput, a 15-clock latency, and a data dependency before wx * yz]
Instruction parallelism, limited by data dependencies and
latencies, is a key limiting factor in algorithm
performance
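The effect of the dependency can be estimated with a toy issue model. The sketch below assumes a single multiplier with the slide's 15-clock latency and 5-clock throughput, and that all four inputs are ready at clock 0; the struct and function names are hypothetical, and real processors are considerably more complicated.

```c
/* Toy model of one multiplier: a multiply may start once its inputs are
   ready and `throughput` clocks after the previous multiply started;
   its result is ready `latency` clocks after it starts. */
struct multiplier {
    unsigned latency;
    unsigned throughput;
    unsigned next_issue;   /* earliest clock the next multiply may start */
};

static unsigned max_u(unsigned x, unsigned y) { return x > y ? x : y; }

/* Issues one multiply whose inputs become ready at the given clocks and
   returns the clock at which its result is ready. */
unsigned issue_multiply(struct multiplier *m, unsigned ready_a, unsigned ready_b)
{
    unsigned start = max_u(m->next_issue, max_u(ready_a, ready_b));
    m->next_issue = start + m->throughput;
    return start + m->latency;
}

/* a = (w * x) * (y * z): the first two multiplies are independent. */
unsigned clocks_paired(unsigned latency, unsigned throughput)
{
    struct multiplier m = { latency, throughput, 0 };
    unsigned wx = issue_multiply(&m, 0, 0);
    unsigned yz = issue_multiply(&m, 0, 0);
    return issue_multiply(&m, wx, yz);
}

/* a = ((w * x) * y) * z: every multiply waits for the previous one. */
unsigned clocks_chained(unsigned latency, unsigned throughput)
{
    struct multiplier m = { latency, throughput, 0 };
    unsigned wx = issue_multiply(&m, 0, 0);
    unsigned wxy = issue_multiply(&m, wx, 0);
    return issue_multiply(&m, wxy, 0);
}
```

With latency 15 and throughput 5, the paired form finishes at clock 35 (w * x at 15, y * z at 20, then the final multiply), while the fully chained form needs 45 clocks.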
Example
a = 0;
for (x = 0; x < 1000; x ++)
a += buffer[x];
a = b = c = d = 0;
for (x = 0; x < 250; x ++)
{
a += buffer[x];
b += buffer[x+250];
c += buffer[x+500];
d += buffer[x+750];
}
a = a + b + c + d;
More arithmetic operations can occur on each
clock due to fewer data dependencies
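The two loops can be wrapped into complete functions for comparison. This is a sketch with hypothetical names (`sum_single`, `sum_four`); unlike the fragment above, it takes the length as a parameter and handles lengths that are not a multiple of four. Whether the second form is actually faster depends on the compiler and processor, so it is worth measuring.

```c
#include <stddef.h>

/* One accumulator: every addition depends on the previous one. */
long sum_single(const int *buffer, size_t n)
{
    long a = 0;
    for (size_t x = 0; x < n; x++)
        a += buffer[x];
    return a;
}

/* Four accumulators: the four additions in each iteration are
   independent of one another. */
long sum_four(const int *buffer, size_t n)
{
    size_t quarter = n / 4;
    long a = 0, b = 0, c = 0, d = 0;
    for (size_t x = 0; x < quarter; x++) {
        a += buffer[x];
        b += buffer[x + quarter];
        c += buffer[x + 2 * quarter];
        d += buffer[x + 3 * quarter];
    }
    for (size_t x = 4 * quarter; x < n; x++)   /* leftover elements */
        a += buffer[x];
    return a + b + c + d;
}
```

Both functions return the same sum for any input; only the dependency structure of the additions differs.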
Memory Requirements
Fetching from main memory is among the slowest
operations a processor performs
An algorithm that uses smaller amounts of memory
is usually faster
Any benefit that an algorithm gains by using extra
memory might be lost due to the speed of the
memory accesses involved
Treat memory accesses as high-latency instructions
Generality of Algorithms
Readily available algorithms often solve general
problems
It may be that the specific problem can be solved
more efficiently than solving a more general problem
Example: determine whether the string is empty
strlen function => O(n)
Test the first element of the string to see if it is '\0'
or not => O(1)
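The two approaches to the empty-string test look like this in C. The function names are hypothetical. An optimizing compiler may turn the `strlen` form into the single-character test on its own, so the difference is clearest in unoptimized builds or when the general routine cannot be seen through.

```c
#include <stdbool.h>
#include <string.h>

/* General approach: strlen walks the whole string, O(n). */
bool is_empty_strlen(const char *s)
{
    return strlen(s) == 0;
}

/* Specific approach: only the first character is examined, O(1). */
bool is_empty_first_char(const char *s)
{
    return s[0] == '\0';
}
```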
Detecting Algorithm Issues
Use the call graph feature in the VTune analyzer
Optimizing the complete algorithm instead of the
individual functions will result in more optimization
opportunities and higher performance
Check CPI by sampling
Clockticks event
Instructions Retired event
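CPI (clocks per instruction) is the ratio of the two sampled event counts. The helper below only shows the arithmetic: it is not part of any VTune API, the name `cycles_per_instruction` is hypothetical, and the two counts are assumed to have been read from the profiler's report.

```c
/* Clocks per instruction from two sampled event counts.
   Returns 0.0 when no instructions were retired. */
double cycles_per_instruction(unsigned long long clockticks,
                              unsigned long long instructions_retired)
{
    if (instructions_retired == 0)
        return 0.0;
    return (double)clockticks / (double)instructions_retired;
}
```

For example, 3,000 clockticks over 1,500 retired instructions gives a CPI of 2.0. A high CPI in a hot function suggests time lost to long-latency instructions, data dependencies, or memory accesses.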
Key Points (1)
Selecting the right algorithm is absolutely critical to great performance
Computational complexity
Instruction selections
Memory accesses
Avoiding processor issues
Algorithm performance issues
Instruction latency, instruction throughput, data dependencies, memory accesses
Key Points (2)
Keep data dependencies low enough that the
processor is able to execute four or more
operations at the same time
Choose algorithms that allow much of the
computation to be done in parallel or to be done
using vector instructions
Tailor algorithms to the problem to eliminate
inefficiency
Use VTune’s call graph analysis to detect an
algorithm’s hotspots
Backup
Approximate Instruction Performance on Pentium 4
Instruction                                              Latency (clocks)   Throughput (clocks)
Addition, subtraction, increment, decrement, logic, …    0.5                0.5
Push, pop, rotate, shift, SIMD memory moves, …           1                  1
128-bit SIMD integer operations, …                       2                  2
Integer multiplication                                   15                 4
Integer division, …                                      23                 23
SIMD single-precision floating-point divide              32                 32
Double-precision (64-bit) floating-point division        38                 38
Memory operations may take much longer depending upon the state of the cache