CPU Performance
Enhancements
CS2052 Computer Architecture
Computer Science & Engineering
University of Moratuwa
Dilum Bandara
Pipelining – It’s Natural!
Laundry example
Amal, Bimal, Chamal, & Dinal (A, B, C, D) each have one load of clothes to wash, dry, & fold
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes
Sequential Laundry
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
[Figure: sequential laundry timeline from 6 PM to midnight. Loads A, B, C, D run one after another in task order; each takes 30 + 40 + 20 minutes.]
Pipelined Laundry – Start Work ASAP
Pipelined laundry takes 3.5 hours for 4 loads
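The 6-hour and 3.5-hour figures can be checked with a short simulation. This is a minimal sketch, assuming each stage handles one load at a time and a finished load may wait until the next stage is free; the names `sequential_minutes` and `pipelined_minutes` are illustrative and not from the slides.

```python
# Stage times in minutes, taken from the laundry example: wash, dry, fold.
STAGES = [30, 40, 20]
LOADS = 4


def sequential_minutes(stages, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """Simulate a pipeline where each stage serves one load at a time.

    finish[s] holds the time at which stage s finished its latest load.
    A load enters stage s once it has left stage s-1 and stage s is free.
    """
    finish = [0] * len(stages)
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


seq = sequential_minutes(STAGES, LOADS)
pipe = pipelined_minutes(STAGES, LOADS)
print(f"Sequential: {seq} min = {seq / 60} h")
print(f"Pipelined:  {pipe} min = {pipe / 60} h")
```

The pipelined total is the 30-minute first wash, four back-to-back 40-minute dryer runs, and the final 20-minute fold.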
[Figure: pipelined laundry timeline starting at 6 PM. Loads A, B, C, D overlap in task order; the intervals along the time axis are 30, 40, 40, 40, 40, 20 minutes.]
Pipelining Lessons
Pipelining doesn’t reduce the latency of a single task
It improves the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously
Potential speedup = number of pipeline stages
Unbalanced lengths of pipeline stages reduce speedup
Time to fill the pipeline & time to drain/flush it reduce speedup
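A short calculation makes these lessons concrete for the laundry stages. It is a sketch under the assumption that, once the pipeline is full, one task completes per slowest-stage interval, so n tasks take `sum(stages) + (n - 1) * max(stages)`. The helper name `pipeline_speedup` and the balanced stage times are hypothetical, chosen only for comparison.

```python
def pipeline_speedup(stages, tasks):
    """Speedup of pipelined over sequential execution for `tasks` tasks."""
    sequential = tasks * sum(stages)
    # The first task fills the pipeline; afterwards one task completes
    # per slowest-stage interval.
    pipelined = sum(stages) + (tasks - 1) * max(stages)
    return sequential / pipelined


unbalanced = [30, 40, 20]   # laundry stages from the example
balanced = [30, 30, 30]     # same total work, evenly split (hypothetical)

for n in (1, 4, 100, 10_000):
    print(n,
          round(pipeline_speedup(unbalanced, n), 3),
          round(pipeline_speedup(balanced, n), 3))
```

With a single task the speedup is 1, since latency is not reduced. As the number of tasks grows, the balanced pipeline approaches 3 (the number of stages), while the unbalanced one approaches only 90/40 = 2.25. Fill and drain time keeps both below their limits for small workloads.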
[Figure: pipelined laundry timeline from 6 PM. Loads A, B, C, D in task order; intervals of 30, 40, 40, 40, 40, 20 minutes along the time axis.]
Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm
Instruction Level Parallelism (ILP)
CPU Pipelines
5-stage MIPS pipeline
Source: http://en.wikipedia.org/wiki/Classic_RISC_pipeline
Pipeline With a Branch Penalty Due to a Taken Branch
Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm
Superscalar Architectures
Executes more than 1 instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units
Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm
Intel Hyper Threading (HT)
Introduced with Intel Pentium 4
Allows 2 different resources of a CPU to be used at the same time
While the 1st thread (instruction stream) is working with integers (integer execution unit), the 2nd thread can work on floating point numbers (floating point unit)
OS sees 2 logical CPUs
Achieved through a mix of shared, replicated, & partitioned chip resources such as:
Registers
Arithmetic units
Cache memory
Amdahl’s Law
What’s the maximum expected improvement to an overall system when only part of it is improved?
Amdahl said this relationship is not linear
Amdahl’s Law (Cont.)
Best you could ever hope to do:
$$\text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$
Amdahl’s Law – Example
Floating point instructions improved to run 2X, but only 10% of actual instructions are FP
$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{\text{old}}$$
$$\text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053$$
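The example can be reproduced with a few lines of code. This is a minimal sketch, assuming the 10% figure is the fraction of execution time spent in FP instructions; the function names `amdahl_speedup` and `amdahl_max_speedup` are illustrative.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the execution time runs
    `speedup_enhanced` times faster and the rest is unchanged."""
    new_time = (1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1 / new_time


def amdahl_max_speedup(fraction_enhanced):
    """Upper bound: the enhanced part takes no time at all."""
    return 1 / (1 - fraction_enhanced)


print(round(amdahl_speedup(0.10, 2), 3))    # FP example from the slide
print(round(amdahl_max_speedup(0.10), 3))   # limit for a 10% fraction
```

Even an infinitely fast FP unit would give only about 1.11X overall here, because the other 90% of the execution time is untouched.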
Moore’s Law – Today’s Status
Moore’s Law – No of transistors on a chip tends to double about every 2 years
Transistor count still rising
Clock speed flattening sharply
Source: www.extremetech.com/wp-content/uploads/2012/02/CPU-Scaling.jpg
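To get a feel for what doubling every 2 years implies, the growth factor can be computed directly. This is an idealized sketch of the doubling rule only, not measured transistor data; the function name `transistor_growth` is illustrative.

```python
def transistor_growth(years, doubling_period_years=2.0):
    """Growth factor after `years` if the count doubles every period."""
    return 2 ** (years / doubling_period_years)


for years in (2, 10, 20):
    print(years, transistor_growth(years))
```

Under this rule, the transistor count grows about 32 times in 10 years and about 1024 times in 20 years.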
Dual Core
Introduced by IBM Power4
However, AMD brought it to the consumer market
Combines 2 independent CPUs & their respective caches onto a single silicon chip
Provides better performance improvement than HT
True parallelism
Multi-Core
Source: www.anandtech.com/show/5174/why-ivy-bridge-is-still-quad-core
Multi-Core (Cont.)
Source: www.legitreviews.com/intel-core-i7-4770k-haswell-3-5ghz-quad-core-cpu-review_2203
Multi-Core (Cont.)
Source: www.hardwarecanucks.com/news/cpu/intel-launch-8-core-xeon-nehalemex/
Multi-Cores + Hyper Threading
Source: www.notebookcheck.net/Intel-Core-i7-Notebook-Processor-Clarksfield.21025.0.html
NVIDIA Tesla 2070
Many-Cores
GPUs
Graphics Processing Unit
NVIDIA & ATI
SIMD – Single Instruction Multiple Data
Intel Xeon Phi
General purpose
Intel Xeon Phi
Example Specifications
| Specification | GTX 480 | Tesla 2070 | Tesla K80 |
|---|---|---|---|
| Peak double precision FP performance | 650 Gigaflops | 515 Gigaflops | 2.91 Teraflops |
| Peak single precision FP performance | 1.3 Teraflops | 1.03 Teraflops | 8.74 Teraflops |
| CUDA cores | 480 | 448 | 4992 |
| Frequency of CUDA cores | 1.40 GHz | 1.15 GHz | 560/875 MHz |
| Memory size (GDDR5) | 1536 MB | 6 GB | 24 GB |
| Memory bandwidth | 177.4 GB/sec | 150 GB/sec | 480 GB/sec |
| ECC memory | No | Yes | Yes |
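The peak single precision figures above are consistent with a common rule of thumb: cores × clock × 2, assuming each CUDA core can complete one fused multiply-add (2 floating point operations) per cycle. The sketch below applies that rule to the listed values. For the Tesla K80 it assumes the higher 875 MHz clock, and the function name `peak_sp_gflops` is illustrative.

```python
def peak_sp_gflops(cuda_cores, clock_ghz, flops_per_cycle=2):
    """Theoretical peak single precision throughput in Gigaflops."""
    return cuda_cores * clock_ghz * flops_per_cycle


# (CUDA cores, clock in GHz) taken from the specification list above.
gpus = {
    "GTX 480": (480, 1.40),
    "Tesla 2070": (448, 1.15),
    "Tesla K80": (4992, 0.875),  # assumes the 875 MHz boost clock
}

for name, (cores, ghz) in gpus.items():
    print(f"{name}: {peak_sp_gflops(cores, ghz):.0f} Gigaflops")
```

These are theoretical peaks; sustained performance depends on memory bandwidth and how well a workload keeps all cores busy.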
CPU vs. GPU Architecture
GPU devotes more transistors to computation
Multithreaded SIMD Processor
Source: Computer Architecture by John L. Hennessy and David A. Patterson
NVIDIA CUDA Architecture
Intel Xeon Phi
Source: www.pcgameshardware.de/Xeon-Phi-Hardware-256199/News/Intel-Xeon-Phi-Hardware-Informationen-1040924/
Intel Xeon Phi (Cont.)
Source: www.altera.com/technology/system-design/articles/2012/multicore-many-core.html
Power Consumption
Dynamic energy
Transistor switches from 0 → 1 or 1 → 0
$\frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2$
Dynamic power
$\frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$
Static power consumption
$\text{Current}_{\text{static}} \times \text{Voltage}$
Scales with no of transistors
Reducing voltage reduces energy
Reducing clock rate reduces power, not energy
Use power gating (cutting the supply) rather than only taking out the clock signal
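The formulas above can be turned into a small calculation that shows why lowering the clock rate cuts power but not energy. This is a sketch with hypothetical values for capacitive load, voltage, frequency, and the number of transitions a task needs. It covers dynamic energy only; static power, which is drawn for as long as the task runs, is left out of the comparison.

```python
import math


def dynamic_energy(cap_load, voltage):
    """Energy per 0->1 or 1->0 transition: 1/2 * C * V^2 (joules)."""
    return 0.5 * cap_load * voltage ** 2


def dynamic_power(cap_load, voltage, freq_switched):
    """Dynamic power: 1/2 * C * V^2 * f (watts)."""
    return dynamic_energy(cap_load, voltage) * freq_switched


def static_power(static_current, voltage):
    """Static power: I_static * V (watts)."""
    return static_current * voltage


# Hypothetical values chosen only for illustration.
C, V, F = 1e-9, 1.0, 2e9
TRANSITIONS = 1e9  # fixed amount of switching a task requires


def task_energy(freq, voltage=V):
    """Dynamic energy for the task = power * run time."""
    run_time = TRANSITIONS / freq
    return dynamic_power(C, voltage, freq) * run_time


print("Power at F and F/2:", dynamic_power(C, V, F), dynamic_power(C, V, F / 2))
print("Task energy at F and F/2:", task_energy(F), task_energy(F / 2))
print("Task energy at 0.85 V:", task_energy(F, voltage=0.85 * V))
```

Halving the clock halves the power, but the task runs twice as long, so its dynamic energy is unchanged. Lowering the voltage to 85% reduces energy to about 72% because of the squared term.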