CPU Performance Enhancements

CS2052 Computer Architecture

Computer Science & Engineering

University of Moratuwa

Dilum [email protected]

Pipelining – It’s Natural!

Laundry example

Amal, Bimal, Chamal, & Dinal each have one load of clothes to wash, dry, & fold

Washer takes 30 minutes

Dryer takes 40 minutes

Folder takes 20 minutes

Sequential Laundry

Sequential laundry takes 6 hours for 4 loads

If they learned pipelining, how long would laundry take?

[Figure: sequential laundry timeline, time axis from 6 PM to midnight, loads A, B, C, D in task order; each load takes 30 + 40 + 20 minutes, one load after another]

Pipelined Laundry – Start Work ASAP

Pipelined laundry takes 3.5 hours for 4 loads
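The 6-hour and 3.5-hour figures can be checked with a short simulation. This is a minimal sketch, assuming a load may wait between machines (for example, in a basket) until the next machine is free; the function names are illustrative and not from the slides.

```python
def sequential_minutes(stages, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """A load enters a stage once it has left the previous stage
    and the previous load has vacated this stage."""
    finish = [0] * len(stages)  # finish[s]: time the latest load left stage s
    for _ in range(loads):
        ready = 0  # time this load left its previous stage
        for s, duration in enumerate(stages):
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


stages = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_minutes(stages, 4) / 60)  # 6.0 hours
print(pipelined_minutes(stages, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: the 40-minute dryer is the stage every load has to queue for.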

[Figure: pipelined laundry timeline, time axis from 6 PM to midnight, loads A, B, C, D in task order; stage times 30, 40, 40, 40, 40, 20 minutes]

Pipelining Lessons

Pipelining doesn’t reduce latency of a single task

Improves throughput of entire workload

Pipeline rate limited by slowest pipeline stage

Multiple tasks operating simultaneously

Potential speedup = number of pipe stages

Unbalanced lengths of pipe stages reduce speedup

Time to fill pipeline & time to drain/flush it reduce speedup
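The speedup lessons above can be made concrete with a little arithmetic. This is a minimal sketch under idealized assumptions (no stalls, one task entering per step); the function names are illustrative.

```python
def balanced_speedup(stages, tasks):
    """Speedup of a balanced pipeline over unpipelined execution.

    Unpipelined: tasks * stages time units.
    Pipelined:   stages units for the first task, then one task
                 completes per unit (fill/drain cost = stages - 1).
    """
    return (tasks * stages) / (stages + tasks - 1)


def unbalanced_limit(stage_times):
    """Best-case speedup for many tasks when stages are unequal:
    one task completes every max(stage_times) units."""
    return sum(stage_times) / max(stage_times)


# Fill and drain time matter less as the number of tasks grows.
for tasks in (1, 4, 100, 10_000):
    print(tasks, round(balanced_speedup(5, tasks), 2))

# Laundry stages (30, 40, 20) can never reach a speedup of 3.
print(unbalanced_limit([30, 40, 20]))  # 2.25
```

With 5 balanced stages the speedup approaches 5 only for long runs of tasks; the unbalanced laundry pipeline is capped at 90 / 40 = 2.25 rather than 3.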

[Figure: pipelined laundry timeline, time axis from 6 PM to 9 PM, loads A, B, C, D in task order; stage times 30, 40, 40, 40, 40, 20 minutes]

Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm

Instruction Level Parallelism (ILP)

CPU Pipelines

Source: http://en.wikipedia.org/wiki/Classic_RISC_pipeline

5-stage MIPS pipeline
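How instructions overlap in such a pipeline can be sketched as a small text diagram. The stage names IF, ID, EX, MEM, and WB are the usual names for the classic RISC stages and are assumed here, as they do not appear on the slide; the sketch also assumes no stalls or hazards.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # assumed classic RISC stage names


def pipeline_diagram(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c."""
    total_cycles = len(STAGES) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_diagram(4), start=1):
    print(f"I{i}: " + " ".join(f"{cell:<3}" for cell in row))
```

Four instructions complete in 5 + 4 - 1 = 8 cycles instead of the 20 cycles needed without overlap.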

Pipeline With a Branch Penalty Due to a Taken Branch

Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm
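The cost of taken branches is often summarized as extra cycles per instruction (CPI). This is a minimal sketch of that standard estimate; the branch fraction, taken fraction, and penalty below are hypothetical values chosen for illustration, not figures from the slides.

```python
def effective_cpi(branch_fraction, taken_fraction, penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction when each taken branch
    wastes penalty_cycles pipeline slots."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


# Hypothetical workload: 20% branches, 60% of them taken, 2-cycle penalty.
cpi = effective_cpi(branch_fraction=0.20, taken_fraction=0.60, penalty_cycles=2)
print(round(cpi, 2))  # 1.24
```

Under these assumed numbers the pipeline runs about 24% slower than its ideal rate of one instruction per cycle.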

Superscalar Architectures

Executes more than 1 instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units

Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm

Intel Hyper Threading (HT)

Introduced with Intel Pentium 4

Allows 2 different resources of CPU to be used at the same time

While 1st thread (instruction) is working with integers (integer unit), 2nd thread can work on floating point numbers (floating point unit)

OS sees 2 logical CPUs

Achieved through a mix of shared, replicated, & partitioned chip resources such as:

Registers

Arithmetic units

Cache memory

Amdahl’s Law

What’s maximum expected improvement to an overall system when only part of it is improved?

Amdahl said this relationship is not linear

Amdahl’s Law (Cont.)

Best you could ever hope to do:

$$\text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$

Amdahl’s Law – Example

Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{\text{old}}$$

$$\text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053$$
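The same calculation can be written as a small function, which makes it easy to try other fractions. This is a minimal sketch of the usual form of Amdahl's Law; the function and parameter names are illustrative.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only fraction_enhanced of the execution
    time is sped up by a factor of speedup_enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def amdahl_limit(fraction_enhanced):
    """Best possible speedup: the enhanced part takes no time at all."""
    return 1.0 / (1.0 - fraction_enhanced)


print(round(amdahl_speedup(0.10, 2), 3))  # 1.053 (FP runs 2X, 10% of instructions)
print(round(amdahl_limit(0.10), 3))       # 1.111 (even with infinitely fast FP)
```

Even an infinitely fast floating point unit gives at most about 1.11X overall, because 90% of the work is untouched.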

Moore’s Law – Today’s Status

Moore’s Law: the number of transistors on a chip tends to double about every 2 years

Transistor count still rising

Clock speed flattening sharply

Source: http://www.extremetech.com/wp-content/uploads/2012/02/CPU-Scaling.jpg

Dual Core

Introduced by IBM Power4

However, AMD brought it to consumer market

Combines 2 independent CPUs & their respective caches onto a single silicon chip

Provides better performance improvement than HT

True parallelism

Multi-Core

Source: www.anandtech.com/show/5174/why-ivy-bridge-is-still-quad-core

Multi-Core (Cont.)

Source: www.legitreviews.com/intel-core-i7-4770k-haswell-3-5ghz-quad-core-cpu-review_2203

Multi-Core (Cont.)

Source: www.hardwarecanucks.com/news/cpu/intel-launch-8-core-xeon-nehalemex/

Multi-Cores + Hyper Threading

Source: www.notebookcheck.net/Intel-Core-i7-Notebook-Processor-Clarksfield.21025.0.html

Many-Cores

GPUs

Graphics Processing Unit

NVIDIA & ATI

SIMD – Single Instruction Multiple Data

Intel Xeon Phi

General purpose

[Figures: NVIDIA Tesla 2070 and Intel Xeon Phi]

Example Specifications

Specification                          GTX 480         Tesla 2070       Tesla K80
Peak double precision FP performance   650 Gigaflops   515 Gigaflops    2.91 Teraflops
Peak single precision FP performance   1.3 Teraflops   1.03 Teraflops   8.74 Teraflops
CUDA cores                             480             448              4992
Frequency of CUDA cores                1.40 GHz        1.15 GHz         560/875 MHz
Memory size (GDDR5)                    1536 MB         6 GB             24 GB
Memory bandwidth                       177.4 GB/sec    150 GB/sec       480 GB/sec
ECC memory                             No              Yes              Yes
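The peak single precision figures in the table are consistent with a simple estimate: cores × clock × operations per core per cycle. This is a minimal sketch, assuming each CUDA core completes one fused multiply-add (counted as 2 floating point operations) per cycle and that the Tesla K80 figure uses the higher 875 MHz clock; both assumptions are not stated on the slide.

```python
def peak_single_precision_flops(cuda_cores, clock_hz, flops_per_core_per_cycle=2):
    """Theoretical peak, assuming every core issues one fused
    multiply-add (2 FLOPs) on every cycle."""
    return cuda_cores * clock_hz * flops_per_core_per_cycle


# (CUDA cores, clock in Hz) taken from the table above.
gpus = {
    "GTX 480": (480, 1.40e9),
    "Tesla 2070": (448, 1.15e9),
    "Tesla K80": (4992, 875e6),  # assumed: the higher of the two listed clocks
}

for name, (cores, clock) in gpus.items():
    teraflops = peak_single_precision_flops(cores, clock) / 1e12
    print(f"{name}: {teraflops:.2f} Teraflops")
```

These are upper bounds; real workloads rarely keep every core busy on every cycle.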

CPU vs. GPU Architecture

GPU devotes more transistors to computation

Multithreaded SIMD Processor

Source: Computer Architecture by John L. Hennessy and David A. Patterson

NVIDIA CUDA Architecture

Intel Xeon Phi

Source: www.pcgameshardware.de/Xeon-Phi-Hardware-256199/News/Intel-Xeon-Phi-Hardware-Informationen-1040924/

Intel Xeon Phi (Cont.)

Source: www.altera.com/technology/system-design/articles/2012/multicore-many-core.html

Power Consumption

Dynamic energy

Transistor switches from 0 → 1 or 1 → 0

½ × Capacitive load × Voltage²

Dynamic power

½ × Capacitive load × Voltage² × Frequency switched

Static power consumption

Current_static × Voltage

Scales with number of transistors

Reducing voltage reduces energy

Reducing clock rate reduces power, not energy

Use power gating rather than only removing the clock signal
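The formulas above can be checked numerically. This is a minimal sketch; the capacitance, voltage, and frequency values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def dynamic_energy(capacitive_load, voltage):
    """Energy of one 0 -> 1 or 1 -> 0 transition (joules)."""
    return 0.5 * capacitive_load * voltage ** 2


def dynamic_power(capacitive_load, voltage, frequency_switched):
    """Dynamic energy per transition times transitions per second (watts)."""
    return dynamic_energy(capacitive_load, voltage) * frequency_switched


def static_power(static_current, voltage):
    """Leakage power, drawn even when nothing switches (watts)."""
    return static_current * voltage


# Hypothetical values: 1 nF switched capacitance, 1.0 V, 2 GHz.
cap, volt, freq = 1e-9, 1.0, 2e9

# Halving the clock rate halves power...
print(dynamic_power(cap, volt, freq), dynamic_power(cap, volt, freq / 2))
# ...but the energy per transition is unchanged, so a fixed amount of
# work costs the same dynamic energy and simply takes twice as long.
print(dynamic_energy(cap, volt))
# Lowering voltage by 15% cuts energy to 0.85^2 of the original.
print(dynamic_energy(cap, 0.85 * volt) / dynamic_energy(cap, volt))
```

Because energy depends on the square of the voltage, a 15% voltage reduction saves roughly 28% of the dynamic energy, while a slower clock only spreads the same energy over more time.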