
Computer Organization and Design
Performance
Montek Singh
Mon, Aug 26, 2013
Lecture 2


Page 2: Topics
- Defining “Performance”
- Performance measures
- How to improve performance
- Performance pitfalls
- Examples

Reading: Ch 1.4

Page 3: Why Study Performance?
- Helps us make intelligent design choices
- See through the marketing hype
- Key to understanding underlying computer organization
  - Why is some hardware faster than others for different programs?
  - What factors of system performance are hardware related?
    e.g., do we need a new machine ... or a new operating system?
  - How does a machine's instruction set affect its performance?

Page 4: What is Performance?
- Loosely: how fast can a computer complete a task?
- Examples of “tasks”:
  - Short tasks:
    - Crunch a bunch of numbers (say, calculate the mean)
    - Display a PDF document
    - Respond to a game console button press
  - Longer ones:
    - Search for a document on the hard drive
    - Rip a CD track into an mp3 file
    - Apply a Photoshop filter
    - Transcode/recode a video

Page 5: Which airplane is “best”?
Table columns: Airplane | Passenger capacity | Cruising range (miles) | Cruising speed (mph) | Passenger throughput (passengers x mph)

Page 6: Which airplane is “best”?
- How much faster is the Concorde than the 747?
  2.2 X (“X” means “factor of”)
- How much larger is the 747's capacity than the Concorde's?
  3.6 X
- It is roughly 4000 miles from Raleigh to Paris. What is the throughput of the 747 in passengers/hr? The Concorde?
- What is the latency (trip time) of the 747? The Concorde?
  6.56 hours, 2.96 hours
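The latency and throughput questions above can be worked through in a few lines of Python. The airplane table itself did not survive in this transcript, so the capacities and cruising speeds below are assumptions: they are the commonly quoted textbook figures, chosen because they reproduce the 2.2 X, 3.6 X, 6.56 hour and 2.96 hour answers on the slide.

```python
# Assumed table values (not present in the transcript): passengers and cruising speed in mph.
PLANES = {
    "Boeing 747": {"capacity": 470, "speed_mph": 610},
    "Concorde": {"capacity": 132, "speed_mph": 1350},
}
DISTANCE_MILES = 4000  # Raleigh to Paris, from the slide


def latency_hours(speed_mph, distance_miles=DISTANCE_MILES):
    """Latency: time for one trip."""
    return distance_miles / speed_mph


def throughput_passengers_per_hour(capacity, speed_mph, distance_miles=DISTANCE_MILES):
    """Throughput: passengers delivered per hour of flight time."""
    return capacity / latency_hours(speed_mph, distance_miles)


for name, plane in PLANES.items():
    lat = latency_hours(plane["speed_mph"])
    thr = throughput_passengers_per_hour(plane["capacity"], plane["speed_mph"])
    print(f"{name}: latency {lat:.2f} h, throughput {thr:.1f} passengers/h")
```

Under these assumed figures the Concorde wins on latency while the 747 wins on throughput, which is the point of asking which airplane is “best”.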

Page 7: Performance Metrics
- Latency: time from input to corresponding output
  - How long does it take for my program to run?
  - How long must I wait after typing return for the result?
  - Other examples?
- Throughput: results produced per unit time
  - How many results can be processed per second?
  - What is the average execution rate of my program?
  - How much work is getting done each second?
  - Other examples?

Page 8: Design Tradeoffs
- Performance is rarely the sole factor
  - Cost is important too
  - Energy/power consumption is often critical
- Frequently used compound metrics
  - Performance/Cost (throughput/$)
  - Performance/Power (throughput/watt)
  - Work/Energy (total work done per joule), for battery-powered devices

Page 9: Execution Time
- Elapsed time / wall clock time
  - counts everything (disk and memory accesses, I/O, etc.)
  - includes the impact of other programs
  - a useful number, but often not good for comparison purposes
- CPU time
  - does not include I/O or time spent running other programs
  - can be broken up into system time and user time
- Our focus: user CPU time
  - time spent executing actual instructions of “our” program

Page 10: Performance
- For some program running on machine X:
\[
\text{Performance}_X = \frac{\text{Program Executions}}{\text{Time}_X} = \frac{1}{\text{Execution Time}_X} \quad \text{(executions/sec)}
\]
- Relative performance: "X is n times faster than Y" means
\[
\frac{\text{Performance}_X}{\text{Performance}_Y} = n
\]
- Example:
  - Machine A runs a program in 20 seconds
  - Machine B runs the same program in 25 seconds
  - By how much is A faster than B? By how much is B slower than A?
    Performance_A = 1/20, Performance_B = 1/25
    Machine A is (1/20)/(1/25) = 1.25 times faster than Machine B
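As a minimal sketch, the relative-performance definition above translates directly into Python. The function names are illustrative, not from the lecture.

```python
def performance(execution_time_s):
    """Performance is the reciprocal of execution time (executions per second)."""
    return 1.0 / execution_time_s


def relative_performance(time_x_s, time_y_s):
    """Return n such that 'X is n times faster than Y'."""
    return performance(time_x_s) / performance(time_y_s)


# Machine A takes 20 s, machine B takes 25 s (values from the slide).
print(relative_performance(20, 25))  # A relative to B
print(relative_performance(25, 20))  # B relative to A, i.e. 1/1.25
```

Note that the ratio of performances is simply the inverse ratio of the execution times, 25/20.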

Page 11: Performance: Pitfalls of using %
- Same example: Machine A runs a program in 20 sec; B takes 25 sec
- By how much is A faster than B?
  - Is it (25-20)/25 = 20% faster?
  - Is it (25-20)/20 = 25% faster?
  - Correct answer: A is (Perf_A - Perf_B)/Perf_B faster
    (0.05 - 0.04) / (0.04) = 25% faster
- By how much is B slower than A?
  - Correct answer: B is (Perf_B - Perf_A)/Perf_A faster/slower
    (0.04 - 0.05) / (0.05) = -20% faster = 20% slower
- Confusing: A is 25% faster than B; B is 20% slower
- Better: A is 1.25 times faster than B; B is 1/1.25 as fast
- Also: percentages are only good up to 100%
  - Never say A is 13000% faster than B
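The asymmetry in the percentages above is easy to reproduce. The following sketch (the helper name is hypothetical) computes the "percent faster" figure from execution times using the performance-based definition on the slide.

```python
def percent_faster(time_x_s, time_y_s):
    """Percent by which X is faster than Y: (Perf_X - Perf_Y) / Perf_Y * 100."""
    perf_x = 1.0 / time_x_s
    perf_y = 1.0 / time_y_s
    return 100.0 * (perf_x - perf_y) / perf_y


a_vs_b = percent_faster(20, 25)  # A compared against B
b_vs_a = percent_faster(25, 20)  # B compared against A (negative means slower)
print(a_vs_b, b_vs_a)
```

The two results differ in magnitude (about 25 versus about -20) even though they describe the same pair of machines, which is why the ratio form "1.25 times faster" is less confusing.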

Page 12: CPU Clocking
- Operation of digital hardware is governed by a constant-rate clock
[Figure: clock timing diagram showing clock (cycles), data transfer and computation, update state, and clock period]
- Clock period: duration of a clock cycle
  - e.g., 250 ps = 0.25 ns = 250×10^-12 s
- Clock frequency (rate): cycles per second
  - e.g., 1/(0.25 ns) = 4 GHz = 4000 MHz = 4.0×10^9 Hz

Page 13: Program Clock Cycles
- Instead of reporting execution time in seconds, we often use clock cycle counts
  - Why? A newer generation of the same processor...
    - Often has the same cycle counts for the same program
    - But often has a different clock speed (e.g., 1 GHz changes to 1.5 GHz)
- Clock “ticks” indicate when machine state changes
  - an abstraction: allows time to be discrete instead of continuous
- Simple relation:
\[
\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}
\]

Page 14: Program Clock Cycles
- Relation:
\[
\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time}
\]
- or:
\[
\text{CPU Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}
\]
  - cycle time = time between ticks = seconds per cycle
  - clock rate (frequency) = cycles per second (1 Hz = 1 cycle/s)
- A 200 MHz clock has a cycle time of:
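The relation between cycle time, clock rate and CPU time can be checked numerically. This is a small sketch: the 200 MHz clock comes from the slide, while the cycle count used for the 1 GHz versus 1.5 GHz comparison is a made-up value for illustration.

```python
def cycle_time_s(clock_rate_hz):
    """Cycle time is the reciprocal of the clock rate."""
    return 1.0 / clock_rate_hz


def cpu_time_s(clock_cycles, clock_rate_hz):
    """CPU time = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


# 200 MHz clock from the slide, reported in nanoseconds.
print(cycle_time_s(200e6) * 1e9, "ns")

# Hypothetical program needing 3 billion cycles, on two generations of a processor.
cycles = 3e9
print(cpu_time_s(cycles, 1e9), "s at 1 GHz")
print(cpu_time_s(cycles, 1.5e9), "s at 1.5 GHz")
```

A 200 MHz clock works out to a 5 ns cycle time, and the same cycle count finishes sooner on the faster clock.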

Page 15: Instruction Count and CPI
- Instruction count for a program
  - Determined by
    - program
    - instruction set architecture (ISA)
    - compiler
- Average cycles per instruction (“CPI”)
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix
\[
\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}
\]
\[
\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}
\]
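A short sketch of the CPU time equation above, showing how instruction count, CPI and clock rate each move the result. All numbers here are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def cpu_time_s(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical baseline: 1 billion instructions, CPI of 2.0, 2 GHz clock.
baseline = cpu_time_s(1e9, 2.0, 2e9)
lower_cpi = cpu_time_s(1e9, 1.5, 2e9)           # e.g. a better processor organization
fewer_instructions = cpu_time_s(0.8e9, 2.0, 2e9)  # e.g. a better compiler
faster_clock = cpu_time_s(1e9, 2.0, 2.5e9)

print(baseline, lower_cpi, fewer_instructions, faster_clock)
```

Each of the three factors reduces CPU time only if the other two stay the same, which is the caveat behind "everything else being equal".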

Page 16: Computer Performance Measure
\[
\text{MIPS (Millions of Instructions per Second)} = \frac{\text{Frequency in MHz}}{\text{CPI (Average Clocks Per Instruction)}}
\]
- Historically: PDP-11, VAX, Intel 8086: CPI > 1
- Load/store RISC machines (MIPS, SPARC, PowerPC, miniMIPS): CPI = 1
- Modern CPUs (Pentium, Athlon): CPI < 1
- Unfortunate coincidence: this “MIPS” has nothing to do with the name of the MIPS processor we will be studying!
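The MIPS formula above can be cross-checked against the direct definition (instructions executed divided by execution time in microseconds). The sketch below uses hypothetical values for the clock rate, CPI and instruction count.

```python
def mips_from_clock(clock_rate_mhz, cpi):
    """MIPS = frequency in MHz / CPI."""
    return clock_rate_mhz / cpi


def mips_from_run(instruction_count, execution_time_s):
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


# Hypothetical machine and program.
clock_mhz, cpi, instruction_count = 500.0, 2.0, 10_000_000
execution_time_s = instruction_count * cpi / (clock_mhz * 1e6)

print(mips_from_clock(clock_mhz, cpi))
print(mips_from_run(instruction_count, execution_time_s))
```

Both routes give the same rating because execution time is itself instruction count times CPI divided by clock rate.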

Page 17: How to Improve Performance?
- Many ways to write the same equations:
- So, to improve performance (everything else being equal) you can either
  - Decrease the # of required cycles for a program;
  - Decrease the clock cycle time or, said another way,
  - Increase the clock rate;
  - Decrease the CPI (average clocks per instruction).
- MIPS Pitfall
  - Cannot compare MIPS of two different processors if they run different sets of instructions!
  - Meaningless Indicator of Processor Speed!

Page 18: How Many Cycles in a Program?
- For some processors (e.g., MIPS processor):
  # of cycles = # of instructions
[Figure: timeline of the 1st, 2nd, 3rd, 4th, 5th, 6th, ... instructions along a time axis]
- This assumption can be incorrect:
  - Different instructions take different amounts of time on different machines.
  - Memory accesses might require more cycles than other instructions.
  - Floating-point instructions might require multiple clock cycles to execute.
  - Branches might stall execution rate.

Page 19: Example
Our favorite program runs in 10 seconds on computer A, which has a 2 GHz clock. We are trying to help a computer designer build a new machine B, to run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?
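One way to work the example above is to compute machine A's cycle count, scale it by 1.2, and divide by the 6 second target. The sketch below does exactly that; the function name is illustrative.

```python
def required_clock_rate_hz(time_a_s, clock_rate_a_hz, cycle_ratio, target_time_s):
    """Clock rate B needs so that cycle_ratio x cycles_A finish in target_time_s."""
    cycles_a = time_a_s * clock_rate_a_hz  # CPU time x clock rate
    cycles_b = cycle_ratio * cycles_a      # B needs 1.2x as many cycles
    return cycles_b / target_time_s


rate_b = required_clock_rate_hz(time_a_s=10, clock_rate_a_hz=2e9,
                                cycle_ratio=1.2, target_time_s=6)
print(rate_b / 1e9, "GHz")
```

Machine A needs 20×10^9 cycles, so B needs 24×10^9 cycles in 6 seconds, which is a 4 GHz clock: twice A's rate for a 1.67 X speedup.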

Page 20: Now that we understand cycles
- A given program will require
  - some number of instructions (machine instructions)
  - some number of cycles
  - some number of seconds
- We have a vocabulary that relates these quantities:
  - cycle time (seconds per cycle)
  - clock rate (cycles per second)
  - CPI (average clocks per instruction)
    - a floating-point intensive application might have a higher CPI
  - MIPS (millions of instructions per second)
    - this would be higher for a program using simple instructions

Page 21: Performance Traps
- Performance is determined by the execution time of a program that you care about.
- Do any of the other variables equal performance?
  - # of cycles to execute program?
  - # of instructions in program?
  - # of cycles per second?
  - average # of cycles per instruction?
  - average # of instructions per second?
- Common pitfall:
  - Thinking that only one of the variables is indicative of performance when it really is not!

Page 22: CPI Example
- Suppose we have two implementations of the same instruction set architecture (ISA)
  - If two machines have the same ISA, which quantity (e.g., clock rate, CPI) is the same?
- For some program:
  - Computer A has a clock cycle time of 250 ps and a CPI of 2.0
  - Computer B has a clock cycle time of 500 ps and a CPI of 1.2
- Which machine is faster for this program, and by how much?
  A is faster by 1.2 X
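The 1.2 X answer above follows from the fact that both machines execute the same number of instructions (same ISA, same program), so only the time per instruction matters. A minimal sketch:

```python
def time_per_instruction_ps(cycle_time_ps, cpi):
    """Average time per instruction = CPI x clock cycle time."""
    return cycle_time_ps * cpi


a = time_per_instruction_ps(cycle_time_ps=250, cpi=2.0)  # computer A
b = time_per_instruction_ps(cycle_time_ps=500, cpi=1.2)  # computer B

# The instruction count is identical on both machines, so it cancels in the ratio.
print(a, b, b / a)
```

A spends 500 ps per instruction and B spends 600 ps, so A is faster by 600/500 = 1.2 even though its CPI is higher.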

Page 23: CPI in More Detail
- If different instruction classes take different numbers of cycles:
\[
\text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right)
\]
- Weighted average CPI:
\[
\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)
\]
  where \(\text{Instruction Count}_i / \text{Instruction Count}\) is the relative frequency of class \(i\).

Page 24: Example: Compiler's Impact
Two different compilers are being tested for a 500 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software. The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 2 million Class C instructions. The second compiler's code uses 7 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
- Which program uses fewer instructions?
  Instructions_1 = (5+1+2) x 10^6 = 8 x 10^6
  Instructions_2 = (7+1+1) x 10^6 = 9 x 10^6
- Which sequence uses fewer clock cycles?
  Cycles_1 = (5(1)+1(2)+2(3)) x 10^6 = 13 x 10^6
  Cycles_2 = (7(1)+1(2)+1(3)) x 10^6 = 12 x 10^6
- CPI_1 = 13/8 = 1.625
- CPI_2 = 12/9 = 1.33
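The compiler comparison above can be reproduced with a short script that also carries the calculation through to execution time on the 500 MHz machine. The dictionary layout and function name are illustrative choices.

```python
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}  # cycles per instruction class, from the slide
CLOCK_RATE_HZ = 500e6                        # 500 MHz machine


def summarize(counts):
    """Return instruction count, cycle count, average CPI and execution time."""
    instructions = sum(counts.values())
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    return {
        "instructions": instructions,
        "cycles": cycles,
        "cpi": cycles / instructions,
        "time_s": cycles / CLOCK_RATE_HZ,
    }


compiler1 = summarize({"A": 5_000_000, "B": 1_000_000, "C": 2_000_000})
compiler2 = summarize({"A": 7_000_000, "B": 1_000_000, "C": 1_000_000})
print(compiler1)
print(compiler2)
```

The second compiler emits more instructions yet needs fewer cycles, so its code finishes sooner: instruction count alone does not predict performance.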

Page 25: CPI Example
- Alternative compiled code versions using instructions in classes A, B, C

  Class              A   B   C
  CPI for class      1   2   3
  IC for version 1   2   1   2
  IC for version 2   4   1   1

- Version 1: IC = 5
  - Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  - Avg. CPI = 10/5 = 2.0
- Version 2: IC = 6
  - Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  - Avg. CPI = 9/6 = 1.5
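The same averages can be obtained from the relative-frequency form of the weighted CPI, where each class CPI is weighted by the fraction of instructions in that class. A minimal sketch using the table values above (the function name is illustrative):

```python
def weighted_cpi(class_cpi, class_counts):
    """Average CPI = sum over classes of CPI_i x (count_i / total count)."""
    total = sum(class_counts)
    return sum(cpi * count / total for cpi, count in zip(class_cpi, class_counts))


CLASS_CPI = [1, 2, 3]  # classes A, B, C

print(weighted_cpi(CLASS_CPI, [2, 1, 2]))  # version 1
print(weighted_cpi(CLASS_CPI, [4, 1, 1]))  # version 2
```

Version 2 shifts the mix toward the cheap class A instructions, which is why its average CPI is lower despite the larger instruction count.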

Page 26: Performance Summary
The BIG Picture
- Performance depends on
  - Algorithm: affects IC, possibly CPI
  - Programming language: affects IC, CPI
  - Compiler: affects IC, CPI
  - Instruction set architecture: affects IC, CPI, Cycle Time

Page 27: Benchmarks
- Performance is best determined by running a real application
  - Use programs typical of expected workload
  - Or, typical of expected class of applications
    - e.g., compilers/editors, scientific applications, graphics, etc.
- Small benchmarks
  - nice for architects and designers
  - easy to standardize
  - can be abused
- SPEC (System Performance Evaluation Cooperative)
  - companies have agreed on a set of real programs and inputs
  - can still be abused
  - valuable indicator of performance (and compiler technology)

Page 28: SPEC ’95
- Periodically updated with newer benchmarks
  - SPEC 2000, SPEC 2006

  Benchmark   Description
  go          Artificial intelligence; plays the game of Go
  m88ksim     Motorola 88k chip simulator; runs test program
  gcc         The Gnu C compiler generating SPARC code
  compress    Compresses and decompresses file in memory
  li          Lisp interpreter
  ijpeg       Image compression and decompression
  perl        Manipulates strings and prime numbers in the special-purpose programming language Perl
  vortex      A database program
  tomcatv     A mesh generation program
  swim        Shallow water model with 513 x 513 grid
  su2cor      Quantum physics; Monte Carlo simulation
  hydro2d     Astrophysics; hydrodynamic Navier Stokes equations
  mgrid       Multigrid solver in 3-D potential field
  applu       Parabolic/elliptic partial differential equations
  turb3d      Simulates isotropic, homogeneous turbulence in a cube
  apsi        Solves problems regarding temperature, wind velocity, and distribution of pollutant
  fpppp       Quantum chemistry
  wave5       Plasma physics; electromagnetic particle simulation

Page 29: SPEC 2006
- Interesting to see how the mix of applications has changed over the years…
  - See http://www.spec.org/cpu2006/{CINT2006, CFP2006}

Page 30: Other Popular Benchmarks
- Several others are popular
  - industry uses SPEC
  - but ordinary consumers use others
    - more representative of the work they do!
    - e.g., gaming, Photoshop/Aperture, copying huge files, multimedia coding/decoding, etc.
- Geekbench is quite popular!

Page 31: Amdahl’s Law
- Possibly the most important law regarding computer performance:
- Principle: Make the common case fast!
  - Eventually, performance gains will be limited by what cannot be improved
    - e.g., you can raise the speed limit, but there are still traffic lights

Page 32: Amdahl’s Law: Example
- Example: "Suppose a program runs in 100 seconds on a machine, where multiplies are executed 80% of the time. How much do we need to improve the speed of multiplication if we want the program to run 4 times faster?"
  25 = 80/r + 20, so r = 16x
- How about making it 5 times faster?
  20 = 80/r + 20, so r = ?
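Both questions above are instances of the same equation: target time = affected time / r + unaffected time. The sketch below solves it for r and returns `None` when no finite improvement can reach the target. The function name and the `None` convention are choices made for this illustration.

```python
def required_improvement(total_time_s, affected_time_s, target_speedup):
    """Solve target_time = affected/r + unaffected for r.

    Returns None when the unaffected time alone already meets or exceeds
    the target time, so no finite r can achieve the requested speedup.
    """
    unaffected = total_time_s - affected_time_s
    target_time = total_time_s / target_speedup
    if target_time <= unaffected:
        return None
    return affected_time_s / (target_time - unaffected)


print(required_improvement(100, 80, 4))  # 25 = 80/r + 20
print(required_improvement(100, 80, 5))  # 20 = 80/r + 20
```

For the 5 X case the target time equals the 20 seconds that multiplication never touches, so 80/r would have to be zero: the unimproved part caps the speedup just below 5.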

Page 33: Example
- Suppose we enhance a machine making all floating-point instructions run FIVE times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if only half of the 10 seconds is spent executing floating-point instructions?
  5/5 + 5 = 6; Relative Perf = 10/6 = 1.67 x
- We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show at least a speedup of 3. What percentage of the execution time would floating-point instructions have to account for in this program in order to yield our desired speedup on this benchmark?
  100/3 = f/5 + (100 - f) = 100 - 4f/5; f = 83.33
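Both parts of the example above can be checked with two small functions: one computing the overall speedup from the floating-point fraction, and one inverting that relation to find the fraction needed for a target speedup. The closed form in `required_fraction` is a rearrangement of the slide's equation, and the names are illustrative.

```python
def overall_speedup(fp_fraction, fp_factor):
    """Speedup when fp_fraction of the time is made fp_factor times faster."""
    return 1.0 / ((1.0 - fp_fraction) + fp_fraction / fp_factor)


def required_fraction(target_speedup, fp_factor):
    """Fraction of time that must be floating point to reach target_speedup.

    Derived from 1/S = (1 - f) + f/k, giving f = (1 - 1/S) / (1 - 1/k).
    """
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / fp_factor)


print(overall_speedup(0.5, 5))         # half the time is floating point
print(100 * required_fraction(3, 5))   # percent needed for a speedup of 3
```

Plugging the required fraction back into `overall_speedup` returns the target speedup of 3, which confirms the 83.33% figure.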

Page 34: Power Trends
- In CMOS IC technology
[Figure annotations: ×1000, ×30, 5V → 1V]

Page 35: Reducing Power
- Suppose a new CPU has
  - 85% of capacitive load of old CPU
  - 15% voltage and 15% frequency reduction
- The power wall
  - We cannot reduce voltage further
  - We cannot remove more heat
- How else can we improve performance?
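The power saving for the new CPU described above can be estimated with a short calculation. The transcript does not contain the power equation itself, so this sketch assumes the standard CMOS dynamic power relation, Power = Capacitive load × Voltage² × Frequency.

```python
def relative_power(cap_ratio, voltage_ratio, freq_ratio):
    """New power / old power, assuming Power = C x V^2 x f (dynamic power only)."""
    return cap_ratio * voltage_ratio ** 2 * freq_ratio


# 85% capacitive load, 15% lower voltage, 15% lower frequency (from the slide).
ratio = relative_power(cap_ratio=0.85, voltage_ratio=0.85, freq_ratio=0.85)
print(round(ratio, 2))
```

Under that assumption the new CPU draws roughly half the power of the old one, with most of the leverage coming from the squared voltage term, which is why running out of room to lower voltage creates a power wall.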

Page 36: Uniprocessor Performance
- Constrained by power, instruction-level parallelism, memory latency

Page 37: Multiprocessors
- “Multicore” microprocessors
  - More than one processor core per chip
- Requires explicitly parallel programming
  - Hardware executes multiple instructions at once
    - Ideally, hidden from the programmer
  - Hard to do
    - Programming for performance
    - Load balancing
    - Optimizing communication and synchronization
    - But, newer OSes and libraries have been designed for this

Page 38: Manufacturing ICs
- IC = Integrated Circuit (“chip”)
- Yield: proportion of working dies per wafer

Page 39: Integrated Circuit Cost
- Nonlinear relation to area and defect rate
  - Wafer cost and area are fixed
  - Defect rate determined by manufacturing process
  - Die area determined by architecture and circuit design
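The nonlinear relation above can be illustrated numerically. The cost equations are not present in this transcript, so the sketch below assumes the widely used textbook model: dies per wafer ≈ wafer area / die area, yield = 1 / (1 + defects per area × die area / 2)², and cost per die = wafer cost / (dies per wafer × yield). All input numbers are hypothetical.

```python
def dies_per_wafer(wafer_area_cm2, die_area_cm2):
    """Approximate die count, ignoring wasted area at the wafer edge."""
    return wafer_area_cm2 / die_area_cm2


def die_yield(defects_per_cm2, die_area_cm2):
    """Assumed yield model: 1 / (1 + defects per area x die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2 / 2.0) ** 2


def cost_per_die(wafer_cost, wafer_area_cm2, die_area_cm2, defects_per_cm2):
    """Wafer cost spread over the dies that actually work."""
    good_dies = dies_per_wafer(wafer_area_cm2, die_area_cm2) * die_yield(
        defects_per_cm2, die_area_cm2)
    return wafer_cost / good_dies


# Hypothetical wafer: $5000, 700 cm^2, 0.5 defects per cm^2.
small = cost_per_die(5000, 700, die_area_cm2=1.0, defects_per_cm2=0.5)
large = cost_per_die(5000, 700, die_area_cm2=2.0, defects_per_cm2=0.5)
print(small, large, large / small)
```

In this model doubling the die area more than doubles the cost per die, because fewer dies fit on the wafer and a smaller share of them work.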

Page 40: Fallacy: Low Power at Idle
- AMD X4 power benchmark:
  - At 100% load: 295W (max power)
  - At 50% load: 246W (83% max power)
  - At 10% load: 180W (still consumes 61% of max power)
- Google data center
  - Mostly operates at 10% - 50% load
  - At 100% load less than 1% of the time
- Industry challenge: design processors to make power proportional to load

Page 41: Remember
- Performance is specific to a particular program
  - Total execution time is a consistent summary of performance
- For a given architecture, performance comes from:
  - increases in clock rate (without adverse CPI effects)
  - improvements in processor organization that lower CPI
  - compiler enhancements that lower CPI and/or instruction count
- Pitfall: expecting improvements in one aspect of a machine's performance to affect the total performance
- You should not always believe everything you read!
  - Read carefully! Especially what the business guys say!

Page 42: Concluding Remarks
- Cost/performance is improving
  - Due to underlying technology development
- Hierarchical layers of abstraction
  - In both hardware and software
- Instruction set architecture
  - The hardware/software interface
- Execution time: the best performance measure
- Power is a limiting factor
  - Use parallelism to improve performance