L03 Principles


  • 8/9/2019 L03 Principles


    Roman

    Japanese

    Chinese (compute in hex?)


COMP 206: Computer Architecture and Implementation

Montek Singh
Thu, Jan 22, 2009

Lecture 3: Quantitative Principles


Quantitative Principles of Computer Design

This is an introduction to design and analysis:

Take Advantage of Parallelism
Principle of Locality
Focus on the Common Case
Amdahl's Law
The Processor Performance Equation


1) Taking Advantage of Parallelism (examples)

Increase throughput of a server computer via multiple processors or multiple disks.

Detailed HW design:
Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand.
Multiple memory banks are searched in parallel in set-associative caches.

Pipelining (next slides).


Pipelining

Overlap instruction execution to reduce the total time to complete an instruction sequence.

Not every instruction depends on its immediate predecessor, so executing instructions completely or partially in parallel is possible.

Classic 5-stage pipeline:
1) Instruction Fetch (Ifetch)
2) Register Read (Reg)
3) Execute (ALU)
4) Data Memory Access (Dmem)
5) Register Write (Reg)


Pipelined Instruction Execution

Instruction order (down) vs. time in clock cycles (across):

             Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
Instr. 1     Ifetch   Reg      ALU      DMem     Reg
Instr. 2              Ifetch   Reg      ALU      DMem     Reg
Instr. 3                       Ifetch   Reg      ALU      DMem     Reg
Instr. 4                                Ifetch   Reg      ALU      DMem     Reg


Limits to Pipelining

Hazards prevent the next instruction from executing during its designated clock cycle:

Structural hazards: attempt to use the same hardware to do two different things at once.
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).


Increasing Clock Rate

Pipelining is also used for this. Clock rate is determined by gate delays.

[Figure: blocks of combinational logic separated by latches or registers at the stage boundaries]


2) The Principle of Locality

The Principle of Locality: programs access a relatively small portion of the address space. Also, they reuse data.

Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).

For the last 30 years, HW has relied on locality for memory performance.
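Spatial locality can be illustrated by traversing the same matrix in layout order versus against it. This is a hypothetical sketch (the matrix and function names are ours, not from the slides); in CPython the timing difference is muted because interpreter overhead dominates, but the access-pattern principle is the one the slide describes.

```python
import time

# Hypothetical demo: a 500x500 matrix stored row by row (row-major layout).
N = 500
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(m):
    # Visits elements in layout order: good spatial locality.
    return sum(v for row in m for v in row)

def sum_col_major(m):
    # Strides across rows between consecutive accesses: poor spatial locality.
    return sum(m[i][j] for j in range(N) for i in range(N))

t0 = time.perf_counter(); s1 = sum_row_major(matrix); t1 = time.perf_counter()
s2 = sum_col_major(matrix);                           t2 = time.perf_counter()
print(s1 == s2, f"row-major {t1 - t0:.4f}s vs col-major {t2 - t1:.4f}s")
```

In C the column-major traversal of a large array can be several times slower because each access misses the cache line fetched by the previous one.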


Levels of the Memory Hierarchy

Level         Capacity       Access time               Cost        Staging/transfer unit              Managed by
Registers     100s bytes     300-500 ps (0.3-0.5 ns)   -           instr. operands, 1-8 bytes         program/compiler
L1/L2 cache   10s-100s KB    ~1 ns - ~10 ns            $1000s/GB   blocks, 32-64 bytes (L1),          cache controller
                                                                   64-128 bytes (L2)
Main memory   GB             80 ns - 200 ns            ~$100/GB    pages, 4K-8K bytes                 OS
Disk          10s TB         10 ms (10,000,000 ns)     ~$1/GB      files, MB                          user/operator
Tape          infinite       sec-min                   ~$1/GB      -                                  -

Upper levels are faster; lower levels are larger.


3) Focus on the Common Case

In making a design trade-off, favor the frequent case over the infrequent case.
e.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first.
e.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first.

The frequent case is often simpler and can be done faster than the infrequent case.
e.g., overflow is rare when adding two numbers, so improve performance by optimizing the more common case of no overflow.
This may slow down the overflow case, but overall performance is improved by optimizing for the normal case.

What is the frequent case, and how much is performance improved by making that case faster? => Amdahl's Law.


4) Amdahl's Law (History, 1967)

"Validity of the single processor approach to achieving large scale computing capabilities," G. M. Amdahl, AFIPS Conference Proceedings, pp. 483-485, April 1967.
http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf

Historical context:
Amdahl was demonstrating the continued validity of the single-processor approach and the weaknesses of the multiple-processor approach.
The paper contains no mathematical formulation, just arguments and simulation:
"The nature of this overhead appears to be sequential so that it is unlikely to be amenable to parallel processing techniques."
"A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel performance rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude."

Nevertheless, it is of widespread applicability in all kinds of situations.

Speedup

The book shows two forms of the speedup equation:

    Speedup_overall = ExTime_new / ExTime_old

    Speedup_overall = ExTime_old / ExTime_new

We will use the second because it gives speedup factors like 2X.


4) Amdahl's Law

    ExTime_new = ExTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Best you could ever hope to do:

    Speedup_maximum = 1 / (1 − Fraction_enhanced)


Amdahl's Law Example

The new CPU is 10X faster. The server is I/O bound, so 60% of the time is spent waiting for I/O:

    Speedup_overall = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
                    = 1 / [(1 − 0.4) + 0.4/10]
                    = 1 / 0.64
                    = 1.56

It's human nature to be attracted by "10X faster," vs. keeping in perspective that it's just 1.6X faster.
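The slide's arithmetic can be checked directly with Amdahl's Law as given above (a minimal sketch; the function name is ours):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Slide example: CPU is 10X faster, but only 40% of the time is CPU-bound.
print(round(amdahl_speedup(0.4, 10), 2))  # → 1.56
```

Note that even an infinitely fast CPU (speedup_enhanced → ∞) would cap the overall speedup at 1/0.6 ≈ 1.67X here.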


Amdahl's Law for Multiple Tasks

    R_avg = 1 / Σ_i (F_i / R_i),   where Σ_i F_i = 1

R_avg: average execution rate (performance), in [results/second].
F_i: fraction of results generated at rate R_i [results/second].

Note: F_i is NOT the fraction of time spent working at this rate.

"Bottleneckology: Evaluating Supercomputers," Jack Worlton, COMPCON 85, pp. 405-406.


Example

30% of results are generated at the rate of 1 MFLOPS, 20% at 10 MFLOPS, 50% at 100 MFLOPS. What is the average performance in MFLOPS? What is the bottleneck?

    R_avg = 1 / (0.3/1 + 0.2/10 + 0.5/100)
          = 1 / [(30 + 2 + 0.5) / 100]
          = 100 / 32.5
          = 3.08 MFLOPS

Fractions of time spent at each rate: 30/32.5 = 92.3%, 2/32.5 = 6.2%, 0.5/32.5 = 1.5%.

Bottleneck: the rate that consumes most of the time — here the 1 MFLOPS rate, which accounts for 92.3% of the time.
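The average-rate formula above is a weighted harmonic mean, and the time fractions fall out of the same sum. A small sketch (names are ours):

```python
def average_rate(fractions, rates):
    # R_avg = 1 / sum(F_i / R_i), with sum(F_i) == 1 (fractions of *results*).
    assert abs(sum(fractions) - 1.0) < 1e-9
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

fractions = [0.30, 0.20, 0.50]
rates = [1, 10, 100]  # MFLOPS
r_avg = average_rate(fractions, rates)

# Fraction of *time* spent at each rate (not fraction of results):
time_shares = [(f / r) * r_avg for f, r in zip(fractions, rates)]
print(round(r_avg, 2), [round(t, 3) for t in time_shares])  # → 3.08 [0.923, 0.062, 0.015]
```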


Another Example

Which change is more effective on a certain machine: speeding up 10-fold the floating-point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating-point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and the two changes are mutually exclusive.)

Notation:
F_sqrt = fraction of FP sqrt results; R_sqrt = rate of producing FP sqrt results
F_non-sqrt = fraction of non-sqrt results; R_non-sqrt = rate of producing non-sqrt results
F_fp = fraction of FP results; R_fp = rate of producing FP results
F_non-fp = fraction of non-FP results; R_non-fp = rate of producing non-FP results
R_before = average rate of producing results before enhancement
R_after = average rate of producing results after enhancement

From the given time fractions:

    F_non-sqrt / R_non-sqrt = 4 × (F_sqrt / R_sqrt)    (sqrt takes 20% of the time)
    F_non-fp / R_non-fp = F_fp / R_fp                  (FP takes 50% of the time)


Solution Using Amdahl's Law

Improve FP sqrt only. Let x = F_sqrt / R_sqrt:

    1/R_before = F_sqrt/R_sqrt + F_non-sqrt/R_non-sqrt = x + 4x = 5x
    1/R_after  = F_sqrt/(10 R_sqrt) + F_non-sqrt/R_non-sqrt = 0.1x + 4x = 4.1x
    R_after/R_before = 5x / 4.1x = 1.22

Improve all FP ops. Let y = F_fp / R_fp:

    1/R_before = F_fp/R_fp + F_non-fp/R_non-fp = y + y = 2y
    1/R_after  = F_fp/(2 R_fp) + F_non-fp/R_non-fp = 0.5y + y = 1.5y
    R_after/R_before = 2y / 1.5y = 1.33

So speeding up all FP operations (1.33X) is more effective than speeding up sqrt alone (1.22X).
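The same comparison drops straight out of the standard time-fraction form of Amdahl's Law, since sqrt is 20% of execution time and all FP ops are 50% (a sketch; the helper name is ours):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

sqrt_only = amdahl_speedup(0.20, 10)  # FP sqrt: 20% of time, made 10x faster
all_fp = amdahl_speedup(0.50, 2)      # all FP ops: 50% of time, made 2x faster
print(round(sqrt_only, 2), round(all_fp, 2))  # → 1.22 1.33
```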


Implications of Amdahl's Law

Improvements provided by a feature are limited by how often the feature is used.

As stated, Amdahl's Law is valid only if the system always works with exactly one of the rates.
Overlap between CPU and I/O operations? Then Amdahl's Law as given here is not applicable.

The bottleneck is the most promising target for improvements:
Make the common case fast.
Infrequent events, even if they consume a lot of time, will make little difference to performance.

Typical use: change only one parameter of the system, and compute the effect of this change.
The same program, with the same input data, should run on the machine in both cases.


5) Processor Performance

    CPU Time [sec/program] = CPU cycles for program [clock cycles/program] × clock cycle time [sec/clock cycle]

or

    CPU Time [sec/program] = CPU cycles for program [clock cycles/program] / clock rate [clock cycles/sec]


CPI — Clocks per Instruction

    CPI [clock cycles/instruction] = CPU cycles for program [clock cycles/program] / instruction count [instructions/program]

    CPU Time [sec/program] = instruction count × CPI × clock cycle time
                           = instruction count × CPI / clock rate


Details of CPI

We can break performance down into individual types of instructions (instructions of type i), for a simplistic CPU:

    CPU cycles = Σ_i (CPI_i × IC_i)

    CPI = Σ_i (CPI_i × IC_i) / instruction count = Σ_i CPI_i × (IC_i / instruction count)

    CPU performance = clock rate / (instruction count × CPI)
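The per-type breakdown above amounts to a frequency-weighted average, which is how the examples on the next slides compute CPI (a sketch; names are ours):

```python
def cpi_from_mix(mix):
    # mix: list of (frequency, CPI_i) pairs; frequencies must sum to 1.
    assert abs(sum(f for f, _ in mix) - 1.0) < 1e-9
    return sum(f * c for f, c in mix)

def cpu_time(instruction_count, mix, clock_rate_hz):
    return instruction_count * cpi_from_mix(mix) / clock_rate_hz

# Instruction mix used in the examples that follow:
mix = [(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)]
print(round(cpi_from_mix(mix), 2))  # → 1.57
```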


Processor Performance Equation

How can we improve performance?

Factor                                   Clock rate   CPI   Instruction count
Hardware technology (realization)        x
Hardware organization (implementation)   x            x
Instruction set (architecture)                        x     x
Compiler technology                                   x     x
Program                                               x     x


Example 1

A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?

Instruction type   Frequency   CPI
ALU ops            0.43        1
Loads              0.21        2
Stores             0.12        2
Branches           0.24        2


Example 1 (Solution)

Before the change:

Instruction type   Frequency   CPI
ALU ops            0.43        1
Loads              0.21        2
Stores             0.12        2
Branches           0.24        2

    CPI = 0.43×1 + (0.21 + 0.12 + 0.24)×2 = 1.57
    CPU time = IC × CPI × clock cycle time = IC × 1.57 × T = 1.57 IC T

After the change, let x = the fraction of all original instructions replaced by reg-mem ops:

    x = 0.25 × 0.43 = 0.1075

Instruction type   Frequency            CPI
ALU ops            (0.43 − x)/(1 − x)   1
Loads              (0.21 − x)/(1 − x)   2
Stores             0.12/(1 − x)         2
Branches           0.24/(1 − x)         3
Reg-mem ops        x/(1 − x)            2

    CPI = [(0.43 − x)×1 + (0.21 − x)×2 + 0.12×2 + 0.24×3 + x×2] / (1 − x)
        = 1.7025 / 0.8925 = 1.908
    CPU time = (1 − x) × IC × CPI × T = 0.8925 × IC × 1.908 × T = 1.703 IC T

Since CPU time increases, the change will not improve performance.
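The before/after arithmetic can be verified with a short script (a sketch; variable names are ours — relative CPU time is instruction count × CPI with the cycle time held fixed):

```python
def cpi_from_mix(mix):
    return sum(f * c for f, c in mix)

x = 0.25 * 0.43  # ALU ops fused with the load feeding them

before = [(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)]
after = [((0.43 - x) / (1 - x), 1), ((0.21 - x) / (1 - x), 2),
         (0.12 / (1 - x), 2), (0.24 / (1 - x), 3), (x / (1 - x), 2)]

# Relative CPU time, in units of (original IC) x (cycle time T):
time_before = 1.0 * cpi_from_mix(before)
time_after = (1 - x) * cpi_from_mix(after)
print(f"before: {time_before:.4f} IC*T, after: {time_after:.4f} IC*T")
```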


Example 2

A load-store machine has the characteristics shown below. An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?

Instruction type   Frequency   CPI
ALU ops            43%         1
Loads              21%         2
Stores             12%         2
Branches           24%         2


Example 2 (Solution)

Without optimization:

    CPI = 0.43×1 + (0.21 + 0.12 + 0.24)×2 = 1.57
    CPU time = IC × CPI × clock cycle time = IC × 1.57 × 2×10^-9 = 3.14×10^-9 × IC
    MIPS = 500 MHz / 1.57 = 318.5

With optimization, let x = 0.43/2 = 0.215 (the discarded ALU ops):

Instruction type   Frequency            CPI
ALU ops            (0.43 − x)/(1 − x)   1
Loads              0.21/(1 − x)         2
Stores             0.12/(1 − x)         2
Branches           0.24/(1 − x)         2

    CPI = [(0.43 − x)×1 + (0.21 + 0.12 + 0.24)×2] / (1 − x) = 1.355/0.785 = 1.73
    CPU time = (1 − x) × IC × 1.73 × 2×10^-9 = 2.72×10^-9 × IC
    MIPS = 500 MHz / 1.73 = 289.0

Performance increases, but MIPS decreases!
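The MIPS-versus-time mismatch can be reproduced numerically (a sketch; names are ours — MIPS = clock rate / (CPI × 10^6), and CPU time is scaled by the surviving fraction of instructions):

```python
CLOCK_RATE = 500e6  # 500 MHz

def stats(mix, ic_scale):
    cpi = sum(f * c for f, c in mix)
    cpu_time = ic_scale * cpi / CLOCK_RATE  # per original instruction
    mips = CLOCK_RATE / (cpi * 1e6)
    return cpu_time, mips

unopt = [(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)]
x = 0.43 / 2  # discarded ALU ops
opt = [((0.43 - x) / (1 - x), 1), (0.21 / (1 - x), 2),
       (0.12 / (1 - x), 2), (0.24 / (1 - x), 2)]

t1, m1 = stats(unopt, 1.0)
t2, m2 = stats(opt, 1 - x)
print(f"unopt: {m1:.1f} MIPS, opt: {m2:.1f} MIPS")
assert t2 < t1 and m2 < m1  # the optimized code is faster yet rates fewer MIPS
```

The discrepancy arises because the compiler removed mostly CPI-1 instructions, raising the average CPI even as total work shrinks.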


Performance of (Blocking) Caches

With no cache misses:

    CPU time = CPU cycles × clock cycle time
    CPU cycles = IC × CPI          (IC = instruction count)

With cache misses:

    CPU time = (CPU cycles + Memory stall cycles) × clock cycle time

    Memory stall cycles = Number of misses × Miss penalty
                        = IC × (Misses / Instruction) × Miss penalty
                        = IC × (Memory references / Instruction) × (Misses / Memory reference) × Miss penalty


Example

Assume we have a machine where the CPI is 2.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 40% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the machine be if all memory accesses were cache hits?

    CPI_misses = CPI + (Memory refs / Instruction) × Miss rate × Miss penalty
               = 2 + (1 + 0.4) × 0.02 × 25
               = 2 + 0.7 = 2.7

    CPU time_misses / CPU time_no misses = 2.7 / 2 = 1.35

So the machine would be 1.35X faster if all memory accesses were cache hits. Why (1 + 0.4)? Every instruction makes one instruction fetch, and 40% of instructions also make a data access.
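The stall-cycle formula from the previous slide plugs in directly (a sketch; the function name is ours):

```python
def cpi_with_misses(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty):
    # Adds average memory-stall cycles per instruction to the hit-case CPI.
    return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty

# 1 instruction fetch + 0.4 data references per instruction (from the slide).
cpi = cpi_with_misses(2.0, 1.0 + 0.4, 0.02, 25)
print(round(cpi, 2), round(cpi / 2.0, 2))  # → 2.7 1.35
```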


Fallacies and Pitfalls

Fallacies: commonly held misconceptions. When discussing a fallacy, we try to give a counterexample.

Pitfalls: easily made mistakes, often generalizations of principles that are true in limited contexts.

We show fallacies and pitfalls to help you avoid these errors.


Fallacies and Pitfalls (1/3)

Fallacy: Benchmarks remain valid indefinitely.
Once a benchmark becomes popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark: "benchmarksmanship."
Of 70 benchmarks from the 5 SPEC releases, 70% were dropped from the next release because they were no longer useful.

Pitfall: A single point of failure.
Rule of thumb for fault-tolerant systems: make sure that every component is redundant, so that no single component failure can bring down the whole system (e.g., power supply).


Fallacies and Pitfalls (2/3)

Fallacy: The rated MTTF of disks is 1,200,000 hours, or 140 years, so disks practically never fail.
Disk lifetime is ~5 years, so replace each disk every 5 years; on average, 28 replacement cycles wouldn't fail (140 years is a long time!).
Is that meaningful? A better unit: the percentage that fail in 5 years (next slide).


Fallacies and Pitfalls (3/3)

    MTTF = (Number of disks × Time period) / Failed disks

    Failed disks = (Number of disks × Time period) / MTTF
                 = (1000 disks × 5×365×24 hours) / 1,200,000 hours
                 = 37 disks

So 3.7% will fail over 5 years. But this is under pristine conditions:
little vibration, narrow temperature range, no power failures.

Real world: 3% to 6% of SCSI drives fail per year.
3400-6800 FIT, or 150,000-300,000 hour MTTF [Gray & van Ingen 05]
3% to 7% of ATA drives fail per year.
3400-8000 FIT, or 125,000-300,000 hour MTTF [Gray & van Ingen 05]
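The slide's failure estimate follows from rearranging the MTTF definition (a sketch; names are ours — the slide rounds the result up to 37 disks, i.e. ~3.7%):

```python
MTTF_HOURS = 1_200_000  # rated disk MTTF
disks = 1000
hours = 5 * 365 * 24    # 5 years of operation

# Expected failures among a population, assuming the rated MTTF holds.
failed = disks * hours / MTTF_HOURS
print(failed)  # → 36.5 (about 37 of 1000 disks, ~3.7%, over 5 years)
```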


Next Time

Instruction Set Architecture (Appendix B)


References

G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS Conference Proceedings, pp. 483-485, April 1967.
http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf