Computer Science 246: Computer Architecture
Spring 2010, Harvard University
Instructor: Prof. David Brooks ([email protected])
Lecture Outline
• Performance Metrics
• Averaging
• Amdahl’s Law
• Benchmarks
• The CPU Performance Equation
  – Optimal Pipelining Case Study
• Modern Processor Analysis
• Begin ISAs…
Performance Metrics
• Execution Time is often what we target
• Throughput (tasks/sec) vs. latency (sec/task)
• How do we decide the tasks? Benchmarks
  – What the customer cares about: real applications
  – Representative programs (SPEC, SYSMARK, etc.)
  – Kernels: code fragments from real programs (Linpack)
  – Toy programs: Quicksort, Sieve of Eratosthenes
  – Synthetic programs: just a representative instruction mix (Whetstone, Dhrystone)
Measuring Performance
• Total Execution Time:

  $\frac{1}{n}\sum_{i=1}^{n} \text{Time}_i$

• This is the arithmetic mean
  – It should be used when measuring performance in execution times (e.g., CPI)
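As a concrete check (not from the original slides), a minimal Python sketch with made-up per-program times:

```python
times = [10.0, 40.0, 70.0]           # hypothetical per-program times (seconds)

arith_mean = sum(times) / len(times)  # (1/n) * sum of Time_i
print(arith_mean)                     # 40.0 s; proportional to total execution time
```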
Measuring Performance
• Weighted Execution Time:

  $\sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i$

• What if P1 and P2 are not run equally? Weight each program’s time by how often it runs.
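A small sketch of the same idea, with hypothetical run frequencies as weights:

```python
times = [10.0, 40.0, 70.0]    # hypothetical per-program times (seconds)
weights = [0.7, 0.2, 0.1]     # hypothetical run frequencies (sum to 1)

weighted = sum(w * t for w, t in zip(weights, times))
print(weighted)               # 22.0 s; dominated by the frequently-run program
```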
Measuring Performance
• Normalized Execution Time
  – Normalize to a reference machine
  – Can only use the geometric mean (the arithmetic mean can vary depending on the reference machine)
  – Problem: the result is a ratio, not an execution time

  $\sqrt[n]{\prod_{i=1}^{n} \text{ExecutionTimeRatio}_i}$
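To see why only the geometric mean is safe here, a short Python sketch; the two machines’ times are made up for illustration:

```python
from math import prod

def geo_mean(xs):
    return prod(xs) ** (1.0 / len(xs))

times_A = [2.0, 10.0]   # machine A's times on P1, P2 (hypothetical)
times_B = [4.0, 5.0]    # machine B's times

norm_to_A = [b / a for a, b in zip(times_A, times_B)]  # B normalized to A
norm_to_B = [a / b for a, b in zip(times_A, times_B)]  # A normalized to B

# Arithmetic mean of ratios: each machine looks slower than the other,
# depending on which one is the reference!
print(sum(norm_to_A) / 2, sum(norm_to_B) / 2)   # 1.25 1.25
# Geometric mean is consistent regardless of reference: a tie.
print(geo_mean(norm_to_A), geo_mean(norm_to_B))  # 1.0 1.0
```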
Harmonic Mean: Motivation
• 30 mph for the first 10 miles
• 90 mph for the next 10 miles
• Average speed? (30+90)/2 = 60 mph
• WRONG! Average speed = total distance / total time
  = 20/(10/30 + 10/90) = 45 mph
Harmonic Mean
• Each program has O operations
• n programs execute nO operations in $\sum_i T_i$ total time
• The execution rate is then

  $\frac{nO}{\sum_i T_i} = \frac{n}{\sum_i (T_i/O)} = \frac{n}{\sum_i (1/P_i)}$

  where $P_i = O/T_i$ is the execution rate of program i
• The harmonic mean should be used when measuring performance in execution rates (IPC)
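A quick sketch verifying the driving example above with the harmonic-mean formula (equal work per "program", here 10 miles per leg):

```python
rates = [30.0, 90.0]                              # mph over two equal legs

hmean = len(rates) / sum(1.0 / r for r in rates)  # n / sum(1/P_i)
print(hmean)                                      # 45.0, = total distance / total time
```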
Amdahl’s Law (Law of Diminishing Returns)

• Very intuitive: make the common case fast
  $\text{Speedup} = \frac{\text{Execution time for task without enhancement}}{\text{Execution time for task using enhancement}}$

  $\text{Execution time}_{new} = \text{Execution time}_{old} \times \left((1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}\right)$

  $\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$
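A minimal Python sketch of this formula (the function name is ours, not from the slides); it reproduces the worked examples that follow:

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when fraction_enhanced of execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

print(amdahl_speedup(0.95, 1.1))   # ~1.094: big fraction, small enhancement
print(amdahl_speedup(0.05, 10.0))  # ~1.047: small fraction, big enhancement
```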
Amdahl’s Law Corollary
  $\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$

As $\text{Speedup}_{enhanced} \to \infty$:

  $\text{Speedup}_{overall} \to \frac{1}{1 - \text{Fraction}_{enhanced}}$
Amdahl’s Law Example

• Fraction_enhanced = 95%, Speedup_enhanced = 1.1x:
  Speedup_overall = 1/((1-0.95) + (0.95/1.1)) = 1.094
• Fraction_enhanced = 5%, Speedup_enhanced = 10x:
  Speedup_overall = 1/((1-0.05) + (0.05/10)) = 1.047
• Fraction_enhanced = 5%, Speedup_enhanced = ∞:
  Speedup_overall = 1/(1-0.05) = 1.053

Make the common case fast!
MIPS
• MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6)
• Problems
  – ISAs are not equivalent, e.g., RISC vs. CISC
    • 1 CISC instruction may equal many RISC instructions!
  – Programs use different instruction mixes
  – May be OK when comparing the same benchmarks, same ISA, same compiler, same OS
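A sketch of the pitfall with invented numbers: a machine that executes more, simpler instructions earns 4x the MIPS rating while being only 2x faster:

```python
def mips_rating(instr_count, cpi, clock_hz):
    exec_time = instr_count * cpi / clock_hz        # seconds
    return instr_count / (exec_time * 1e6), exec_time

cisc_like = mips_rating(1e9, 4.0, 500e6)   # (125.0 MIPS, 8.0 s)
risc_like = mips_rating(2e9, 1.0, 500e6)   # (500.0 MIPS, 4.0 s)
print(cisc_like, risc_like)                # 4x the MIPS, only 2x the speed
```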
MFLOPS
• Same as MIPS, just for FP ops
• Not useful either
  – Requires FP-intensive apps
  – Traditionally FP ops were slow, so INT could be ignored
  – BUT now memory ops can be the slowest!
• “Peak MFLOPS” is a common marketing fallacy
  – Basically it just says #FP-pipes × clock rate
GHz
• Is this a metric? Maybe as good as the others…
• One number, no benchmarks, what can be better?
• Many designs are frequency driven

Processor        Clock Rate  SPEC FP2000
IBM POWER3       450 MHz     434
Intel PIII       1.4 GHz     456
Intel Pentium 4  2.4 GHz     833
Itanium-2        1.0 GHz     1356
Benchmark Suites
• SPEC CPU2000 (int and float) (Desktop, Server)
• EEMBC (“embassy”), SPECjvm (Embedded)
• TPC-C, TPC-H, SPECjbb, ECperf (Server)
SPEC CPU2000: Integer Benchmarks
164.gzip     C    Compression
175.vpr      C    FPGA Circuit Placement and Routing
176.gcc      C    C Programming Language Compiler
181.mcf      C    Combinatorial Optimization
186.crafty   C    Game Playing: Chess
197.parser   C    Word Processing
252.eon      C++  Computer Visualization
253.perlbmk  C    PERL Programming Language
254.gap      C    Group Theory, Interpreter
255.vortex   C    Object-oriented Database
256.bzip2    C    Compression
300.twolf    C    Place and Route Simulator
SPEC CPU2000: Floating Point Benchmarks
168.wupwise   Fortran 77  Physics / Quantum Chromodynamics
171.swim      Fortran 77  Shallow Water Modeling
172.mgrid     Fortran 77  Multi-grid Solver: 3D Potential Field
173.applu     Fortran 77  Parabolic / Elliptic Partial Differential Equations
177.mesa      C           3-D Graphics Library
178.galgel    Fortran 90  Computational Fluid Dynamics
179.art       C           Image Recognition / Neural Networks
183.equake    C           Seismic Wave Propagation Simulation
187.facerec   Fortran 90  Image Processing: Face Recognition
188.ammp      C           Computational Chemistry
189.lucas     Fortran 90  Number Theory / Primality Testing
191.fma3d     Fortran 90  Finite-element Crash Simulation
200.sixtrack  Fortran 77  High Energy Nuclear Physics Accelerator Design
301.apsi      Fortran 77  Meteorology: Pollutant Distribution
Server Benchmarks
• TPC-C (Online Transaction Processing, OLTP)
  – Models a simple order-entry application
  – Thousands of concurrent database accesses
• TPC-H (Ad-hoc, decision support)
  – Data warehouse, backend analysis tools for data

System              # / Processor         tpm    $/tpm
Fujitsu PrimePower  128, 563MHz SPARC64   455K   $28.58
HP SuperDome        64, 875MHz PA8700     423K   $15.64
IBM p690            32, 1300MHz POWER4    403K   $17.80
CPU Performance Equation
• Execution Time = seconds/program

  $\frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}}$

• Who influences each term?
  – instructions/program: Program, Architecture (ISA), Compiler
  – cycles/instruction: Compiler (Scheduling), Organization (uArch), Microarchitects
  – seconds/cycle: Technology, Physical Design, Circuit Designers
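A quick numeric check of the equation, with made-up values for each term:

```python
instructions = 2_000_000_000   # instructions / program
cpi = 1.5                      # cycles / instruction
cycle_time = 1 / 2.4e9         # seconds / cycle (2.4 GHz clock)

exec_time = instructions * cpi * cycle_time
print(f"{exec_time:.3f} s")    # 1.250 s
```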
Common Architecture Trick
• Assume Instructions/Program (path length) is constant
  – Same benchmark, same compiler
  – Usually OK, but for some ideas the compiler may change
• Assume Seconds/Cycle (cycle time) is constant
  – “My tweak won’t impact cycle-time”
  – Often a bad assumption
  – Current designs are ~12-20 FO4 inverter delays per cycle
• Then just focus on Cycles/Instruction (CPI or IPC)
• Many architecture studies do just this!
Instruction Set Architecture
“Instruction Set Architecture is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine.”
IBM, Introducing the IBM 360 (1964)
• The ISA defines:
  – Operations that the processor can execute
  – Data transfer mechanisms + how to access data
  – Control mechanisms (branch, jump, etc.)
  – The “contract” between programmer/compiler + HW
Classifying ISAs
Stack
• Architectures with an implicit “stack”
  – Acts as source(s) and/or destination; TOS (top of stack) is implicit
  – Push and Pop operations have 1 explicit operand
• Example: C = A + B
    Push A   // S[++TOS] = Mem[A]
    Push B   // S[++TOS] = Mem[B]
    Add      // Temp1 = S[TOS--], Temp2 = S[TOS--], S[++TOS] = Temp1 + Temp2
    Pop C    // Mem[C] = S[TOS--]
• x86 FP uses a stack (complicates pipelining)
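A toy Python model of these stack semantics; the memory contents are hypothetical:

```python
mem = {"A": 2, "B": 3, "C": 0}   # hypothetical memory
stack = []

def push(addr): stack.append(mem[addr])                 # S[++TOS] = Mem[addr]
def add():      stack.append(stack.pop() + stack.pop()) # pop two, push sum
def pop(addr):  mem[addr] = stack.pop()                 # Mem[addr] = S[TOS--]

push("A"); push("B"); add(); pop("C")   # C = A + B
print(mem["C"])                          # 5
```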
Accumulator
• Architectures with one implicit register
  – Acts as source and/or destination
  – One other source is explicit
• Example: C = A + B
    Load A    // (Acc)umulator <= A
    Add B     // Acc <= Acc + B
    Store C   // C <= Acc
• Accumulator is implicit; a bottleneck?
• x86 uses accumulator concepts for integer code
Register
• Most common approach
  – Fast, temporary storage (small)
  – Explicit operands (register IDs)
• Example: C = A + B

    Register-memory:    Load/store:
    Load R1, A          Load R1, A
    Add R3, R1, B       Load R2, B
    Store R3, C         Add R3, R1, R2
                        Store R3, C

• All RISC ISAs are load/store
• IBM 360, Intel x86, Moto 68K are register-memory
Common Addressing Modes
Base/Displacement  Load R4, 100(R1)
Register Indirect  Load R4, (R1)
Indexed            Load R4, (R1+R2)
Direct             Load R4, (1001)
Memory Indirect    Load R4, @(R3)
Autoincrement      Load R4, (R2)+
Scaled             Load R4, 100(R2)[R3]
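A sketch of how each mode computes its effective address (EA); register and memory contents, and the 4-byte scale, are hypothetical:

```python
regs = {"R1": 100, "R2": 200, "R3": 500}   # hypothetical register file
mem = {500: 300}                           # hypothetical memory

ea_disp     = 100 + regs["R1"]             # Load R4, 100(R1)  -> EA = 200
ea_indirect = regs["R1"]                   # Load R4, (R1)     -> EA = 100
ea_indexed  = regs["R1"] + regs["R2"]      # Load R4, (R1+R2)  -> EA = 300
ea_direct   = 1001                         # Load R4, (1001)   -> EA = 1001
ea_mem_ind  = mem[regs["R3"]]              # Load R4, @(R3)    -> EA = Mem[500] = 300
ea_autoinc  = regs["R2"]; regs["R2"] += 4  # Load R4, (R2)+    -> EA = 200, then R2 += 4
ea_scaled   = 100 + regs["R2"] + 4 * regs["R3"]  # Load R4, 100(R2)[R3]
```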
What leads to a good/bad ISA?
• Ease of Implementation (job of the architect/designer)
  – Does the ISA lend itself to efficient implementations?
• Ease of Programming (job of the programmer/compiler)
  – Can the compiler use the ISA effectively?
• Future Compatibility
  – ISAs may last 30+ years
  – Special features, address range, etc. need to be thought out
Implementation Concerns
• Simple decoding (fixed length)
• Compactness (variable length)
• Simple instructions (no load/update)
  – These are the things that get microcoded these days
  – Deterministic latencies are key!
  – Instructions with multiple exceptions are difficult
• More/fewer registers?
  – Slower register files and decoding vs. better compilers
• Condition codes/flags (complicate scheduling!)
Programmability
• 1960s, early 70s
  – Code was mostly hand-coded
• Late 70s, early 80s
  – Most code was compiled, but hand-coded was better
• Mid-80s to present
  – Most code is compiled and almost as good as assembly
• Why?
Programmability: 70s, Early 80s
“Closing the Semantic Gap”

• Make the instruction set match high-level languages
• Efforts for computers to execute HLLs directly
  – e.g., the LISP Machine
    • Hardware type checking: special type bits let the type be checked efficiently at run-time
    • Hardware garbage collection
    • Fast function calls
    • Efficient representation of lists
• Never worked out… “Semantic Clash”
  – Too many HLLs? C was more popular?
  – Is this coming back with Java? (Sun’s picoJava)
Programmability: 1980s … 2000s
“In the Compiler We Trust”

• Wulf: Primitives, not Solutions
  – Compilers cannot effectively use complex instructions
  – Synthesize programs from primitives
• Regularity: same behavior in all contexts
  – No odd cases; things should be intuitive
• Orthogonality:
  – Data type independent of addressing mode
  – Addressing mode independent of operation performed
ISA Compatibility

“In Computer Architecture, no good idea ever goes unpunished.”
  – Marty Hopkins, IBM Fellow

• Never abandon an existing code base
• Extremely difficult to introduce a new ISA
  – Alpha failed, IA64 is struggling; the best solution may not win
• x86 is the most popular and the least liked!
• Hard to think ahead, but…
  – An ISA tweak may buy 5-10% today
  – 10 years later it may buy nothing, but must still be implemented
  – Examples: register windows, delayed branches
CISC vs. RISC
• The debate raged from the early 80s through the 90s
• Now it is fairly irrelevant
• Despite this, Intel (x86 => Itanium) and DEC/Compaq (VAX => Alpha) have tried to switch
• Research in the late 70s/early 80s led to RISC
  – IBM 801 – John Cocke – mid 70s
  – Berkeley RISC-1 (Patterson)
  – Stanford MIPS (Hennessy)
VAX
• 32-bit ISA; instructions could be huge (up to 321 bytes); 16 GPRs
• Operated on data types from 8 to 128 bits, plus decimals and strings
• Orthogonal, memory-to-memory; all operand modes supported
• Hundreds of special instructions
• Simple compiler; hand-coding was common
• CPI was over 10!
x86
• Variable-length ISA (1-16 bytes)
• FP operand stack
• 2-operand instructions (extended accumulator)
  – Register-register and register-memory support
• Scaled addressing modes
• Has been extended many times (as AMD did with x86-64)
• Intel initially went to IA64, now has backtracked (EM64T)
RISC vs. CISC Arguments
• RISC
  – Simple implementation
    • Load/store, fixed-format 32-bit instructions, efficient pipelines
  – Lower CPI
  – Compilers do a lot of the hard work
    • MIPS = Microprocessor without Interlocked Pipelined Stages
• CISC
  – Simple compilers (assists hand-coding; many addressing modes, many instructions)
  – Code density
MIPS/VAX Comparison
After the dust settled
• Turns out it doesn’t matter much
• Can decode CISC instructions into an internal “micro-ISA”
  – This takes a couple of extra cycles (PLA implementation) and a few hundred thousand transistors
  – In 20-stage-pipeline, 55M-transistor processors this is minimal
  – The Pentium 4 caches these micro-ops
• Actually may have some advantages
  – External ISA for compatibility; internal ISA can be tweaked each generation (Transmeta)