
Computer Science 246 Computer Architecture


Page 1: Computer Science 246 Computer Architecture

Computer Science 246: Computer Architecture
Spring 2010, Harvard University
Instructor: Prof. David Brooks ([email protected])

Page 2: Computer Science 246 Computer Architecture


Lecture Outline

• Performance Metrics
• Averaging
• Amdahl’s Law
• Benchmarks
• The CPU Performance Equation
  – Optimal Pipelining Case Study
• Modern Processor Analysis
• Begin ISAs…

Page 3: Computer Science 246 Computer Architecture


Performance Metrics

• Execution Time is often what we target
• Throughput (tasks/sec) vs. latency (sec/task)
• How do we decide the tasks? Benchmarks

  – What the customer cares about: real applications
  – Representative programs (SPEC, SYSMARK, etc.)
  – Kernels: code fragments from real programs (Linpack)
  – Toy programs: Quicksort, Sieve of Eratosthenes
  – Synthetic programs: just a representative instruction mix (Whetstone, Dhrystone)

(The slide marks this list with a "Better" arrow: the entries are ordered from most to least representative.)

Page 4: Computer Science 246 Computer Architecture


Measuring Performance

• Total Execution Time:

    (1/n) × Σ_{i=1}^{n} Time_i

• This is the arithmetic mean
  – It should be used when measuring performance in execution times (or CPI)
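A minimal sketch of this average in Python; the benchmark times below are made up for illustration:

```python
def arithmetic_mean(times):
    """Arithmetic mean -- the right average for execution times."""
    return sum(times) / len(times)

# Made-up per-benchmark execution times in seconds.
times = [2.0, 4.0, 6.0]
print(arithmetic_mean(times))  # 4.0
```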

Page 5: Computer Science 246 Computer Architecture


Measuring Performance

• Weighted Execution Time:
• What if P1 and P2 are not run equally? Weight each time by how often the program runs:

    Σ_{i=1}^{n} Weight_i × Time_i
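The weighted sum can be sketched as follows; the program mix and times are made up:

```python
def weighted_mean(weights, times):
    """Weighted execution time: weight each program by how often it runs."""
    return sum(w * t for w, t in zip(weights, times))

# Made-up mix: P1 runs 75% of the time, P2 runs 25%.
print(weighted_mean([0.75, 0.25], [2.0, 10.0]))  # 4.0
```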

Page 6: Computer Science 246 Computer Architecture


Measuring Performance

• Normalized Execution Time
• Normalize to a reference machine
• Can only use the geometric mean (the arithmetic mean can vary depending on the reference machine)
• Problem: the result is a ratio, not an execution time

    ( Π_{i=1}^{n} ExecutionTimeRatio_i )^(1/n)
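A short sketch of the geometric mean; the example ratios are made up:

```python
import math

def geometric_mean(ratios):
    """Geometric mean of execution-time ratios against a reference machine."""
    return math.prod(ratios) ** (1.0 / len(ratios))

# Made-up ratios: 2x and 8x the reference machine's time.
print(geometric_mean([2.0, 8.0]))  # 4.0
```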

Page 7: Computer Science 246 Computer Architecture


Harmonic Mean: Motivation

• 30 mph for the first 10 miles
• 90 mph for the next 10 miles
• Average speed? (30 + 90)/2 = 60 mph
• WRONG! Average speed = total distance / total time
  = 20/(10/30 + 10/90) = 45 mph
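The driving example can be checked in a few lines:

```python
# Each leg covers the same distance at a different speed, so the correct
# average is total distance / total time (the harmonic mean of the speeds).
dist = 10.0                          # miles per leg
total_time = dist / 30 + dist / 90   # hours
avg_speed = 2 * dist / total_time    # 45 mph, not 60
print(avg_speed)
```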

Page 8: Computer Science 246 Computer Architecture


Harmonic Mean

• Each program has O operations
• n programs execute nO operations in Σ T_i
• The execution rate is then

    nO / Σ T_i = n / Σ (T_i/O) = n / Σ (1/P_i)

  where P_i = O/T_i is the execution rate of program i
• Harmonic mean should be used when measuring performance in execution rates (IPC)
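A minimal sketch of the harmonic mean of rates; the IPC values are made up:

```python
def harmonic_mean(rates):
    """Harmonic mean -- the right average for rates such as IPC."""
    return len(rates) / sum(1.0 / r for r in rates)

# Made-up IPCs from two programs with equal instruction counts.
print(harmonic_mean([1.0, 3.0]))  # ≈ 1.5
```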

Page 9: Computer Science 246 Computer Architecture


Amdahl’s Law (Law of Diminishing Returns)

• Very intuitive: make the common case fast

    Speedup = Execution time for task without enhancement
              / Execution time for task using enhancement

    Execution time_new = Execution time_old ×
        ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

    Speedup_overall = Execution time_old / Execution time_new
                    = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Page 10: Computer Science 246 Computer Architecture


Amdahl’s Law Corollary

    Speedup_overall = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

    As Speedup_enhanced → ∞:

    Speedup_overall → 1 / (1 − Fraction_enhanced)

Page 11: Computer Science 246 Computer Architecture


Amdahl’s Law Example

• Fraction_enhanced = 95%, Speedup_enhanced = 1.1x
  Speedup_overall = 1/((1 − 0.95) + (0.95/1.1)) ≈ 1.094
• Fraction_enhanced = 5%, Speedup_enhanced = 10x
  Speedup_overall = 1/((1 − 0.05) + (0.05/10)) ≈ 1.047
• Fraction_enhanced = 5%, Speedup_enhanced = ∞
  Speedup_overall = 1/(1 − 0.05) ≈ 1.053

Make the common case fast!
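The three cases above can be checked with a small helper:

```python
def amdahl_speedup(fraction, speedup):
    """Overall speedup when `fraction` of the time is sped up by `speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup)

print(amdahl_speedup(0.95, 1.1))           # ≈ 1.095
print(amdahl_speedup(0.05, 10.0))          # ≈ 1.047
print(amdahl_speedup(0.05, float("inf")))  # ≈ 1.053
```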

Page 12: Computer Science 246 Computer Architecture


MIPS

• MIPS = instruction count / (execution time × 10^6)
       = clock rate / (CPI × 10^6)
• Problems
  – ISAs are not equivalent, e.g. RISC vs. CISC
    • 1 CISC instruction may equal many RISC instructions!
  – Programs use different instruction mixes
  – May be OK when comparing the same benchmarks, same ISA, same compiler, same OS
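A quick sketch of the second form of the formula; the clock rate and CPI below are made up:

```python
def mips(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)

# Made-up machine: 2 GHz clock, average CPI of 2.
print(mips(2e9, 2.0))  # 1000.0
```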

Page 13: Computer Science 246 Computer Architecture


MFLOPS

• Same as MIPS, but counts only FP ops
• Not useful either
  – FP-intensive apps needed
  – Traditionally FP ops were slow, so INT ops could be ignored
  – BUT now memory ops can be the slowest!
• “Peak MFLOPS” is a common marketing fallacy
  – Basically it just reports #FP-pipes × clock rate

Page 14: Computer Science 246 Computer Architecture


GHz

• Is this a metric? Maybe as good as the others…
• One number, no benchmarks, what could be better?
• Many designs are frequency driven

    Processor         Clock Rate   SPEC FP2000
    IBM POWER3        450 MHz      434
    Intel PIII        1.4 GHz      456
    Intel Pentium 4   2.4 GHz      833
    Itanium-2         1.0 GHz      1356

Page 15: Computer Science 246 Computer Architecture


Benchmark Suites

• SPEC CPU2000 (int and float) (Desktop, Server)
• EEMBC (“embassy”), SPECjvm (Embedded)
• TPC-C, TPC-H, SPECjbb, ECperf (Server)

Page 16: Computer Science 246 Computer Architecture


SPEC CPU2000: Integer Benchmarks

    164.gzip      C     Compression
    175.vpr       C     FPGA Circuit Placement and Routing
    176.gcc       C     C Programming Language Compiler
    181.mcf       C     Combinatorial Optimization
    186.crafty    C     Game Playing: Chess
    197.parser    C     Word Processing
    252.eon       C++   Computer Visualization
    253.perlbmk   C     PERL Programming Language
    254.gap       C     Group Theory, Interpreter
    255.vortex    C     Object-oriented Database
    256.bzip2     C     Compression
    300.twolf     C     Place and Route Simulator

Page 17: Computer Science 246 Computer Architecture


SPEC CPU2000: Floating Point Benchmarks

    168.wupwise    Fortran 77   Physics / Quantum Chromodynamics
    171.swim       Fortran 77   Shallow Water Modeling
    172.mgrid      Fortran 77   Multi-grid Solver: 3D Potential Field
    173.applu      Fortran 77   Parabolic / Elliptic Partial Differential Equations
    177.mesa       C            3-D Graphics Library
    178.galgel     Fortran 90   Computational Fluid Dynamics
    179.art        C            Image Recognition / Neural Networks
    183.equake     C            Seismic Wave Propagation Simulation
    187.facerec    Fortran 90   Image Processing: Face Recognition
    188.ammp       C            Computational Chemistry
    189.lucas      Fortran 90   Number Theory / Primality Testing
    191.fma3d      Fortran 90   Finite-element Crash Simulation
    200.sixtrack   Fortran 77   High Energy Nuclear Physics Accelerator Design
    301.apsi       Fortran 77   Meteorology: Pollutant Distribution

Page 18: Computer Science 246 Computer Architecture


Server Benchmarks

• TPC-C (Online Transaction Processing, OLTP)
  – Models a simple order-entry application
  – Thousands of concurrent database accesses
• TPC-H (Ad-hoc, decision support)
  – Data warehouse, backend analysis tools for data

    System               # / Processor           tpm    $/tpm
    Fujitsu PrimePower   128, 563MHz SPARC64     455K   $28.58
    HP SuperDome         64, 875MHz PA8700       423K   $15.64
    IBM p690             32, 1300MHz POWER4      403K   $17.80

Page 19: Computer Science 246 Computer Architecture


CPU Performance Equation

• Execution Time = seconds/program

    seconds/program = (instructions/program) × (cycles/instruction) × (seconds/cycle)

• Who influences each term:
  – instructions/program: Program, Architecture (ISA), Compiler
  – cycles/instruction: Compiler (Scheduling), Organization (uArch), Microarchitects
  – seconds/cycle: Technology, Physical Design, Circuit Designers
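The product form of the equation can be sketched directly; the workload numbers are made up:

```python
def execution_time(instructions, cpi, clock_rate_hz):
    """seconds/program = instructions * (cycles/instruction) * (seconds/cycle)."""
    return instructions * cpi / clock_rate_hz

# Made-up workload: 1 billion instructions, CPI of 1.5, 2 GHz clock.
print(execution_time(1e9, 1.5, 2e9))  # 0.75
```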

Page 20: Computer Science 246 Computer Architecture


Common Architecture Trick

• Instructions/Program (path length) is constant
  – Same benchmark, same compiler
  – OK usually, but for some ideas the compiler may change
• Seconds/Cycle (cycle time) is constant
  – “My tweak won’t impact cycle time”
  – Often a bad assumption
  – Current designs are ~12-20 FO4 inverter delays per cycle
• Just focus on Cycles/Instruction (CPI or IPC)
• Many architecture studies do just this!

Page 21: Computer Science 246 Computer Architecture


Instruction Set Architecture

“Instruction Set Architecture is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine.”

IBM, Introducing the IBM 360 (1964)

• The ISA defines:
  – Operations that the processor can execute
  – Data transfer mechanisms + how to access data
  – Control mechanisms (branch, jump, etc.)
  – The “contract” between programmer/compiler + HW

Page 22: Computer Science 246 Computer Architecture


Classifying ISAs

Page 23: Computer Science 246 Computer Architecture


Stack

• Architectures with an implicit “stack”
  – Acts as source(s) and/or destination; TOS is implicit
  – Push and Pop operations have 1 explicit operand
• Example: C = A + B
  – Push A   // S[++TOS] = Mem[A]
  – Push B   // S[++TOS] = Mem[B]
  – Add      // Tem1 = S[TOS--], Tem2 = S[TOS--], S[++TOS] = Tem1 + Tem2
  – Pop C    // Mem[C] = S[TOS--]
• x86 FP uses a stack (complicates pipelining)
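The stack-machine semantics above can be sketched as a tiny interpreter; the instruction encoding and memory contents are illustrative, not any real ISA:

```python
def run_stack(program, mem):
    """Tiny stack-machine interpreter: TOS is implicit, push/pop take 1 operand."""
    stack = []
    for instr in program:
        op = instr[0]
        arg = instr[1] if len(instr) > 1 else None
        if op == "push":      # S[++TOS] = Mem[arg]
            stack.append(mem[arg])
        elif op == "add":     # pop two operands, push their sum (no operands)
            stack.append(stack.pop() + stack.pop())
        elif op == "pop":     # Mem[arg] = S[TOS--]
            mem[arg] = stack.pop()
    return mem

# C = A + B, with made-up memory contents.
mem = run_stack([("push", "A"), ("push", "B"), ("add",), ("pop", "C")],
                {"A": 3, "B": 4, "C": 0})
print(mem["C"])  # 7
```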

Page 24: Computer Science 246 Computer Architecture


Accumulator

• Architectures with one implicit register
  – Acts as source and/or destination
  – One other source is explicit
• Example: C = A + B
  – Load A    // (Acc)umulator <= A
  – Add B     // Acc <= Acc + B
  – Store C   // C <= Acc
• Accumulator is implicit; a bottleneck?
• x86 uses accumulator concepts for integer code
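For comparison with the stack style, a sketch of accumulator semantics; again the encoding and memory contents are illustrative:

```python
def run_accumulator(program, mem):
    """Tiny accumulator-machine interpreter: one implicit register (Acc)."""
    acc = 0
    for op, addr in program:
        if op == "load":      # Acc <= Mem[addr]
            acc = mem[addr]
        elif op == "add":     # Acc <= Acc + Mem[addr] (one explicit source)
            acc = acc + mem[addr]
        elif op == "store":   # Mem[addr] <= Acc
            mem[addr] = acc
    return mem

# C = A + B, with made-up memory contents.
mem = run_accumulator([("load", "A"), ("add", "B"), ("store", "C")],
                      {"A": 3, "B": 4, "C": 0})
print(mem["C"])  # 7
```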

Page 25: Computer Science 246 Computer Architecture


Register

• Most common approach
  – Fast, temporary storage (small)
  – Explicit operands (register IDs)
• Example: C = A + B

    Register-memory:       Load/store:
    Load  R1, A            Load  R1, A
    Add   R3, R1, B        Load  R2, B
    Store R3, C            Add   R3, R1, R2
                           Store R3, C

• All RISC ISAs are load/store
• IBM 360, Intel x86, Moto 68K are register-memory

Page 26: Computer Science 246 Computer Architecture


Common Addressing Modes

    Base/Displacement   Load R4, 100(R1)
    Register Indirect   Load R4, (R1)
    Indexed             Load R4, (R1+R2)
    Direct              Load R4, (1001)
    Memory Indirect     Load R4, @(R3)
    Autoincrement       Load R4, (R2)+
    Scaled              Load R4, 100(R2)[R3]

Page 27: Computer Science 246 Computer Architecture


What leads to a good/bad ISA?

• Ease of Implementation (job of architect/designer)
  – Does the ISA lend itself to efficient implementations?
• Ease of Programming (job of programmer/compiler)
  – Can the compiler use the ISA effectively?
• Future Compatibility
  – ISAs may last 30+ years
  – Special features, address range, etc. need to be thought out

Page 28: Computer Science 246 Computer Architecture


Implementation Concerns

• Simple decoding (fixed length)
• Compactness (variable length)
• Simple instructions (no load/update)
  – Things that get microcoded these days
  – Deterministic latencies are key!
  – Instructions with multiple exceptions are difficult
• More/fewer registers?
  – Slower register files, decoding, better compilers
• Condition codes/flags (scheduling!)

Page 29: Computer Science 246 Computer Architecture


Programmability

• 1960s, early 70s
  – Code was mostly hand-coded
• Late 70s, early 80s
  – Most code was compiled, but hand-coded was better
• Mid-80s to present
  – Most code is compiled and almost as good as assembly

• Why?

Page 30: Computer Science 246 Computer Architecture


Programmability: 70s, Early 80s “Closing the Semantic Gap”

• High-level languages match assembly languages
• Efforts for computers to execute HLLs directly
  – e.g. LISP Machine
    • Hardware type checking: special type bits let the type be checked efficiently at run-time
    • Hardware garbage collection
    • Fast function calls
    • Efficient representation of lists
• Never worked out… “Semantic Clash”
  – Too many HLLs? C was more popular?
  – Is this coming back with Java? (Sun’s picoJava)

Page 31: Computer Science 246 Computer Architecture


Programmability: 1980s … 2000s
“In the Compiler We Trust”

• Wulf: Primitives, not Solutions
  – Compilers cannot effectively use complex instructions
  – Synthesize programs from primitives
• Regularity: same behavior in all contexts
  – No odd cases; things should be intuitive
• Orthogonality:
  – Data type independent of addressing mode
  – Addressing mode independent of operation performed

Page 32: Computer Science 246 Computer Architecture


ISA Compatibility

“In Computer Architecture, no good idea ever goes unpunished.”
  (Marty Hopkins, IBM Fellow)

• Never abandon an existing code base
• Extremely difficult to introduce a new ISA
  – Alpha failed, IA64 is struggling; the best solution may not win
• x86 is the most popular and the least liked!
• Hard to think ahead, but…
  – An ISA tweak may buy 5-10% today
  – 10 years later it may buy nothing, but must still be implemented
    • Register windows, delayed branches

Page 33: Computer Science 246 Computer Architecture


CISC vs. RISC

• Debate raged from the early 80s through the 90s
• Now it is fairly irrelevant
• Despite this, Intel (x86 => Itanium) and DEC/Compaq (VAX => Alpha) have tried to switch
• Research in the late 70s/early 80s led to RISC
  – IBM 801 -- John Cocke, mid 70s
  – Berkeley RISC-1 (Patterson)
  – Stanford MIPS (Hennessy)

Page 34: Computer Science 246 Computer Architecture


VAX

• 32-bit ISA; instructions could be huge (up to 321 bytes); 16 GPRs
• Operated on data types from 8 to 128 bits, decimals, strings
• Orthogonal, memory-to-memory; all operand modes supported
• Hundreds of special instructions
• Simple compiler; hand-coding was common
• CPI was over 10!

Page 35: Computer Science 246 Computer Architecture


x86

• Variable-length ISA (1-15 bytes)
• FP operand stack
• 2-operand instructions (extended accumulator)
  – Register-register and register-memory support
• Scaled addressing modes
• Has been extended many times (as AMD did with x86-64)
• Intel initially went to IA64, now backtracked (EM64T)

Page 36: Computer Science 246 Computer Architecture


RISC vs. CISC Arguments

• RISC
  – Simple implementation
    • Load/store, fixed-format 32-bit instructions, efficient pipelines
  – Lower CPI
  – Compilers do a lot of the hard work
    • MIPS = Microprocessor without Interlocked Pipeline Stages
• CISC
  – Simple compilers (assists hand-coding, many addressing modes, many instructions)
  – Code density

Page 37: Computer Science 246 Computer Architecture


MIPS/VAX Comparison

Page 38: Computer Science 246 Computer Architecture


After the dust settled

• Turns out it doesn’t matter much
• Can decode CISC instructions into an internal “micro-ISA”
  – This takes a couple of extra cycles (PLA implementation) and a few hundred thousand transistors
  – In 20-stage pipelines, 55M-transistor processors, this is minimal
  – Pentium 4 caches these micro-ops
• Actually may have some advantages
  – External ISA for compatibility; internal ISA can be tweaked each generation (Transmeta)