Computer Science 246: Computer Architecture
Spring 2010, Harvard University
Instructor: Prof. David Brooks ([email protected])
Lecture Outline
• Performance Metrics
• Averaging
• Amdahl’s Law
• Benchmarks
• The CPU Performance Equation
  – Optimal Pipelining Case Study
• Modern Processor Analysis
• Begin ISAs…
Performance Metrics
• Execution Time is often what we target
• Throughput (tasks/sec) vs. latency (sec/task)
• How do we decide the tasks? Benchmarks
  – What the customer cares about: real applications
  – Representative programs (SPEC, SYSMARK, etc.)
  – Kernels: code fragments from real programs (Linpack)
  – Toy programs: Quicksort, Sieve of Eratosthenes
  – Synthetic programs: just a representative instruction mix (Whetstone, Dhrystone)
Measuring Performance
• Total Execution Time:

  $\frac{1}{n}\sum_{i=1}^{n} \text{Time}_i$

• This is the arithmetic mean
  – It should be used when measuring performance in execution times (e.g., CPI)
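As a concrete check (not from the original slides), a minimal Python sketch with made-up per-program times:

```python
times = [10.0, 40.0, 70.0]           # hypothetical per-program times (seconds)

arith_mean = sum(times) / len(times)  # (1/n) * sum of Time_i
print(arith_mean)                     # 40.0 s; proportional to total execution time
```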
Measuring Performance
• Weighted Execution Time:

  $\sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i$

• What if P1 and P2 are not run equally? Weight each program’s time by how often it runs.
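A small sketch of the same idea, with hypothetical run frequencies as weights:

```python
times = [10.0, 40.0, 70.0]    # hypothetical per-program times (seconds)
weights = [0.7, 0.2, 0.1]     # hypothetical run frequencies (sum to 1)

weighted = sum(w * t for w, t in zip(weights, times))
print(weighted)               # 22.0 s; dominated by the frequently-run program
```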
Measuring Performance
• Normalized Execution Time
  – Normalize to a reference machine
  – Can only use the geometric mean (the arithmetic mean can vary depending on the reference machine)
  – Problem: the result is a ratio, not an execution time

  $\sqrt[n]{\prod_{i=1}^{n} \text{ExecutionTimeRatio}_i}$
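To see why only the geometric mean is safe here, a short Python sketch; the two machines’ times are made up for illustration:

```python
from math import prod

def geo_mean(xs):
    return prod(xs) ** (1.0 / len(xs))

times_A = [2.0, 10.0]   # machine A's times on P1, P2 (hypothetical)
times_B = [4.0, 5.0]    # machine B's times

norm_to_A = [b / a for a, b in zip(times_A, times_B)]  # B normalized to A
norm_to_B = [a / b for a, b in zip(times_A, times_B)]  # A normalized to B

# Arithmetic mean of ratios: each machine looks slower than the other,
# depending on which one is the reference!
print(sum(norm_to_A) / 2, sum(norm_to_B) / 2)   # 1.25 1.25
# Geometric mean is consistent regardless of reference: a tie.
print(geo_mean(norm_to_A), geo_mean(norm_to_B))  # 1.0 1.0
```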
Harmonic Mean: Motivation
• 30 mph for the first 10 miles
• 90 mph for the next 10 miles
• Average speed? (30+90)/2 = 60 mph
• WRONG! Average speed = total distance / total time
  = 20/(10/30 + 10/90) = 45 mph
Harmonic Mean
• Each program has O operations
• n programs execute nO operations in $\sum_i T_i$ total time
• The execution rate is then

  $\frac{nO}{\sum_i T_i} = \frac{n}{\sum_i (T_i/O)} = \frac{n}{\sum_i (1/P_i)}$

  where $P_i = O/T_i$ is the execution rate of program i
• The harmonic mean should be used when measuring performance in execution rates (IPC)
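A quick sketch verifying the driving example above with the harmonic-mean formula (equal work per "program", here 10 miles per leg):

```python
rates = [30.0, 90.0]                              # mph over two equal legs

hmean = len(rates) / sum(1.0 / r for r in rates)  # n / sum(1/P_i)
print(hmean)                                      # 45.0, = total distance / total time
```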
Amdahl’s Law (Law of Diminishing Returns)

• Very intuitive: make the common case fast
  $\text{Speedup} = \frac{\text{Execution time for task without enhancement}}{\text{Execution time for task using enhancement}}$

  $\text{Execution time}_{new} = \text{Execution time}_{old} \times \left((1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}\right)$

  $\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$
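A minimal Python sketch of this formula (the function name is ours, not from the slides); it reproduces the worked examples that follow:

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when fraction_enhanced of execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

print(amdahl_speedup(0.95, 1.1))   # ~1.094: big fraction, small enhancement
print(amdahl_speedup(0.05, 10.0))  # ~1.047: small fraction, big enhancement
```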
Amdahl’s Law Corollary
  $\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$

As $\text{Speedup}_{enhanced} \to \infty$:

  $\text{Speedup}_{overall} \to \frac{1}{1 - \text{Fraction}_{enhanced}}$
Amdahl’s Law Example

• Fraction_enhanced = 95%, Speedup_enhanced = 1.1x:
  Speedup_overall = 1/((1-0.95) + (0.95/1.1)) = 1.094
• Fraction_enhanced = 5%, Speedup_enhanced = 10x:
  Speedup_overall = 1/((1-0.05) + (0.05/10)) = 1.047
• Fraction_enhanced = 5%, Speedup_enhanced = ∞:
  Speedup_overall = 1/(1-0.05) = 1.053

Make the common case fast!
MIPS
• MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6)
• Problems
  – ISAs are not equivalent, e.g., RISC vs. CISC
    • 1 CISC instruction may equal many RISC instructions!
  – Programs use different instruction mixes
  – May be OK when comparing the same benchmarks, same ISA, same compiler, same OS
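A sketch of the pitfall with invented numbers: a machine that executes more, simpler instructions earns 4x the MIPS rating while being only 2x faster:

```python
def mips_rating(instr_count, cpi, clock_hz):
    exec_time = instr_count * cpi / clock_hz        # seconds
    return instr_count / (exec_time * 1e6), exec_time

cisc_like = mips_rating(1e9, 4.0, 500e6)   # (125.0 MIPS, 8.0 s)
risc_like = mips_rating(2e9, 1.0, 500e6)   # (500.0 MIPS, 4.0 s)
print(cisc_like, risc_like)                # 4x the MIPS, only 2x the speed
```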
MFLOPS
• Same as MIPS, just for FP ops
• Not useful either
  – Requires FP-intensive apps
  – Traditionally FP ops were slow, so INT could be ignored
  – BUT now memory ops can be the slowest!
• “Peak MFLOPS” is a common marketing fallacy
  – Basically it just says #FP-pipes × clock rate
GHz
• Is this a metric? Maybe as good as the others…
• One number, no benchmarks, what can be better?
• Many designs are frequency driven

Processor        Clock Rate  SPEC FP2000
IBM POWER3       450 MHz     434
Intel PIII       1.4 GHz     456
Intel Pentium 4  2.4 GHz     833
Itanium-2        1.0 GHz     1356
Benchmark Suites
• SPEC CPU2000 (int and float) (Desktop, Server)
• EEMBC (“embassy”), SPECjvm (Embedded)
• TPC-C, TPC-H, SPECjbb, ECperf (Server)
SPEC CPU2000: Integer Benchmarks
164.gzip     C    Compression
175.vpr      C    FPGA Circuit Placement and Routing
176.gcc      C    C Programming Language Compiler
181.mcf      C    Combinatorial Optimization
186.crafty   C    Game Playing: Chess
197.parser   C    Word Processing
252.eon      C++  Computer Visualization
253.perlbmk  C    PERL Programming Language
254.gap      C    Group Theory, Interpreter
255.vortex   C    Object-oriented Database
256.bzip2    C    Compression
300.twolf    C    Place and Route Simulator
SPEC CPU2000: Floating Point Benchmarks
168.wupwise   Fortran 77  Physics / Quantum Chromodynamics
171.swim      Fortran 77  Shallow Water Modeling
172.mgrid     Fortran 77  Multi-grid Solver: 3D Potential Field
173.applu     Fortran 77  Parabolic / Elliptic Partial Differential Equations
177.mesa      C           3-D Graphics Library
178.galgel    Fortran 90  Computational Fluid Dynamics
179.art       C           Image Recognition / Neural Networks
183.equake    C           Seismic Wave Propagation Simulation
187.facerec   Fortran 90  Image Processing: Face Recognition
188.ammp      C           Computational Chemistry
189.lucas     Fortran 90  Number Theory / Primality Testing
191.fma3d     Fortran 90  Finite-element Crash Simulation
200.sixtrack  Fortran 77  High Energy Nuclear Physics Accelerator Design
301.apsi      Fortran 77  Meteorology: Pollutant Distribution
Server Benchmarks
• TPC-C (Online Transaction Processing, OLTP)
  – Models a simple order-entry application
  – Thousands of concurrent database accesses
• TPC-H (Ad-hoc, decision support)
  – Data warehouse, backend analysis tools for data

System              # / Processor         tpm    $/tpm
Fujitsu PrimePower  128, 563MHz SPARC64   455K   $28.58
HP SuperDome        64, 875MHz PA8700     423K   $15.64
IBM p690            32, 1300MHz POWER4    403K   $17.80
CPU Performance Equation
• Execution Time = seconds/program

  $\frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}}$

• Who influences each term?
  – instructions/program: Program, Architecture (ISA), Compiler
  – cycles/instruction: Compiler (Scheduling), Organization (uArch), Microarchitects
  – seconds/cycle: Technology, Physical Design, Circuit Designers
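A quick numeric check of the equation, with made-up values for each term:

```python
instructions = 2_000_000_000   # instructions / program
cpi = 1.5                      # cycles / instruction
cycle_time = 1 / 2.4e9         # seconds / cycle (2.4 GHz clock)

exec_time = instructions * cpi * cycle_time
print(f"{exec_time:.3f} s")    # 1.250 s
```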
Common Architecture Trick
• Assume Instructions/Program (path length) is constant
  – Same benchmark, same compiler
  – Usually OK, but for some ideas the compiler may change
• Assume Seconds/Cycle (cycle time) is constant
  – “My tweak won’t impact cycle-time”
  – Often a bad assumption
  – Current designs are ~12-20 FO4 inverter delays per cycle
• Then just focus on Cycles/Instruction (CPI or IPC)
• Many architecture studies do just this!
Instruction Set Architecture
“Instruction Set Architecture is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine.”
IBM, Introducing the IBM 360 (1964)
• The ISA defines:
  – Operations that the processor can execute
  – Data transfer mechanisms + how to access data
  – Control mechanisms (branch, jump, etc.)
  – The “contract” between programmer/compiler + HW
Classifying ISAs
Stack
• Architectures with an implicit “stack”
  – Acts as source(s) and/or destination; TOS (top of stack) is implicit
  – Push and Pop operations have 1 explicit operand
• Example: C = A + B
    Push A   // S[++TOS] = Mem[A]
    Push B   // S[++TOS] = Mem[B]
    Add      // Temp1 = S[TOS--], Temp2 = S[TOS--], S[++TOS] = Temp1 + Temp2
    Pop C    // Mem[C] = S[TOS--]
• x86 FP uses a stack (complicates pipelining)
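A toy Python model of these stack semantics; the memory contents are hypothetical:

```python
mem = {"A": 2, "B": 3, "C": 0}   # hypothetical memory
stack = []

def push(addr): stack.append(mem[addr])                 # S[++TOS] = Mem[addr]
def add():      stack.append(stack.pop() + stack.pop()) # pop two, push sum
def pop(addr):  mem[addr] = stack.pop()                 # Mem[addr] = S[TOS--]

push("A"); push("B"); add(); pop("C")   # C = A + B
print(mem["C"])                          # 5
```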
Accumulator
• Architectures with one implicit register
  – Acts as source and/or destination
  – One other source is explicit
• Example: C = A + B
    Load A    // (Acc)umulator <= A
    Add B     // Acc <= Acc + B
    Store C   // C <= Acc
• Accumulator is implicit; a bottleneck?
• x86 uses accumulator concepts for integer code
Register
• Most common approach
  – Fast, temporary storage (small)
  – Explicit operands (register IDs)
• Example: C = A + B

    Register-memory:    Load/store:
    Load R1, A          Load R1, A
    Add R3, R1, B       Load R2, B
    Store R3, C         Add R3, R1, R2
                        Store R3, C

• All RISC ISAs are load/store
• IBM 360, Intel x86, Moto 68K are register-memory
Common Addressing Modes
Base/Displacement  Load R4, 100(R1)
Register Indirect  Load R4, (R1)
Indexed            Load R4, (R1+R2)
Direct             Load R4, (1001)
Memory Indirect    Load R4, @(R3)
Autoincrement      Load R4, (R2)+
Scaled             Load R4, 100(R2)[R3]
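A sketch of how each mode computes its effective address (EA); register and memory contents, and the 4-byte scale, are hypothetical:

```python
regs = {"R1": 100, "R2": 200, "R3": 500}   # hypothetical register file
mem = {500: 300}                           # hypothetical memory

ea_disp     = 100 + regs["R1"]             # Load R4, 100(R1)  -> EA = 200
ea_indirect = regs["R1"]                   # Load R4, (R1)     -> EA = 100
ea_indexed  = regs["R1"] + regs["R2"]      # Load R4, (R1+R2)  -> EA = 300
ea_direct   = 1001                         # Load R4, (1001)   -> EA = 1001
ea_mem_ind  = mem[regs["R3"]]              # Load R4, @(R3)    -> EA = Mem[500] = 300
ea_autoinc  = regs["R2"]; regs["R2"] += 4  # Load R4, (R2)+    -> EA = 200, then R2 += 4
ea_scaled   = 100 + regs["R2"] + 4 * regs["R3"]  # Load R4, 100(R2)[R3]
```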
What leads to a good/bad ISA?
• Ease of Implementation (job of the architect/designer)
  – Does the ISA lend itself to efficient implementations?
• Ease of Programming (job of the programmer/compiler)
  – Can the compiler use the ISA effectively?
• Future Compatibility
  – ISAs may last 30+ years
  – Special features, address range, etc. need to be thought out
Implementation Concerns
• Simple decoding (fixed length)
• Compactness (variable length)
• Simple instructions (no load/update)
  – These are the things that get microcoded these days
  – Deterministic latencies are key!
  – Instructions with multiple exceptions are difficult
• More/fewer registers?
  – Slower register files and decoding vs. better compilers
• Condition codes/flags (complicate scheduling!)
Programmability
• 1960s, early 70s
  – Code was mostly hand-coded
• Late 70s, early 80s
  – Most code was compiled, but hand-coded was better
• Mid-80s to present
  – Most code is compiled and almost as good as assembly
• Why?
Programmability: 70s, Early 80s
“Closing the Semantic Gap”

• Make the instruction set match high-level languages
• Efforts for computers to execute HLLs directly
  – e.g., the LISP Machine
    • Hardware type checking: special type bits let the type be checked efficiently at run-time
    • Hardware garbage collection
    • Fast function calls
    • Efficient representation of lists
• Never worked out… “Semantic Clash”
  – Too many HLLs? C was more popular?
  – Is this coming back with Java? (Sun’s picoJava)
Programmability: 1980s … 2000s
“In the Compiler We Trust”

• Wulf: Primitives, not Solutions
  – Compilers cannot effectively use complex instructions
  – Synthesize programs from primitives
• Regularity: same behavior in all contexts
  – No odd cases; things should be intuitive
• Orthogonality:
  – Data type independent of addressing mode
  – Addressing mode independent of operation performed
ISA Compatibility

“In Computer Architecture, no good idea ever goes unpunished.”
  – Marty Hopkins, IBM Fellow

• Never abandon an existing code base
• Extremely difficult to introduce a new ISA
  – Alpha failed, IA64 is struggling; the best solution may not win
• x86 is the most popular and the least liked!
• Hard to think ahead, but…
  – An ISA tweak may buy 5-10% today
  – 10 years later it may buy nothing, but must still be implemented
  – Examples: register windows, delayed branches
CISC vs. RISC
• The debate raged from the early 80s through the 90s
• Now it is fairly irrelevant
• Despite this, Intel (x86 => Itanium) and DEC/Compaq (VAX => Alpha) have tried to switch
• Research in the late 70s/early 80s led to RISC
  – IBM 801 – John Cocke – mid 70s
  – Berkeley RISC-1 (Patterson)
  – Stanford MIPS (Hennessy)
VAX
• 32-bit ISA; instructions could be huge (up to 321 bytes); 16 GPRs
• Operated on data types from 8 to 128 bits, plus decimals and strings
• Orthogonal, memory-to-memory; all operand modes supported
• Hundreds of special instructions
• Simple compiler; hand-coding was common
• CPI was over 10!
x86
• Variable-length ISA (1-16 bytes)
• FP operand stack
• 2-operand instructions (extended accumulator)
  – Register-register and register-memory support
• Scaled addressing modes
• Has been extended many times (as AMD did with x86-64)
• Intel initially went to IA64, now has backtracked (EM64T)
RISC vs. CISC Arguments
• RISC
  – Simple implementation
    • Load/store, fixed-format 32-bit instructions, efficient pipelines
  – Lower CPI
  – Compilers do a lot of the hard work
    • MIPS = Microprocessor without Interlocked Pipelined Stages
• CISC
  – Simple compilers (assists hand-coding; many addressing modes, many instructions)
  – Code density
MIPS/VAX Comparison
After the dust settled
• Turns out it doesn’t matter much
• Can decode CISC instructions into an internal “micro-ISA”
  – This takes a couple of extra cycles (PLA implementation) and a few hundred thousand transistors
  – In 20-stage-pipeline, 55M-transistor processors this is minimal
  – The Pentium 4 caches these micro-ops
• Actually may have some advantages
  – External ISA for compatibility; internal ISA can be tweaked each generation (Transmeta)