
    CS 465 Computer Architecture, Fall 2009

    Lecture 01: Introduction

    Daniel Barbará (cs.gmu.edu/~dbarbara)

    [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005, UCB]


    Course Administration

    Instructor: Daniel Barbará

    Email: [email protected]; Office: Eng. Bldg.

    Text (required): Computer Organization & Design: The Hardware/Software Interface, Patterson & Hennessy, 4th Edition


    Grading Information

    Grade determinants

    Midterm Exam ~25%

    Final Exam ~35%

    Homeworks ~40%

    - Due at the beginning of class (or, if it's code to be submitted electronically, by 17:00 on the due date). No late assignments will be accepted.

    Course prerequisites

    grade of C or better in CS 367


    Acknowledgements

    Slides adapted from Dr. Zhong

    Contributions from Dr. Setia

    Slides also adopt materials from many other universities

    IMPORTANT:

    - Slides are not intended as replacement for the text

    - You spent the money on the book, please read it!


    Course Topics (Tentative)

    Instruction set architecture (Chapter 2)

    MIPS

    Arithmetic operations & data (Chapter 3)

    System performance (Chapter 4)

    Processor (Chapter 5)

    Datapath and control

    Pipelining to improve performance (Chapter 6)

    Memory hierarchy (Chapter 7)

    I/O (Chapter 8)


    Focus of the Course

    How computers work

    MIPS instruction set architecture

    The implementation of the MIPS instruction set architecture: MIPS processor design

    Issues affecting modern processors

    Pipelining for processor performance improvement; cache memory system, I/O systems


    Why Learn Computer Architecture?

    You want to call yourself a computer scientist

    Computer architecture impacts every other aspect of computer science

    You need to make a purchasing decision or offer expert advice

    You want to build software people use and sell many, many copies (need performance)

    Both hardware and software affect performance

    - Algorithm determines number of source-level statements

    - Language/compiler/architecture determine machine instructions (Chapters 2 and 3)

    - Processor/memory determine how fast instructions are executed (Chapters 5, 6, and 7)

    - Assessing and understanding performance (Chapter 4)


    Outline Today

    Course logistics

    Computer architectures overview

    Trends in computer architectures


    Computer Systems

    Software

    Application software: word processors, email, Internet browsers, games

    Systems software: compilers, operating systems

    Hardware

    CPU, memory

    I/O devices (mouse, keyboard, display, disks, networks,..)


    [Figure: layers of software around the hardware]

    Applications software: e.g., LaTeX

    Systems software: compilers (gcc), assemblers (as), operating systems (virtual memory, file system, I/O device drivers)


    Instruction Set Architecture

    [Figure: the instruction set is the interface between software and hardware]

    One of the most important abstractions is ISA

    A critical interface between HW and SW

    Example: MIPS

    Desired properties:

    Convenience (from the software side)

    Efficiency (from the hardware side)


    What is Computer Architecture?

    Programmer's view: a pleasant environment

    Operating system's view: a set of resources (HW & SW)

    System architecture view: a set of components

    Compiler's view: an instruction set architecture with OS help

    Microprocessor architecture view: a set of functional units

    VLSI designer's view: a set of transistors implementing logic

    Mechanical engineer's view: a heater!


    What is Computer Architecture

    Patterson & Hennessy: Computer architecture = Instruction set architecture + Machine organization + Hardware

    For this course, computer architecture mainly refers to ISA (Instruction Set Architecture)

    Programmer-visible, serves as the boundary between the software and hardware

    Modern ISA examples: MIPS, SPARC, PowerPC, DEC Alpha


    Organization and Hardware

    Organization: high-level aspects of a computer's design

    Principal components: memory, CPU, I/O, ...

    How components are interconnected

    How information flows between components

    E.g. AMD Opteron 64 and Intel Pentium 4: same ISA

    but different organizations

    Hardware: detailed logic design and the packaging technology of a computer

    E.g. Pentium 4 and Mobile Pentium 4: nearly identical organizations but different hardware details


    Types of computers and their applications

    Desktop

    Run third-party software

    Office to home applications

    30 years old

    Servers

    Modern version of what used to be called mainframes, minicomputers, and supercomputers

    Large workloads

    Built using the same technology as desktops but with higher capacity

    - Expandable

    - Scalable

    - Reliable

    Large spectrum: from low-end (file storage, small businesses) to supercomputers (high-end scientific and engineering applications)

    - Gigabytes to Terabytes to Petabytes of storage

    Examples: file servers, web servers, database servers


    Where is the Market?

    [Chart: millions of computers sold per year, by segment]

        Segment    1998   1999   2000   2001   2002
        Embedded    290    488    892    862   1122
        Desktop      93    114    135    129    131
        Servers       3      3      4      4      5


    In this class you will learn

    How programs written in a high-level language (e.g., Java) translate into the language of the hardware and how the hardware executes them.

    The interface between software and hardware and how software instructs hardware to perform the needed functions.

    The factors that determine the performance of a program

    The techniques that hardware designers employ to improve performance.

    As a consequence, you will understand what features may make one computer design better than another for a particular application.


    High-level to Machine Language

    High-level language program (in C)

        | Compiler
        v

    Assembly language program (for MIPS)

        | Assembler
        v

    Binary machine language program (for MIPS)


    Evolution

    In the beginning there were only bits, and people spent countless hours trying to program in machine language

    01100011001 011001110100

    Finally, before everybody went insane, the assembler was invented: write in mnemonics called assembly language and let the assembler translate (a one-to-one translation)

    Add A,B

    This wasn't for everybody, obviously (imagine how modern applications would have been possible in assembly), so high-level languages were born (and with them compilers to translate to assembly, a many-to-one translation)

    C = A * (SQRT(B) + 3.0)


    THE BIG IDEA

    Levels of abstraction: each layer provides its own (simplified) view and hides the details of the next.


    Instruction Set Architecture (ISA)

    ISA: an abstract interface between the hardware and the lowest level software of a machine that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on.

    "... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." (Amdahl, Blaauw, and Brooks, 1964)

    Enables implementations of varying cost and performance to run identical software

    ABI (application binary interface): the user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.


    ISA Type Sales

    [Chart: millions of processors sold per year, 1998-2002, broken down by ISA: ARM, IA-32, MIPS, Motorola 68K, PowerPC, Hitachi SH, SPARC, Other]

    PowerPoint comic bar chart with approximate values (see text for correct values)


    Organization of a computer


    PC Motherboard Closeup


    Inside the Pentium 4


    Moore's Law

    In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).

    Amazingly visionary: the million transistor/chip barrier was crossed in the 1980s.

    2300 transistors, 1 MHz clock (Intel 4004) - 1971

    16 Million transistors (Ultra Sparc III)

    42 Million transistors, 2 GHz clock (Intel Xeon) - 2001

    55 Million transistors, 3 GHz, 130nm technology, 250 mm^2 die (Intel Pentium 4) - 2004

    140 Million transistors (HP PA-8500)


    Processor Performance Increase

    [Chart: SPECint performance vs. year, 1987-2003, log scale from 1 to 10,000; data points include SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 21264A/667, Intel Xeon/2000, Intel Pentium 4/3000]


    Trend: Microprocessor Capacity (Moore's Law)

    [Chart: transistors per chip vs. year, 1970-2000, log scale; data points include i4004, i8080, i8086, i80286, i80386, i80486, Pentium]

    CMOS improvements: die size 2X every 3 yrs; line width halves every 7 yrs

    Itanium II: 241 million; Pentium 4: 55 million; Alpha 21264: 15 million; Pentium Pro: 5.5 million; PowerPC 620: 6.9 million; Alpha 21164: 9.3 million; Sparc Ultra: 5.2 million


    Moore's Law

    Cramming More Components onto Integrated Circuits

    Gordon Moore, Electronics, 1965

    # of transistors per cost-effective integrated circuit doubles every 18 months

    Transistor capacity doubles every 18-24 months

    Speed 2x / 1.5 years (since 85);

    100X performance in last decade



    Trend: Microprocessor Performance


    Memory

    Dynamic Random Access Memory (DRAM)

    The choice for main memory

    Volatile (contents go away when power is lost)

    Fast

    Relatively small

    DRAM capacity: 2x / 2 years (since '96); 64x size improvement in last decade

    Static Random Access Memory (SRAM)

    The choice for cache

    Much faster than DRAM, but less dense and more costly

    Magnetic disks

    The choice for secondary memory

    Non-volatile

    Slower

    Relatively large

    Capacity: 2x / 1 year (since '97); 250X size increase in last decade

    Solid state (Flash) memory

    The choice for embedded computers

    Non-volatile


    Memory

    Optical disks

    Removable, therefore very large

    Slower than disks

    Magnetic tape

    Even slower

    Sequential (non-random) access

    The choice for archival


    DRAM Capacity Growth

    [Chart: Kbit capacity per DRAM chip vs. year of introduction, 1976-2002, log scale; generations 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M]


    Trend: Memory Capacity

    Growth of DRAM capacity per chip:

        Year    Size (Mbit)
        1980    0.0625
        1983    0.25
        1986    1
        1989    4
        1992    16
        1996    64
        1998    128
        2000    256
        2002    512
        2006    2048

    Now 1.4X/yr, or 2X every 2 years; more than 10000X since 1980!


    Dramatic Technology Change

    (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta = successive factors of 1024)

    Come up with a clever mnemonic, fame!

    State-of-the-art PC when you graduate (at least):

    Processor clock speed: 5000 MegaHertz (5.0 GigaHertz)

    Memory capacity: 4000 MegaBytes (4.0 GigaBytes)

    Disk capacity: 2000 GigaBytes (2.0 TeraBytes)

    New units! Mega => Giga, Giga => Tera


    Example Machine Organization

    Workstation design target

    25% of cost on processor

    25% of cost on memory (minimum memory size)

    Rest on I/O devices, power supplies, box

    [Figure: a computer consists of a CPU (control + datapath), memory, and input/output devices]


    MIPS R3000 Instruction Set Architecture

    Instruction Categories

    Load/Store

    Computational

    Jump and Branch

    Floating Point

    - coprocessor

    Memory Management

    Special

    Registers: R0 - R31, PC, HI, LO

    3 Instruction Formats, all 32 bits wide:

    OP | rs | rt | rd | sa | funct

    OP | rs | rt | immediate

    OP | jump target


    Defining Performance

    Which airplane has the best performance?

    [Charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 on four metrics: passenger capacity, cruising range (miles), cruising speed (mph), and passengers x mph]


    Response Time and Throughput

    Response time

    How long it takes to do a task

    Throughput

    Total work done per unit time

    - e.g., tasks/transactions per hour

    How are response time and throughput affected by

    Replacing the processor with a faster version?

    Adding more processors?

    We'll focus on response time for now


    Relative Performance

    Define Performance = 1 / Execution Time

    "X is n times faster than Y" means:

        Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n

    Example: time taken to run a program

    10s on A, 15s on B

    Execution Time_B / Execution Time_A = 15s / 10s = 1.5

    So A is 1.5 times faster than B
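
    A minimal C sketch of this speedup calculation (the 10s/15s figures are just the slide's example):

        #include <stdio.h>

        int main(void) {
            /* Example from the slide: the same program takes 10 s on A, 15 s on B */
            double time_a = 10.0, time_b = 15.0;

            /* Performance = 1 / Execution Time, so speedup of A over B = time_b / time_a */
            printf("A is %.1f times faster than B\n", time_b / time_a);   /* 1.5 */
            return 0;
        }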


    Measuring Execution Time

    Elapsed time

    Total response time, including all aspects

    - Processing, I/O, OS overhead, idle time

    Determines system performance

    CPU time

    Time spent processing a given job

    - Discounts I/O time, other jobs' shares

    Comprises user CPU time and system CPU time

    Different programs are affected differently by CPU and system performance


    CPU Clocking

    Operation of digital hardware governed by a constant-rate clock

    [Figure: clock signal; data transfer and computation happen within a clock period, then state is updated]

    Clock period: duration of a clock cycle

    e.g., 250ps = 0.25ns = 250 × 10^-12 s

    Clock frequency (rate): cycles per second

    e.g., 4.0GHz = 4000MHz = 4.0 × 10^9 Hz
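
    A small C sketch of the reciprocal relationship between period and rate, using the numbers above:

        #include <stdio.h>

        int main(void) {
            double period_s = 250e-12;            /* 250 ps clock period */
            double rate_hz  = 1.0 / period_s;     /* = 4.0e9 Hz = 4.0 GHz */
            printf("250 ps period -> %.1f GHz\n", rate_hz / 1e9);

            double rate2_hz   = 4.0e9;            /* 4.0 GHz clock rate */
            double period2_ps = 1e12 / rate2_hz;  /* = 250 ps */
            printf("4.0 GHz rate -> %.0f ps period\n", period2_ps);
            return 0;
        }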


    CPU Time

    Performance improved by

    Reducing number of clock cycles

    Increasing clock rate

    Hardware designer must often trade off clock rate against cycle count

    CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate


    CPU Time Example

    Computer A: 2GHz clock, 10s CPU time

    Designing Computer B

    Aim for 6s CPU time

    Can use a faster clock, but it causes 1.2× as many clock cycles

    How fast must Computer B clock be?

    Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20 × 10^9

    Clock Cycles_B = 1.2 × Clock Cycles_A = 1.2 × 20 × 10^9 = 24 × 10^9

    Clock Rate_B = Clock Cycles_B / CPU Time_B = 24 × 10^9 / 6s = 4GHz
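
    The same arithmetic as a C sketch, with the values taken straight from the example above:

        #include <stdio.h>

        int main(void) {
            /* Computer A: 2 GHz clock, 10 s of CPU time */
            double rate_a   = 2e9, time_a = 10.0;
            double cycles_a = time_a * rate_a;     /* 20e9 cycles */

            /* Computer B needs 1.2x the cycles but must finish in 6 s */
            double cycles_b = 1.2 * cycles_a;      /* 24e9 cycles */
            double rate_b   = cycles_b / 6.0;      /* required clock rate */

            printf("Computer B needs a %.1f GHz clock\n", rate_b / 1e9);   /* 4.0 */
            return 0;
        }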


    Instruction Count and CPI

    Instruction Count for a program

    Determined by program, ISA and compiler

    Average cycles per instruction

    Determined by CPU hardware

    If different instructions have different CPI

    - Average CPI affected by instruction mix

    Clock Cycles = Instruction Count × Cycles per Instruction

    CPU Time = Instruction Count × CPI × Clock Cycle Time = Instruction Count × CPI / Clock Rate
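
    A hedged C sketch of this relationship; the instruction count, CPI and clock rate below are made-up illustrative numbers, not measurements:

        #include <stdio.h>

        /* CPU Time = Instruction Count x CPI / Clock Rate */
        static double cpu_time(double insn_count, double cpi, double clock_rate_hz) {
            return insn_count * cpi / clock_rate_hz;
        }

        int main(void) {
            /* Hypothetical program: 1e9 instructions, CPI 2.0, on a 2 GHz clock */
            printf("CPU time = %.2f s\n", cpu_time(1e9, 2.0, 2e9));   /* 1.00 s */
            return 0;
        }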


    CPI Example

    Computer A: Cycle Time = 250ps, CPI = 2.0

    Computer B: Cycle Time = 500ps, CPI = 1.2

    Same ISA

    Which is faster, and by how much?

    CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250ps = 500ps × I   <- A is faster

    CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500ps = 600ps × I

    CPU Time_B / CPU Time_A = (600ps × I) / (500ps × I) = 1.2   <- ... by this much
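
    The comparison as a C sketch; any common instruction count I works, since it cancels in the ratio:

        #include <stdio.h>

        int main(void) {
            double I      = 1e9;                  /* common instruction count (arbitrary) */
            double time_a = I * 2.0 * 250e-12;    /* CPI 2.0, 250 ps cycle time */
            double time_b = I * 1.2 * 500e-12;    /* CPI 1.2, 500 ps cycle time */

            printf("CPU Time B / CPU Time A = %.1f, so A is 1.2x faster\n",
                   time_b / time_a);
            return 0;
        }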


    CPI in More Detail

    If different instruction classes take different numbers of cycles:

        Clock Cycles = Σ (i = 1 to n) CPI_i × Instruction Count_i

    Weighted average CPI:

        CPI = Clock Cycles / Instruction Count = Σ (i = 1 to n) CPI_i × (Instruction Count_i / Instruction Count)

    where Instruction Count_i / Instruction Count is the relative frequency of class i


    CPI Example

    Alternative compiled code sequences using instructions in classes A, B, C

        Class              A   B   C
        CPI for class      1   2   3
        IC in sequence 1   2   1   2
        IC in sequence 2   4   1   1

    Sequence 1: IC = 5

    Clock Cycles = 2×1 + 1×2 + 2×3 = 10

    Avg. CPI = 10/5 = 2.0

    Sequence 2: IC = 6

    Clock Cycles = 4×1 + 1×2 + 1×3 = 9

    Avg. CPI = 9/6 = 1.5
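
    A small C sketch that reproduces both sequences' weighted CPIs from the table above:

        #include <stdio.h>

        int main(void) {
            double cpi[3]  = {1, 2, 3};    /* CPI of classes A, B, C */
            int    seq1[3] = {2, 1, 2};    /* instruction counts in sequence 1 */
            int    seq2[3] = {4, 1, 1};    /* instruction counts in sequence 2 */
            int   *seqs[2] = {seq1, seq2};

            for (int s = 0; s < 2; s++) {
                double cycles = 0.0;
                int    ic     = 0;
                for (int i = 0; i < 3; i++) {
                    cycles += cpi[i] * seqs[s][i];
                    ic     += seqs[s][i];
                }
                printf("Sequence %d: IC = %d, cycles = %.0f, avg CPI = %.1f\n",
                       s + 1, ic, cycles, cycles / ic);
            }
            return 0;
        }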


    Performance Summary

    Performance depends on

    Algorithm: affects IC, possibly CPI

    Programming language: affects IC, CPI

    Compiler: affects IC, CPI

    Instruction set architecture: affects IC, CPI, Tc

    The BIG Picture

    CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)


    Power Trends

    In CMOS IC technology


    Power = Capacitive load × Voltage^2 × Frequency

    (over time, roughly: frequency ×1000, voltage 5V → 1V, power ×30)


    Reducing Power

    Suppose a new CPU has

    85% of capacitive load of old CPU

    15% voltage and 15% frequency reduction

    P_new / P_old = (C_old × 0.85) × (V_old × 0.85)^2 × (F_old × 0.85) / (C_old × V_old^2 × F_old) = 0.85^4 = 0.52
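
    The ratio as a tiny C sketch (all three factors scaled by 0.85):

        #include <stdio.h>

        int main(void) {
            /* Dynamic power ~ capacitive load x voltage^2 x frequency;
               scale load, voltage and frequency each by 0.85 */
            double ratio = 0.85 * (0.85 * 0.85) * 0.85;   /* = 0.85^4 */
            printf("P_new / P_old = %.2f\n", ratio);      /* ~0.52 */
            return 0;
        }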

    The power wall

    We can't reduce voltage further

    We can't remove more heat

    How else can we improve performance?


    Uniprocessor Performance

    The sea change: the switch to multiprocessors

    Constrained by power, instruction-level parallelism,

    memory latency


    Multiprocessors

    Multicore microprocessors

    More than one processor per chip

    Requires explicitly parallel programming

    Compare with instruction level parallelism

    - Hardware executes multiple instructions at once

    - Hidden from the programmer

    Hard to do

    - Programming for performance

    - Load balancing

    - Optimizing communication and synchronization


    SPEC CPU Benchmark

    Programs used to measure performance

    Supposedly typical of actual workload

    Standard Performance Evaluation Corp (SPEC)

    Develops benchmarks for CPU, I/O, Web,

    SPEC CPU2006

    Elapsed time to execute a selection of programs

    - Negligible I/O, so focuses on CPU performance

    Normalize relative to reference machine

    Summarize as geometric mean of performance ratios

    - CINT2006 (integer) and CFP2006 (floating-point)

        Summary measure = n-th root of ( Π (i = 1 to n) Execution time ratio_i )
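
    As a check, a short C sketch computing the geometric mean of the twelve SPECratios listed in the CINT2006 table on the next slide; it reproduces the table's 11.7:

        #include <stdio.h>
        #include <math.h>

        int main(void) {
            /* SPECratios from the CINT2006 for Opteron X4 2356 table */
            double r[12] = {15.3, 11.8, 11.1, 6.8, 14.6, 10.5,
                            14.5, 19.8, 22.3, 9.1, 9.1, 6.0};
            double log_sum = 0.0;
            for (int i = 0; i < 12; i++)
                log_sum += log(r[i]);                            /* sum of logs */
            printf("geometric mean = %.1f\n", exp(log_sum / 12));   /* ~11.7 */
            return 0;
        }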


    CINT2006 for Opteron X4 2356

    Name         Description                     IC × 10^9   CPI     Tc (ns)   Exec time   Ref time   SPECratio

    perl         Interpreted string processing       2,118    0.75     0.40        637        9,777       15.3
    bzip2        Block-sorting compression           2,389    0.85     0.40        817        9,650       11.8
    gcc          GNU C Compiler                      1,050    1.72     0.47        724        8,050       11.1
    mcf          Combinatorial optimization            336   10.00     0.40      1,345        9,120        6.8
    go           Go game (AI)                        1,658    1.09     0.40        721       10,490       14.6
    hmmer        Search gene sequence                2,783    0.80     0.40        890        9,330       10.5
    sjeng        Chess game (AI)                     2,176    0.96     0.48        837       12,100       14.5
    libquantum   Quantum computer simulation         1,623    1.61     0.40      1,047       20,720       19.8
    h264avc      Video compression                   3,102    0.80     0.40        993       22,130       22.3
    omnetpp      Discrete event simulation             587    2.94     0.40        690        6,250        9.1
    astar        Games/path finding                  1,082    1.79     0.40        773        7,020        9.1
    xalancbmk    XML parsing                         1,058    2.70     0.40      1,143        6,900        6.0

    Geometric mean                                                                                        11.7

    High cache miss rates


    SPEC Power Benchmark

    Power consumption of server at different workload levels

    Performance: ssj_ops/sec

    Power: Watts (Joules/sec)

        Overall ssj_ops per Watt = ( Σ (i = 0 to 10) ssj_ops_i ) / ( Σ (i = 0 to 10) power_i )


    SPECpower_ssj2008 for X4

    Target Load %   Performance (ssj_ops/sec)   Average Power (Watts)

    100%                    231,867                     295
    90%                     211,282                     286
    80%                     185,803                     275
    70%                     163,427                     265
    60%                     140,160                     256
    50%                     118,324                     246
    40%                      92,035                     233
    30%                      70,500                     222
    20%                      47,126                     206
    10%                      23,066                     180
    0%                            0                     141

    Overall sum           1,283,590                   2,605

    ssj_ops / power             493
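
    A short C sketch that recomputes the overall metric from the table above:

        #include <stdio.h>

        int main(void) {
            /* ssj_ops and average power at the 11 target load levels (100% .. 0%) */
            double ops[11]   = {231867, 211282, 185803, 163427, 140160, 118324,
                                92035, 70500, 47126, 23066, 0};
            double watts[11] = {295, 286, 275, 265, 256, 246, 233, 222, 206, 180, 141};

            double ops_sum = 0.0, power_sum = 0.0;
            for (int i = 0; i < 11; i++) {
                ops_sum   += ops[i];
                power_sum += watts[i];
            }
            /* Overall ssj_ops per Watt = sum of ssj_ops / sum of power */
            printf("overall ssj_ops per Watt = %.0f\n", ops_sum / power_sum);   /* ~493 */
            return 0;
        }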


    Pitfall: Amdahl's Law

    Improving an aspect of a computer and expecting a proportional improvement in overall performance


    T_improved = T_affected / improvement factor + T_unaffected

    Example: multiply accounts for 80s out of a 100s total

    How much improvement in multiply performance to get 5× overall?

        20 = 80/n + 20   ->   Can't be done!

    Corollary: make the common case fast
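
    A small C sketch of Amdahl's Law applied to the example: even huge speedups of the multiply leave the 20s that is unaffected, so the 20s target is unreachable:

        #include <stdio.h>

        /* T_improved = T_affected / improvement_factor + T_unaffected */
        static double improved_time(double t_affected, double t_unaffected, double factor) {
            return t_affected / factor + t_unaffected;
        }

        int main(void) {
            /* Multiply takes 80 s of a 100 s program; the target is 20 s (5x overall) */
            for (double n = 10.0; n <= 1e6; n *= 100.0)
                printf("multiply sped up %8.0fx -> total %.4f s\n",
                       n, improved_time(80.0, 20.0, n));
            return 0;
        }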


    Fallacy: Low Power at Idle

    Look back at X4 power benchmark

    At 100% load: 295W

    At 50% load: 246W (83%)

    At 10% load: 180W (61%)

    Google data center

    Mostly operates at 10% to 50% load

    At 100% load less than 1% of the time

    Consider designing processors to make power proportional to load


    Pitfall: MIPS as a Performance Metric

    MIPS: Millions of Instructions Per Second

    Doesn't account for

    - Differences in ISAs between computers

    - Differences in complexity between instructions

    MIPS = Instruction count / (Execution time × 10^6)

         = Instruction count / ((Instruction count × CPI / Clock rate) × 10^6)

         = Clock rate / (CPI × 10^6)

    CPI varies between programs on a given CPU
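
    A small C sketch of why MIPS is misleading: a hypothetical 4 GHz machine gets very different MIPS ratings for two programs with different CPIs, even though the hardware is unchanged (the CPIs here are made up for illustration):

        #include <stdio.h>

        /* MIPS = clock rate / (CPI x 10^6) */
        static double mips(double clock_rate_hz, double cpi) {
            return clock_rate_hz / (cpi * 1e6);
        }

        int main(void) {
            printf("CPI 1.0 -> %.0f MIPS\n", mips(4e9, 1.0));   /* 4000 MIPS */
            printf("CPI 2.5 -> %.0f MIPS\n", mips(4e9, 2.5));   /* 1600 MIPS */
            return 0;
        }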


    Concluding Remarks

    Cost/performance is improving

    Due to underlying technology development

    Hierarchical layers of abstraction

    In both hardware and software

    Instruction set architecture

    The hardware/software interface

    Execution time: the best performance measure

    Power is a limiting factor

    Use parallelism to improve performance
