
    CS 465 Computer Architecture, Fall 2009

    Lecture 01: Introduction

    Daniel Barbará (cs.gmu.edu/~dbarbara)

    [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005, UCB]


    Course Administration

    Instructor: Daniel Barbará

    Email: [email protected]; Office: Eng. Bldg.

    Text (required): Computer Organization & Design: The Hardware/Software Interface, Patterson & Hennessy, 4th Edition


    Grading Information

    Grade determinants

    Midterm Exam ~25%

    Final Exam ~35%

    Homeworks ~40%

    - Due at the beginning of class (or, if it's code to be submitted electronically, by 17:00 on the due date). No late assignments will be accepted.

    Course prerequisites

    grade of C or better in CS 367


    Acknowledgements

    Slides adapted from Dr. Zhong

    Contributions from Dr. Setia

    Slides also adopt materials from many other universities

    IMPORTANT:

    - Slides are not intended as replacement for the text

    - You spent the money on the book, please read it!


    Course Topics (Tentative)

    Instruction set architecture (Chapter 2)

    MIPS

    Arithmetic operations & data (Chapter 3)

    System performance (Chapter 4)

    Processor (Chapter 5)

    Datapath and control

    Pipelining to improve performance (Chapter 6)

    Memory hierarchy (Chapter 7)

    I/O (Chapter 8)


    Focus of the Course

    How computers work

    MIPS instruction set architecture

    The implementation of the MIPS instruction set architecture: MIPS processor design

    Issues affecting modern processors

    Pipelining for processor performance improvement; cache memory system, I/O systems


    Why Learn Computer Architecture?

    You want to call yourself a computer scientist

    Computer architecture impacts every other aspect of computer science

    You need to make a purchasing decision or offer expert advice

    You want to build software people use and sell many, many copies (need performance)

    Both hardware and software affect performance

    - Algorithm determines number of source-level statements

    - Language/compiler/architecture determine machine instructions (Chapters 2 and 3)

    - Processor/memory determine how fast instructions are executed (Chapters 5, 6, and 7)

    - Assessing and understanding performance (Chapter 4)


    Outline Today

    Course logistics

    Computer architectures overview

    Trends in computer architectures


    Computer Systems

    Software

    Application software: word processors, email, Internet browsers, games

    Systems software: compilers, operating systems

    Hardware

    CPU, memory

    I/O devices (mouse, keyboard, display, disks, networks,..)


    [Figure: layers of software around the hardware]

    Applications software: e.g., LaTeX

    Systems software: compilers (gcc), assemblers (as), operating systems (virtual memory, file system, I/O device drivers)


    Instruction Set Architecture

    [Figure: the instruction set is the interface between software and hardware]

    One of the most important abstractions is ISA

    A critical interface between HW and SW

    Example: MIPS

    Desired properties:

    Convenience (from the software side)

    Efficiency (from the hardware side)


    What is Computer Architecture?

    Programmer's view: a pleasant environment

    Operating system's view: a set of resources (HW & SW)

    System architecture view: a set of components

    Compiler's view: an instruction set architecture with OS help

    Microprocessor architecture view: a set of functional units

    VLSI designer's view: a set of transistors implementing logic

    Mechanical engineer's view: a heater!


    What is Computer Architecture

    Patterson & Hennessy: Computer architecture = Instruction set architecture + Machine organization + Hardware

    For this course, computer architecture mainly refers to ISA (Instruction Set Architecture)

    Programmer-visible, serves as the boundary between the software and hardware

    Modern ISA examples: MIPS, SPARC, PowerPC, DEC Alpha


    Organization and Hardware

    Organization: high-level aspects of a computer's design

    Principal components: memory, CPU, I/O, ...

    How components are interconnected

    How information flows between components

    E.g. AMD Opteron 64 and Intel Pentium 4: same ISA

    but different organizations

    Hardware: detailed logic design and the packaging technology of a computer

    E.g. Pentium 4 and Mobile Pentium 4: nearly identical organizations but different hardware details


    Types of computers and their applications

    Desktop

    Run third-party software

    Office to home applications

    30 years old

    Servers

    Modern version of what used to be called mainframes, minicomputers, and supercomputers

    Large workloads

    Built using the same technology as desktops but with higher capacity

    - Expandable

    - Scalable

    - Reliable

    Large spectrum: from low-end (file storage, small businesses) to supercomputers (high-end scientific and engineering applications)

    - Gigabytes to Terabytes to Petabytes of storage

    Examples: file servers, web servers, database servers


    Where is the Market?

    [Chart: millions of computers sold per year, by segment]

        Segment    1998   1999   2000   2001   2002
        Embedded    290    488    892    862   1122
        Desktop      93    114    135    129    131
        Servers       3      3      4      4      5


    In this class you will learn

    How programs written in a high-level language (e.g., Java) translate into the language of the hardware and how the hardware executes them.

    The interface between software and hardware and how software instructs hardware to perform the needed functions.

    The factors that determine the performance of a program

    The techniques that hardware designers employ to improve performance.

    As a consequence, you will understand what features may make one computer design better than another for a particular application.


    High-level to Machine Language

    High-level language program (in C)

        | Compiler
        v

    Assembly language program (for MIPS)

        | Assembler
        v

    Binary machine language program (for MIPS)


    Evolution

    In the beginning there were only bits, and people spent countless hours trying to program in machine language

    01100011001 011001110100

    Finally, before everybody went insane, the assembler was invented: write in mnemonics called assembly language and let the assembler translate (a one-to-one translation)

    Add A,B

    This wasn't for everybody, obviously (imagine how modern applications would have been possible in assembly), so high-level languages were born (and with them compilers to translate to assembly, a many-to-one translation)

    C = A * (SQRT(B) + 3.0)


    THE BIG IDEA

    Levels of abstraction: each layer provides its own (simplified) view and hides the details of the next.


    Instruction Set Architecture (ISA)

    ISA: an abstract interface between the hardware and the lowest level software of a machine that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on.

    "... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." (Amdahl, Blaauw, and Brooks, 1964)

    Enables implementations of varying cost and performance to run identical software

    ABI (application binary interface): the user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.


    ISA Type Sales

    [Chart: millions of processors sold per year, 1998-2002, broken down by ISA: ARM, IA-32, MIPS, Motorola 68K, PowerPC, Hitachi SH, SPARC, Other]

    PowerPoint comic bar chart with approximate values (see text for correct values)


    Organization of a computer


    PC Motherboard Closeup


    Inside the Pentium 4


    Moore's Law

    In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).

    Amazingly visionary: the million transistor/chip barrier was crossed in the 1980s.

    2300 transistors, 1 MHz clock (Intel 4004) - 1971

    16 Million transistors (Ultra Sparc III)

    42 Million transistors, 2 GHz clock (Intel Xeon) - 2001

    55 Million transistors, 3 GHz, 130nm technology, 250 mm^2 die (Intel Pentium 4) - 2004

    140 Million transistors (HP PA-8500)


    Processor Performance Increase

    [Chart: SPECint performance vs. year, 1987-2003, log scale from 1 to 10,000; data points include SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 21264A/667, Intel Xeon/2000, Intel Pentium 4/3000]


    Trend: Microprocessor Capacity (Moore's Law)

    [Chart: transistors per chip vs. year, 1970-2000, log scale; data points include i4004, i8080, i8086, i80286, i80386, i80486, Pentium]

    CMOS improvements: die size 2X every 3 yrs; line width halves every 7 yrs

    Itanium II: 241 million; Pentium 4: 55 million; Alpha 21264: 15 million; Pentium Pro: 5.5 million; PowerPC 620: 6.9 million; Alpha 21164: 9.3 million; Sparc Ultra: 5.2 million


    Moore's Law

    Cramming More Components onto Integrated Circuits

    Gordon Moore, Electronics, 1965

    # of transistors per cost-effective integrated circuit doubles every 18 months

    Transistor capacity doubles every 18-24 months

    Speed 2x / 1.5 years (since 85);

    100X performance in last decade



    Trend: Microprocessor Performance


    Memory

    Dynamic Random Access Memory (DRAM)

    The choice for main memory

    Volatile (contents go away when power is lost)

    Fast

    Relatively small

    DRAM capacity: 2x / 2 years (since '96); 64x size improvement in last decade

    Static Random Access Memory (SRAM)

    The choice for cache

    Much faster than DRAM, but less dense and more costly

    Magnetic disks

    The choice for secondary memory

    Non-volatile

    Slower

    Relatively large

    Capacity: 2x / 1 year (since '97); 250X size increase in last decade

    Solid state (Flash) memory

    The choice for embedded computers

    Non-volatile


    Memory

    Optical disks

    Removable, therefore very large

    Slower than disks

    Magnetic tape

    Even slower

    Sequential (non-random) access

    The choice for archival


    DRAM Capacity Growth

    [Chart: Kbit capacity per DRAM chip vs. year of introduction, 1976-2002, log scale; generations 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M]


    Trend: Memory Capacity

    Growth of DRAM capacity per chip:

        Year    Size (Mbit)
        1980    0.0625
        1983    0.25
        1986    1
        1989    4
        1992    16
        1996    64
        1998    128
        2000    256
        2002    512
        2006    2048

    Now 1.4X/yr, or 2X every 2 years; more than 10000X since 1980!


    Dramatic Technology Change

    (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta = successive factors of 1024)

    Come up with a clever mnemonic, fame!

    State-of-the-art PC when you graduate (at least):

    Processor clock speed: 5000 MegaHertz (5.0 GigaHertz)

    Memory capacity: 4000 MegaBytes (4.0 GigaBytes)

    Disk capacity: 2000 GigaBytes (2.0 TeraBytes)

    New units! Mega => Giga, Giga => Tera


    Example Machine Organization

    Workstation design target

    25% of cost on processor

    25% of cost on memory (minimum memory size)

    Rest on I/O devices, power supplies, box

    [Figure: a computer consists of a CPU (control + datapath), memory, and input/output devices]


    MIPS R3000 Instruction Set Architecture

    Instruction Categories

    Load/Store

    Computational

    Jump and Branch

    Floating Point

    - coprocessor

    Memory Management

    Special

    Registers: R0 - R31, PC, HI, LO

    3 Instruction Formats, all 32 bits wide:

    OP | rs | rt | rd | sa | funct

    OP | rs | rt | immediate

    OP | jump target


    Defining Performance

    Which airplane has the best performance?

    [Charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 on four metrics: passenger capacity, cruising range (miles), cruising speed (mph), and passengers x mph]


    Response Time and Throughput

    Response time

    How long it takes to do a task

    Throughput

    Total work done per unit time

    - e.g., tasks/transactions per hour

    How are response time and throughput affected by

    Replacing the processor with a faster version?

    Adding more processors?

    We'll focus on response time for now


    Relative Performance

    Define Performance = 1 / Execution Time

    "X is n times faster than Y" means:

        Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n

    Example: time taken to run a program

    10s on A, 15s on B

    Execution Time_B / Execution Time_A = 15s / 10s = 1.5

    So A is 1.5 times faster than B
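
    A minimal C sketch of this speedup calculation (the 10s/15s figures are just the slide's example):

        #include <stdio.h>

        int main(void) {
            /* Example from the slide: the same program takes 10 s on A, 15 s on B */
            double time_a = 10.0, time_b = 15.0;

            /* Performance = 1 / Execution Time, so speedup of A over B = time_b / time_a */
            printf("A is %.1f times faster than B\n", time_b / time_a);   /* 1.5 */
            return 0;
        }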


    Measuring Execution Time

    Elapsed time

    Total response time, including all aspects

    - Processing, I/O, OS overhead, idle time

    Determines system performance

    CPU time

    Time spent processing a given job

    - Discounts I/O time, other jobs' shares

    Comprises user CPU time and system CPU time

    Different programs are affected differently by CPU and system performance


    CPU Clocking

    Operation of digital hardware governed by a constant-rate clock

    [Figure: clock signal; data transfer and computation happen within a clock period, then state is updated]

    Clock period: duration of a clock cycle

    e.g., 250ps = 0.25ns = 250 × 10^-12 s

    Clock frequency (rate): cycles per second

    e.g., 4.0GHz = 4000MHz = 4.0 × 10^9 Hz
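
    A small C sketch of the reciprocal relationship between period and rate, using the numbers above:

        #include <stdio.h>

        int main(void) {
            double period_s = 250e-12;            /* 250 ps clock period */
            double rate_hz  = 1.0 / period_s;     /* = 4.0e9 Hz = 4.0 GHz */
            printf("250 ps period -> %.1f GHz\n", rate_hz / 1e9);

            double rate2_hz   = 4.0e9;            /* 4.0 GHz clock rate */
            double period2_ps = 1e12 / rate2_hz;  /* = 250 ps */
            printf("4.0 GHz rate -> %.0f ps period\n", period2_ps);
            return 0;
        }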


    CPU Time

    Performance improved by

    Reducing number of clock cycles

    Increasing clock rate

    Hardware designer must often trade off clock rate against cycle count

    CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate


    CPU Time Example

    Computer A: 2GHz clock, 10s CPU time

    Designing Computer B

    Aim for 6s CPU time

    Can use a faster clock, but it causes 1.2× as many clock cycles

    How fast must Computer B clock be?

    Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20 × 10^9

    Clock Cycles_B = 1.2 × Clock Cycles_A = 1.2 × 20 × 10^9 = 24 × 10^9

    Clock Rate_B = Clock Cycles_B / CPU Time_B = 24 × 10^9 / 6s = 4GHz
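
    The same arithmetic as a C sketch, with the values taken straight from the example above:

        #include <stdio.h>

        int main(void) {
            /* Computer A: 2 GHz clock, 10 s of CPU time */
            double rate_a   = 2e9, time_a = 10.0;
            double cycles_a = time_a * rate_a;     /* 20e9 cycles */

            /* Computer B needs 1.2x the cycles but must finish in 6 s */
            double cycles_b = 1.2 * cycles_a;      /* 24e9 cycles */
            double rate_b   = cycles_b / 6.0;      /* required clock rate */

            printf("Computer B needs a %.1f GHz clock\n", rate_b / 1e9);   /* 4.0 */
            return 0;
        }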


    Instruction Count and CPI

    Instruction Count for a program

    Determined by program, ISA and compiler

    Average cycles per instruction

    Determined by CPU hardware

    If different instructions have different CPI

    - Average CPI affected by instruction mix

    Clock Cycles = Instruction Count × Cycles per Instruction

    CPU Time = Instruction Count × CPI × Clock Cycle Time = Instruction Count × CPI / Clock Rate
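
    A hedged C sketch of this relationship; the instruction count, CPI and clock rate below are made-up illustrative numbers, not measurements:

        #include <stdio.h>

        /* CPU Time = Instruction Count x CPI / Clock Rate */
        static double cpu_time(double insn_count, double cpi, double clock_rate_hz) {
            return insn_count * cpi / clock_rate_hz;
        }

        int main(void) {
            /* Hypothetical program: 1e9 instructions, CPI 2.0, on a 2 GHz clock */
            printf("CPU time = %.2f s\n", cpu_time(1e9, 2.0, 2e9));   /* 1.00 s */
            return 0;
        }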


    CPI Example

    Computer A: Cycle Time = 250ps, CPI = 2.0

    Computer B: Cycle Time = 500ps, CPI = 1.2

    Same ISA

    Which is faster, and by how much?

    CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250ps = 500ps × I   <- A is faster

    CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500ps = 600ps × I

    CPU Time_B / CPU Time_A = (600ps × I) / (500ps × I) = 1.2   <- ... by this much
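
    The comparison as a C sketch; any common instruction count I works, since it cancels in the ratio:

        #include <stdio.h>

        int main(void) {
            double I      = 1e9;                  /* common instruction count (arbitrary) */
            double time_a = I * 2.0 * 250e-12;    /* CPI 2.0, 250 ps cycle time */
            double time_b = I * 1.2 * 500e-12;    /* CPI 1.2, 500 ps cycle time */

            printf("CPU Time B / CPU Time A = %.1f, so A is 1.2x faster\n",
                   time_b / time_a);
            return 0;
        }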


    CPI in More Detail

    If different instruction classes take different numbers of cycles:

        Clock Cycles = Σ (i = 1 to n) CPI_i × Instruction Count_i

    Weighted average CPI:

        CPI = Clock Cycles / Instruction Count = Σ (i = 1 to n) CPI_i × (Instruction Count_i / Instruction Count)

    where Instruction Count_i / Instruction Count is the relative frequency of class i


    CPI Example

    Alternative compiled code sequences using instructions in classes A, B, C

        Class              A   B   C
        CPI for class      1   2   3
        IC in sequence 1   2   1   2
        IC in sequence 2   4   1   1

    Sequence 1: IC = 5

    Clock Cycles = 2×1 + 1×2 + 2×3 = 10

    Avg. CPI = 10/5 = 2.0

    Sequence 2: IC = 6

    Clock Cycles = 4×1 + 1×2 + 1×3 = 9

    Avg. CPI = 9/6 = 1.5
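
    A small C sketch that reproduces both sequences' weighted CPIs from the table above:

        #include <stdio.h>

        int main(void) {
            double cpi[3]  = {1, 2, 3};    /* CPI of classes A, B, C */
            int    seq1[3] = {2, 1, 2};    /* instruction counts in sequence 1 */
            int    seq2[3] = {4, 1, 1};    /* instruction counts in sequence 2 */
            int   *seqs[2] = {seq1, seq2};

            for (int s = 0; s < 2; s++) {
                double cycles = 0.0;
                int    ic     = 0;
                for (int i = 0; i < 3; i++) {
                    cycles += cpi[i] * seqs[s][i];
                    ic     += seqs[s][i];
                }
                printf("Sequence %d: IC = %d, cycles = %.0f, avg CPI = %.1f\n",
                       s + 1, ic, cycles, cycles / ic);
            }
            return 0;
        }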


    Performance Summary

    Performance depends on

    Algorithm: affects IC, possibly CPI

    Programming language: affects IC, CPI

    Compiler: affects IC, CPI

    Instruction set architecture: affects IC, CPI, Tc

    The BIG Picture

    CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)


    Power Trends

    In CMOS IC technology


    Power = Capacitive load × Voltage^2 × Frequency

    (over time, roughly: frequency ×1000, voltage 5V → 1V, power ×30)


    Reducing Power

    Suppose a new CPU has

    85% of capacitive load of old CPU

    15% voltage and 15% frequency reduction

    P_new / P_old = (C_old × 0.85) × (V_old × 0.85)^2 × (F_old × 0.85) / (C_old × V_old^2 × F_old) = 0.85^4 = 0.52
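
    The ratio as a tiny C sketch (all three factors scaled by 0.85):

        #include <stdio.h>

        int main(void) {
            /* Dynamic power ~ capacitive load x voltage^2 x frequency;
               scale load, voltage and frequency each by 0.85 */
            double ratio = 0.85 * (0.85 * 0.85) * 0.85;   /* = 0.85^4 */
            printf("P_new / P_old = %.2f\n", ratio);      /* ~0.52 */
            return 0;
        }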

    The power wall

    We can't reduce voltage further

    We can't remove more heat

    How else can we improve performance?


    Uniprocessor Performance

    The sea change: the switch to multiprocessors

    Constrained by power, instruction-level parallelism,

    memory latency


    Multiprocessors

    Multicore microprocessors

    More than one processor per chip

    Requires explicitly parallel programming

    Compare with instruction level parallelism

    - Hardware executes multiple instructions at once

    - Hidden from the programmer

    Hard to do

    - Programming for performance

    - Load balancing

    - Optimizing communication and synchronization


    SPEC CPU Benchmark

    Programs used to measure performance

    Supposedly typical of actual workload

    Standard Performance Evaluation Corp (SPEC)

    Develops benchmarks for CPU, I/O, Web,

    SPEC CPU2006

    Elapsed time to execute a selection of programs

    - Negligible I/O, so focuses on CPU performance

    Normalize relative to reference machine

    Summarize as geometric mean of performance ratios

    - CINT2006 (integer) and CFP2006 (floating-point)

        Summary measure = n-th root of ( Π (i = 1 to n) Execution time ratio_i )
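
    As a check, a short C sketch computing the geometric mean of the twelve SPECratios listed in the CINT2006 table on the next slide; it reproduces the table's 11.7:

        #include <stdio.h>
        #include <math.h>

        int main(void) {
            /* SPECratios from the CINT2006 for Opteron X4 2356 table */
            double r[12] = {15.3, 11.8, 11.1, 6.8, 14.6, 10.5,
                            14.5, 19.8, 22.3, 9.1, 9.1, 6.0};
            double log_sum = 0.0;
            for (int i = 0; i < 12; i++)
                log_sum += log(r[i]);                            /* sum of logs */
            printf("geometric mean = %.1f\n", exp(log_sum / 12));   /* ~11.7 */
            return 0;
        }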


    CINT2006 for Opteron X4 2356

    Name         Description                     IC × 10^9   CPI     Tc (ns)   Exec time   Ref time   SPECratio

    perl         Interpreted string processing       2,118    0.75     0.40        637        9,777       15.3
    bzip2        Block-sorting compression           2,389    0.85     0.40        817        9,650       11.8
    gcc          GNU C Compiler                      1,050    1.72     0.47        724        8,050       11.1
    mcf          Combinatorial optimization            336   10.00     0.40      1,345        9,120        6.8
    go           Go game (AI)                        1,658    1.09     0.40        721       10,490       14.6
    hmmer        Search gene sequence                2,783    0.80     0.40        890        9,330       10.5
    sjeng        Chess game (AI)                     2,176    0.96     0.48        837       12,100       14.5
    libquantum   Quantum computer simulation         1,623    1.61     0.40      1,047       20,720       19.8
    h264avc      Video compression                   3,102    0.80     0.40        993       22,130       22.3
    omnetpp      Discrete event simulation             587    2.94     0.40        690        6,250        9.1
    astar        Games/path finding                  1,082    1.79     0.40        773        7,020        9.1
    xalancbmk    XML parsing                         1,058    2.70     0.40      1,143        6,900        6.0

    Geometric mean                                                                                        11.7

    High cache miss rates


    SPEC Power Benchmark

    Power consumption of server at different workload levels

    Performance: ssj_ops/sec

    Power: Watts (Joules/sec)

        Overall ssj_ops per Watt = ( Σ (i = 0 to 10) ssj_ops_i ) / ( Σ (i = 0 to 10) power_i )


    SPECpower_ssj2008 for X4

    Target Load %   Performance (ssj_ops/sec)   Average Power (Watts)

    100%                    231,867                     295
    90%                     211,282                     286
    80%                     185,803                     275
    70%                     163,427                     265
    60%                     140,160                     256
    50%                     118,324                     246
    40%                      92,035                     233
    30%                      70,500                     222
    20%                      47,126                     206
    10%                      23,066                     180
    0%                            0                     141

    Overall sum           1,283,590                   2,605

    ssj_ops / power             493
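
    A short C sketch that recomputes the overall metric from the table above:

        #include <stdio.h>

        int main(void) {
            /* ssj_ops and average power at the 11 target load levels (100% .. 0%) */
            double ops[11]   = {231867, 211282, 185803, 163427, 140160, 118324,
                                92035, 70500, 47126, 23066, 0};
            double watts[11] = {295, 286, 275, 265, 256, 246, 233, 222, 206, 180, 141};

            double ops_sum = 0.0, power_sum = 0.0;
            for (int i = 0; i < 11; i++) {
                ops_sum   += ops[i];
                power_sum += watts[i];
            }
            /* Overall ssj_ops per Watt = sum of ssj_ops / sum of power */
            printf("overall ssj_ops per Watt = %.0f\n", ops_sum / power_sum);   /* ~493 */
            return 0;
        }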


    Pitfall: Amdahl's Law

    Improving an aspect of a computer and expecting a proportional improvement in overall performance


    T_improved = T_affected / improvement factor + T_unaffected

    Example: multiply accounts for 80s out of a 100s total

    How much improvement in multiply performance to get 5× overall?

        20 = 80/n + 20   ->   Can't be done!

    Corollary: make the common case fast
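
    A small C sketch of Amdahl's Law applied to the example: even huge speedups of the multiply leave the 20s that is unaffected, so the 20s target is unreachable:

        #include <stdio.h>

        /* T_improved = T_affected / improvement_factor + T_unaffected */
        static double improved_time(double t_affected, double t_unaffected, double factor) {
            return t_affected / factor + t_unaffected;
        }

        int main(void) {
            /* Multiply takes 80 s of a 100 s program; the target is 20 s (5x overall) */
            for (double n = 10.0; n <= 1e6; n *= 100.0)
                printf("multiply sped up %8.0fx -> total %.4f s\n",
                       n, improved_time(80.0, 20.0, n));
            return 0;
        }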


    Fallacy: Low Power at Idle

    Look back at X4 power benchmark

    At 100% load: 295W

    At 50% load: 246W (83%)

    At 10% load: 180W (61%)

    Google data center

    Mostly operates at 10% to 50% load

    At 100% load less than 1% of the time

    Consider designing processors to make power proportional to load


    Pitfall: MIPS as a Performance Metric

    MIPS: Millions of Instructions Per Second

    Doesn't account for

    - Differences in ISAs between computers

    - Differences in complexity between instructions

    MIPS = Instruction count / (Execution time × 10^6)

         = Instruction count / ((Instruction count × CPI / Clock rate) × 10^6)

         = Clock rate / (CPI × 10^6)

    CPI varies between programs on a given CPU
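
    A small C sketch of why MIPS is misleading: a hypothetical 4 GHz machine gets very different MIPS ratings for two programs with different CPIs, even though the hardware is unchanged (the CPIs here are made up for illustration):

        #include <stdio.h>

        /* MIPS = clock rate / (CPI x 10^6) */
        static double mips(double clock_rate_hz, double cpi) {
            return clock_rate_hz / (cpi * 1e6);
        }

        int main(void) {
            printf("CPI 1.0 -> %.0f MIPS\n", mips(4e9, 1.0));   /* 4000 MIPS */
            printf("CPI 2.5 -> %.0f MIPS\n", mips(4e9, 2.5));   /* 1600 MIPS */
            return 0;
        }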


    Concluding Remarks

    Cost/performance is improving

    Due to underlying technology development

    Hierarchical layers of abstraction

    In both hardware and software

    Instruction set architecture

    The hardware/software interface

    Execution time: the best performance measure

    Power is a limiting factor

    Use parallelism to improve performance
