CS252/Kubiatowicz Lec 1.1 8/25/03 CS252 Graduate Computer Architecture Lecture 1 Review of Technology Trends and Cost/Performance August 25, 2003 Prof

CS252/Kubiatowicz Lec 1.1

8/25/03

CS252 Graduate Computer Architecture

Lecture 1

Review of Technology Trends and Cost/Performance

August 25, 2003

Prof. John Kubiatowicz

http://www.cs.berkeley.edu/~kubitron/cs252-F03


8/25/03

Original

Big Fishes Eating Little Fishes


8/25/03

1988 Computer Food Chain

PC

Workstation

Minicomputer

Mainframe

Mini-supercomputer

Supercomputer

Massively Parallel Processors


8/25/03

1998 Computer Food Chain

PC

Workstation

Mainframe

Supercomputer

Mini-supercomputer

Massively Parallel Processors

Minicomputer

Now who is eating whom?

Server


8/25/03

Why Such Change in 10 years?

• Performance
– Technology advances
» CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
– Computer architecture advances improve the low end
» RISC, superscalar, RAID, …

• Price: lower costs due to …
– Simpler development
» CMOS VLSI: smaller systems, fewer components
– Higher volumes
» CMOS VLSI: same development cost for 10,000 vs. 10,000,000 units
– Lower margins by class of computer, due to fewer services

• Function
– Rise of networking/local interconnection technology


8/25/03

Amazing Underlying Technology Change

• “Cramming More Components onto Integrated Circuits”

– Gordon Moore, Electronics, 1965


8/25/03

Technology Trends: Microprocessor Capacity

[Figure: Moore’s Law. Transistors per chip (log scale, 1,000 to 100,000,000) vs. year (1970 to 2000), with data points for the i4004, i8080, i8086, i80286, i80386, i80486, and Pentium.]

CMOS improvements:
• Die size: 2X every 3 yrs
• Line width: halve / 7 yrs

Transistor counts:
• Pentium 4: 55 million
• Alpha 21264: 15 million
• Pentium Pro: 5.5 million
• PowerPC 620: 6.9 million
• Alpha 21164: 9.3 million
• Sparc Ultra: 5.2 million


8/25/03

Memory Capacity (Single Chip DRAM)

[Figure: bits per chip (log scale, 1,000 to 1,000,000,000) vs. year (1970 to 2000).]

year   size (Mb)   cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns
2003   1024        60 ns


8/25/03

Technology: dramatic change

• Processor
– logic capacity: about 30% per year
– clock rate: about 20% per year

• Memory
– DRAM capacity: about 60% per year (4x every 3 years)
– Memory speed: about 10% per year
– Cost per bit: improves about 25% per year

• Disk
– capacity: about 60% per year
– Total use of data: 100% per 9 months!

• Network Bandwidth
– Bandwidth increasing more than 100% per year!
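As a rough check on the annual rates above, compounding shows how "60% per year" relates to "4x every 3 years". This is a minimal sketch; the helper names `growth_over` and `doubling_time` are illustrative and not from the lecture.

```python
import math

def growth_over(annual_rate: float, years: float) -> float:
    """Total growth factor after compounding annual_rate for the given years."""
    return (1.0 + annual_rate) ** years

def doubling_time(annual_rate: float) -> float:
    """Years needed to double at a compound annual_rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)

# DRAM capacity: 60% per year compounds to roughly 4x in 3 years.
print(round(growth_over(0.60, 3), 2))   # 4.1
# Memory speed: 10% per year needs about 7.3 years to double.
print(round(doubling_time(0.10), 1))    # 7.3
```

The gap between the two results is the point of the slide: capacity quadruples in the time it takes speed to improve by about a third.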


8/25/03

Computers in the News: New IBM Transistor

• Announced 12/10/02
• 6nm gate length!!!
• Details: Still to be announced


8/25/03

Processor Performance Trends

[Figure: performance (log scale, 0.1 to 1000) vs. year (1965 to 2000) for Supercomputers, Mainframes, Minicomputers, and Microprocessors.]


8/25/03

Processor Performance (1.35X before, 1.55X now)

[Figure: performance (0 to 1200) vs. year (1987 to 1997) for the DEC Alpha 21164/600, DEC Alpha 5/500, DEC Alpha 5/300, DEC Alpha 4/266, IBM POWER 100, DEC AXP/500, HP 9000/750, Sun-4/260, IBM RS/6000, MIPS M/120, and MIPS M/2000; the trend line is labeled 1.54X/yr.]


8/25/03

Computer Architecture Is …

the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.

Amdahl, Blaauw, and Brooks, 1964

SOFTWARE


8/25/03

Computer Architecture’s Changing Definition

• 1950s to 1960s: Computer Architecture Course: Computer Arithmetic

• 1970s to mid 1980s: Computer Architecture Course: Instruction Set Design, especially ISA appropriate for compilers

• 1990s: Computer Architecture Course: Design of CPU, memory system, I/O system, Multiprocessors, Networks

• 2010s: Computer Architecture Course: Self adapting systems? Self organizing structures? DNA Systems/Quantum Computing?


8/25/03

Instruction Set Architecture (ISA)

instruction set

software

hardware


8/25/03

Evolution of Instruction Sets

Single Accumulator (EDSAC 1950)

Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)

Separation of Programming Model from Implementation
– High-level Language Based (B5000 1963)
– Concept of a Family (IBM 360 1964)

General Purpose Register Machines
– Complex Instruction Sets (Vax, Intel 432 1977-80)
– Load/Store Architecture (CDC 6600, Cray 1 1963-76)

RISC (Mips, Sparc, HP-PA, IBM RS6000, . . . 1987)


8/25/03

Interface Design

A good interface:

• Lasts through many implementations (portability, compatibility)

• Is used in many different ways (generality)

• Provides convenient functionality to higher levels

• Permits an efficient implementation at lower levels

[Figure: one interface over time, with several uses above it and successive implementations (imp 1, imp 2, imp 3) below it.]


8/25/03

Virtualization: One of the lessons of RISC

• Integrated Systems Approach
– What really matters is the functioning of the complete system, i.e. hardware, runtime system, compiler, and operating system
– In networking, this is called the “End to End argument”
– Programmers care about high-level languages, debuggers, source-level object-oriented programming

• Computer architecture is not just about transistors, individual instructions, or particular implementations

• Original RISC projects replaced complex instructions with a compiler + simple instructions

• Logical Extension => Genetically adaptive runtime systems enhanced by dynamic compilation running on reconfigurable hardware? Perhaps.


8/25/03

Computer Architecture Topics

[Figure: layered view of a computer system, annotated with course topics.]

• Instruction Set Architecture: Addressing, Protection, Exception Handling
• Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, Dynamic Compilation
• Memory Hierarchy: L1 Cache, L2 Cache, DRAM; Coherence, Bandwidth, Latency; Emerging Technologies, Interleaving, Bus protocols
• Input/Output and Storage: Disks, WORM, Tape; RAID
• VLSI
• Network Communication
• Other Processors


8/25/03

Sample Organization: It’s all about communication

[Figure: Pentium III Chipset. Proc, Caches, Busses, Memory, and I/O Devices (controllers and adapters for Disks, Displays, Keyboards, Networks).]


8/25/03

Computer Architecture Topics

[Figure: processor (P) / memory (M) pairs attached to an interconnection network through switches (S).]

• Multiprocessors: Shared Memory, Message Passing, Data Parallelism
• Networks and Interconnections: Topologies, Routing, Bandwidth, Latency, Reliability
• Network Interfaces
• Processor-Memory-Switch


8/25/03

CS 252 Course Focus

Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st Century

[Figure: Computer Architecture (Instruction Set Design, Organization, Hardware/Software Boundary) at the center, surrounded by Technology, Programming Languages, Operating Systems, History, Applications, Interface Design (ISA), Measurement & Evaluation, Parallelism, and Compilers.]


8/25/03

Topic Coverage

Textbook: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 3rd Ed., 2002.

Research Papers -- Handed out in class

• 1.5 weeks: Review: Fundamentals of Computer Architecture (Ch. 1), Instruction Set Architecture (Ch. 2), Pipelining (Ch. 3)
• 2.5 weeks: Pipelining, Interrupts, and Instruction Level Parallelism (Ch. 4), Vector Processors (Appendix B)
• 1.5 weeks: Dynamic Compilation. Data Speculation (papers). Complexity, design via genetic algorithms
• 1 week: Memory Hierarchy (Ch. 5)
• 1.5 weeks: Fault Tolerance, Input/Output and Storage (Ch. 6)
• 1.5 weeks: Networks and Interconnection Technology (Ch. 7)
• 1.5 weeks: Multiprocessors (Ch. 8 + Research papers + Culler book draft Chapter 1)
• 1 week: Quantum Computing, DNA Computing


8/25/03

CS252: Information

Instructor: Prof. John D. Kubiatowicz

Office: 673 Soda Hall, 643-6817, kubitron@cs

Office Hours: Wed 3:30 - 5:00 or by appt. (Contact Veronique Richards, 642-4334, nicou@cs, 676 Soda)

T. A: TBA

Class: Mon/Wed, 1:00 - 2:30pm 310 Soda Hall

Text: Computer Architecture: A Quantitative Approach, Third Edition (2002)

Web page: http://www.cs/~kubitron/courses/cs252-F03/

Lectures available online <11:30AM day of lecture

Newsgroup: ucb.class.cs252

Email: [email protected]


8/25/03

Lecture style

• 1-Minute Review
• 20-Minute Lecture/Discussion
• 5-Minute Administrative Matters
• 25-Minute Lecture/Discussion
• 5-Minute Break (water, stretch)
• 25-Minute Lecture/Discussion
• Instructor will come to class early & stay after to answer questions

[Figure: attention vs. time, with markers at “20 min.”, “Break”, and “In Conclusion, ...”]


8/25/03

Grading

• 10% Homeworks (work in pairs)
• 40% Examinations (2 Midterms)
• 40% Research Project (work in pairs)
– Transition from undergrad to grad student
– Berkeley wants you to succeed, but you need to show initiative
– pick topic
– meet 3 times with faculty/TA to see progress
– give oral presentation
– give poster session
– written report like conference paper
– 3 weeks work full time for 2 people
– Opportunity to do “research in the small” to help make transition from good student to research colleague
• 10% Class Participation


8/25/03

Quizzes

• Reduce the pressure of taking quizzes
– Only 2 Graded Quizzes: Tentative: Wed Oct 13th and Wed Dec 1st
– Our goal: test knowledge vs. speed writing
– 3 hrs to take 1.5-hr test (5:30-8:30 PM, TBA location)
– Both mid-term quizzes: can bring summary sheet
» Transfer ideas from book to paper
– Last chance Q&A: during class time day of exam

• Students/Staff meet over free pizza/drinks at La Val’s: Wed Oct 13th (8:30 PM) and Wed Dec 1st (8:30 PM)


8/25/03

Research Paper Reading

• As graduate students, you are now researchers.
• Most information of importance to you will be in research papers.
• Ability to rapidly scan and understand research papers is key to your success.
• So: you will read lots of papers in this course!
– Quick 1 paragraph summaries will be due in class
– Important supplement to book.
– Will discuss papers in class
• Papers will be scanned and on web page.


8/25/03

More Course Info

• Everything is on the course Web page: www.cs.berkeley.edu/~kubitron/courses/cs252-F03

• Notes:
– Not sure about the state of textbooks at the Student Center.
– The course Web page includes a pointer to last term’s 152 home page. The “handout” page includes pointers to old 152 quizzes.

• Schedule:
– 2 Graded Quizzes: Mon Oct 13th and Mon Dec 1st
– Veteran’s Day: Friday Nov 5th
– Thanksgiving Vacation: Thur Nov 27th - Sun Nov 28th
– Oral Presentations: Tue/Wed Dec 9/10th
– 252 Last lecture: Fri Dec 3rd
– 252 Poster Session: ???
– Project Papers/URLs due: Fri Dec 12th

• Project Suggestions: TBA


8/25/03

Related Courses

• CS 152: How to build it, implementation details
• CS 252: Why, analysis, evaluation
• CS 258: Parallel architectures, languages, systems
• CS 250: Integrated circuit technology from a computer-organization viewpoint

Strong prerequisite: basic knowledge of the organization of a computer is assumed!


8/25/03

Coping with CS 252

• Too many students with too varied background?
– Next Wednesday - Prerequisite exam

• Limiting Number of Students
– First priority is CS/EECS grad students taking prelims
– Second priority is N-th year CS/EECS grad students (breadth)
– Third priority is College of Engineering grad students
– Fourth priority is CS/EECS undergraduate seniors (Note: 1 graduate course unit = 2 undergraduate course units)
– All other categories

• If not this semester, 252 is offered regularly


8/25/03

Coping with CS 252

• Students with too varied background?
– In past, CS grad students took written prelim exams on undergraduate material in hardware, software, and theory
– 1st 5 weeks reviewed background, helped 252, 262, 270
– Prelims were dropped => some unprepared for CS 252?

• In class exam on Wednesday September 3rd
– Doesn’t affect grade, only admission into class
– 2 grades: Admitted or audit/take CS 152 1st
– Improve your experience if recapture common background

• Review: Chapters 1-3, CS 152 home page, maybe “Computer Organization and Design (COD) 2/e”
– Chapters 1 to 8 of COD if never took prerequisite
– If took a class, be sure COD Chapters 2, 6, 7 are familiar
– Copies in Bechtel Library on 2-hour reserve
– Last exam on previous-year’s web site (~kubitron/courses/cs252-F00)


8/25/03

Building Hardware that Computes


8/25/03

Finite State Machines:

• System state is explicit in representation
• Transitions between states represented as arrows with inputs on arcs.
• Output may be either part of state or on arcs

[Figure: “Mod 3 Machine” with states Alpha/0, Beta/1, and Delta/2; arcs are labeled with the input bit (0 or 1). Example: the input 106, fed MSB first as 1 1 0 1 0 1 0, takes the machine through outputs 1 0 0 1 2 2 1, so 106 mod 3 = 1.]
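The “Mod 3 Machine” can be sketched as a Moore machine in a few lines. This is an illustration under an assumption: the transition rule `(2 * state + bit) % 3` is inferred from the machine's purpose (reading a binary number MSB first), and the function names are hypothetical.

```python
# Moore machine: the output depends only on the current state.
# State names follow the slide; the numeric value is the state's output.
STATE_NAMES = {0: "Alpha", 1: "Beta", 2: "Delta"}

def next_state(state: int, bit: int) -> int:
    """Shifting in one more bit doubles the value seen so far and adds the bit."""
    return (2 * state + bit) % 3

def mod3(bits) -> int:
    """Run the machine over bits (MSB first) and return the final output."""
    state = 0  # start in Alpha
    for bit in bits:
        state = next_state(state, bit)
    return state

bits_106 = [int(c) for c in format(106, "b")]  # 1 1 0 1 0 1 0
result = mod3(bits_106)
print(STATE_NAMES[result], result)  # Beta 1
```

Only three states are needed however long the input is, because the remainder is all the machine has to remember.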


8/25/03

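Implementation as Combinational Logic + Latch

[Figure: the Mod 3 machine drawn two ways. As a “Moore Machine”, outputs are attached to the states (Alpha/0, Beta/1, Delta/2). As a “Mealy Machine”, outputs are attached to the arcs, labeled input/output (0/0, 1/0, 1/1, 0/1, 0/0, 1/1). Both are built from a block of combinational logic feeding a latch that holds the state.]

Input   State_old   State_new   Div
0       00          00          0
0       01          10          0
0       10          01          1
1       00          01          0
1       01          00          1
1       10          10          1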

8/25/03

Microprogrammed Controllers

• State machine in which part of state is a “micro-pc”.
– Explicit circuitry for incrementing or changing PC
• Includes a ROM with “microinstructions”.
– Controlled logic implements at least branches and jumps

[Figure: a ROM of instructions addressed by the micro-PC. Each ROM word supplies an Instruction field (Control) to the combinational logic / controlled machine and a Branch field; a MUX selects the Next Address from either PC + 1 or the Branch field. The micro-PC forms the state with address.]

Addr: Instruction, Branch
0: forw 35 xxx
1: b_no_obstacles 000
2: back 10 xxx
3: rotate 90 xxx
4: goto 001
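The five-word microprogram on this slide can be traced with a small simulator. This is only a sketch: the slide does not define the instruction semantics, so the code assumes that `b_no_obstacles` branches to its target when a (hypothetical) sensor reports a clear path, that `goto` always branches, and that every other instruction is passed to the controlled machine before the micro-PC increments.

```python
# ROM contents from the slide: (operation, argument, branch_target).
ROM = [
    ("forw", 35, None),
    ("b_no_obstacles", None, 0),
    ("back", 10, None),
    ("rotate", 90, None),
    ("goto", None, 1),
]

def run(sensor_clear, max_steps=20):
    """Return the actions issued to the machine.

    sensor_clear supplies one boolean per branch test (an assumed input);
    the simulation stops when the readings run out or after max_steps.
    """
    readings = iter(sensor_clear)
    upc, actions = 0, []
    for _ in range(max_steps):
        op, arg, target = ROM[upc]
        if op == "goto":
            upc = target                        # MUX selects the Branch field
        elif op == "b_no_obstacles":
            clear = next(readings, None)
            if clear is None:
                break                           # no more sensor data
            upc = target if clear else upc + 1  # Branch field or PC + 1
        else:
            actions.append((op, arg))
            upc += 1                            # MUX selects PC + 1
    return actions

print(run([True, False, True]))
```

With the readings `[True, False, True]` the trace moves forward twice, backs up and rotates after the blocked reading, then moves forward again.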


8/25/03

Execution Cycle

• Instruction Fetch: Obtain instruction from program storage
• Instruction Decode: Determine required actions and instruction size
• Operand Fetch: Locate and obtain operand data
• Execute: Compute result value or status
• Result Store: Deposit results in storage for later use
• Next Instruction: Determine successor instruction


8/25/03

What’s a Clock Cycle?

• Old days: 10 levels of gates
• Today: determined by numerous time-of-flight issues + gate delays
– clock propagation, wire lengths, drivers

[Figure: combinational logic between latches or registers.]


8/25/03

Pipelined Instruction Interpretation

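[Figure: pipeline datapath. Stages: Next Instruction (NI), Instruction Fetch (IF), Decode & Operand Fetch (D), Execute (E), Store Results (W), separated by the Instruction Address, Instruction Register, Operand Registers, and Result Registers; results go to registers or memory.]

[Figure: five instructions overlapped in time, each passing through NI, IF, D, E, W one stage behind its predecessor.]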


8/25/03

Sequential Laundry

• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

[Figure: task order (loads A, B, C, D) vs. time, 6 PM to midnight. Each load takes 30 + 40 + 20 minutes, and each load starts only after the previous one finishes.]


8/25/03

Pipelined Laundry: Start work ASAP

• Pipelined laundry takes 3.5 hours for 4 loads

[Figure: task order (loads A, B, C, D) vs. time, starting at 6 PM. The loads overlap, giving segments of 30, 40, 40, 40, 40, and 20 minutes.]
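The 6-hour and 3.5-hour figures follow from the stage times of 30, 40, and 20 minutes. A minimal sketch, assuming one machine per stage and loads that pass through the stages in order; the function names are illustrative.

```python
STAGES = [30, 40, 20]  # minutes per stage, as shown on the slides

def sequential_minutes(loads: int, stages=STAGES) -> int:
    """Each load runs all stages before the next load starts."""
    return loads * sum(stages)

def pipelined_minutes(loads: int, stages=STAGES) -> int:
    """A load enters a stage once it has left the previous stage
    and the stage has finished the load ahead of it."""
    stage_free_at = [0] * len(stages)
    for _ in range(loads):
        t = 0
        for i, duration in enumerate(stages):
            start = max(t, stage_free_at[i])
            t = start + duration
            stage_free_at[i] = t
    return stage_free_at[-1]

print(sequential_minutes(4) / 60)  # 6.0
print(pipelined_minutes(4) / 60)   # 3.5
```

The pipelined total is 30 + 4 × 40 + 20 minutes: the 40-minute stage sets the rate, and the 30 and 20 are the fill and drain times.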


8/25/03

Pipelining Lessons

• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = Number pipe stages
• Unbalanced lengths of pipe stages reduces speedup
• Time to “fill” pipeline and time to “drain” it reduces speedup

[Figure: the pipelined laundry chart again: loads A, B, C, D vs. time from 6 PM, with segments of 30, 40, 40, 40, 40, and 20 minutes.]


8/25/03

The Process of Design

Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems

[Figure: a cycle of Design, Analysis, and Creativity. Cost/Performance Analysis sorts candidates into Good Ideas, Mediocre Ideas, and Bad Ideas.]


8/25/03

Measurement Tools

• Benchmarks, Traces, Mixes
• Hardware: Cost, delay, area, power estimation
• Simulation (many levels)
– ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental “Laws”/Principles


8/25/03

The Bottom Line: Performance (and Cost)

• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
– Throughput, bandwidth

Plane              Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours     470          286,700
BAC/Sud Concorde   1350 mph   3 hours       132          178,200
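The throughput column is passengers times speed (passenger-miles per hour). A small sketch that recomputes it and contrasts it with latency; the dictionary layout and function name are illustrative.

```python
# (speed in mph, DC-to-Paris hours, passengers), from the table above.
PLANES = {
    "Boeing 747": (610, 6.5, 470),
    "Concorde": (1350, 3.0, 132),
}

def throughput_pmph(speed_mph: float, passengers: int) -> float:
    """Passenger-miles per hour."""
    return speed_mph * passengers

for name, (speed, hours, passengers) in PLANES.items():
    print(name, hours, throughput_pmph(speed, passengers))

# Latency favors the Concorde, throughput favors the 747.
latency_ratio = PLANES["Boeing 747"][1] / PLANES["Concorde"][1]
throughput_ratio = throughput_pmph(610, 470) / throughput_pmph(1350, 132)
print(round(latency_ratio, 2), round(throughput_ratio, 2))
```

Which plane is "faster" depends on whether the metric is time per trip or passenger-miles per hour, which is why the lecture keeps the two definitions separate.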


8/25/03

Definitions

• Performance is in units of things per sec
– bigger is better
• If we are primarily concerned with response time
– performance(x) = 1 / execution_time(x)

"X is n times faster than Y" means

$$ n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)} $$


8/25/03

Amdahl’s Law

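$$ \text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right] $$

$$ \text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} $$

Best you could ever hope to do:

$$ \text{Speedup}_{maximum} = \frac{1}{1 - \text{Fraction}_{enhanced}} $$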
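Amdahl’s Law is easy to evaluate numerically. The sketch below uses made-up inputs (40% of execution time enhanced, sped up by 10x) purely for illustration; the function names are not from the lecture.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

def amdahl_limit(fraction_enhanced: float) -> float:
    """Upper bound as the enhancement's own speedup goes to infinity."""
    return 1.0 / (1.0 - fraction_enhanced)

# Hypothetical case: 40% of the time is enhanced by a factor of 10.
print(round(amdahl_speedup(0.4, 10), 4))  # 1.5625
print(round(amdahl_limit(0.4), 4))        # 1.6667
```

Even an infinitely fast enhancement cannot beat the limit set by the unenhanced 60%.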


8/25/03

Metrics of Performance

[Figure: levels of a computer system, each paired with its usual performance metric.]

• Application: Answers per month, Operations per second
• Programming Language, Compiler, ISA: (millions) of Instructions per second: MIPS; (millions) of (FP) operations per second: MFLOP/s
• Datapath, Control: Megabytes per second
• Function Units, Transistors, Wires, Pins: Cycles per second (clock rate)


8/25/03

Computer Performance

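$$ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{inst count}} \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Seconds}}{\text{Cycle}}}_{\text{cycle time}} $$

               Inst Count   CPI   Clock Rate
Program        X
Compiler       X            (X)
Inst. Set      X            X
Organization                X     X
Technology                        X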


8/25/03

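Cycles Per Instruction (Throughput)

“Average Cycles per Instruction”

CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count

$$ \text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j $$

“Instruction Frequency”

$$ \text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j \quad \text{where} \quad F_j = \frac{I_j}{\text{Instruction Count}} $$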


8/25/03

Example: Calculating CPI bottom up

Typical Mix of instruction types in program

Base Machine (Reg / Reg)

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)
Total                    1.5
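The bottom-up CPI calculation can be reproduced from the instruction mix. A minimal sketch; the `MIX` layout and function names are illustrative, and the instruction count and clock rate in the last lines are made-up values to show the CPU time formula.

```python
# op -> (frequency, cycles), from the Base Machine table.
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

def cpi(mix) -> float:
    """CPI = sum over instruction classes of CPI_j * F_j."""
    return sum(freq * cycles for freq, cycles in mix.values())

def time_share(mix) -> dict:
    """Fraction of execution time spent in each instruction class."""
    total = cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}

def cpu_time(inst_count: float, cpi_value: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count * CPI * cycle time."""
    return inst_count * cpi_value / clock_rate_hz

print(round(cpi(MIX), 2))
print({op: round(100 * share) for op, share in time_share(MIX).items()})
# Hypothetical program: 1e9 instructions on a 1 GHz clock.
print(round(cpu_time(1e9, cpi(MIX), 1e9), 2))
```

Note that ALU operations are half of the instructions but only a third of the time, which is why the "% Time" column, not the frequency, shows where an enhancement pays off.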


8/25/03

Example: Branch Stall Impact

• Assume CPI = 1.0 ignoring branches (ideal)
• Assume solution was stalling for 3 cycles
• If 30% branch, Stall 3 cycles on 30%

Op       Freq   Cycles   CPI(i)   (% Time)
Other    70%    1        .7       (37%)
Branch   30%    4        1.2      (63%)

new CPI = 1.9

• New machine is 1/1.9 = 0.53 times as fast (i.e. slow!)
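The branch stall arithmetic generalizes to any branch frequency and stall length. A short sketch with an illustrative function name:

```python
def cpi_with_branch_stalls(ideal_cpi: float, branch_freq: float, stall_cycles: int) -> float:
    """Every branch pays stall_cycles on top of the ideal CPI."""
    return ideal_cpi + branch_freq * stall_cycles

new_cpi = cpi_with_branch_stalls(1.0, 0.30, 3)
relative_speed = 1.0 / new_cpi   # compared with the ideal CPI of 1.0

print(round(new_cpi, 2))         # 1.9
print(round(relative_speed, 2))  # 0.53
```

Nearly half of the ideal machine's performance is lost to a single kind of hazard, which motivates the branch handling techniques covered later in the course.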


8/25/03

Speed Up Equation for Pipelining

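$$ \text{CPI}_{pipelined} = \text{Ideal CPI} + \text{Average stall cycles per Inst} $$

$$ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}} $$

For simple RISC pipeline, CPI = 1:

$$ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}} $$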
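The speedup equation can be evaluated directly. In this sketch the 5-stage depth and the 0.9 stall cycles per instruction are illustrative inputs (0.9 matches the 30% branches × 3 stall cycles of the earlier example), and `cycle_time_ratio` stands for unpipelined cycle time divided by pipelined cycle time.

```python
def pipeline_speedup(depth: int, stall_cpi: float,
                     ideal_cpi: float = 1.0, cycle_time_ratio: float = 1.0) -> float:
    """Speedup of a pipelined machine over the unpipelined one."""
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_time_ratio

print(pipeline_speedup(5, 0.0))            # 5.0, the ideal: speedup = depth
print(round(pipeline_speedup(5, 0.9), 2))  # 2.63 once stalls are counted
```

With no stalls and equal cycle times the speedup equals the pipeline depth, matching the "Potential speedup = Number pipe stages" lesson.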


8/25/03

SPEC: System Performance Evaluation Cooperative

• First Round 1989
– 10 programs yielding a single number (“SPECmarks”)

• Second Round 1992
– SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs)
» Compiler Flags unlimited. March 93 of DEC 4000 Model 610:
spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(b,a,c)”
wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas

• Third Round 1995
– new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
– “benchmarks useful for 3 years”
– Single flag setting for all programs: SPECint_base95, SPECfp_base95

• Fourth Round 2000: 26 apps
– analysis and simulation programs
– Compression: bzip2, gzip
– Integrated circuit layout, ray tracing, lots of others


8/25/03

How to Summarize Performance

• Arithmetic mean (weighted arithmetic mean) tracks execution time: $\left(\sum T_i\right)/n$ or $\sum (W_i \times T_i)$

• Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: $n / \sum (1/R_i)$ or $1 / \sum (W_i / R_i)$, with weights that sum to 1

• Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)

• But do not take the arithmetic mean of normalized execution time, use the geometric mean:

$$ \left( \prod_{j=1}^{n} \frac{T_j}{N_j} \right)^{1/n} $$
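A small experiment shows why the arithmetic mean of normalized times is unsafe. The two machines and their execution times below are invented for illustration; the function names are not from the lecture.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical execution times (seconds) for two programs on two machines.
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]

b_normalized_to_a = [b / a for a, b in zip(times_a, times_b)]  # [10, 0.1]
a_normalized_to_b = [a / b for a, b in zip(times_a, times_b)]  # [0.1, 10]

# The arithmetic mean says each machine is about 5x slower than the other.
print(arithmetic_mean(b_normalized_to_a), arithmetic_mean(a_normalized_to_b))
# The geometric mean gives the same answer whichever machine is the reference.
print(round(geometric_mean(b_normalized_to_a), 6), round(geometric_mean(a_normalized_to_b), 6))
```

The geometric mean is consistent under a change of reference machine, but unlike the arithmetic mean of raw times it does not track total execution time (here machine B actually finishes the pair of programs about 9x sooner).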


8/25/03

SPEC First Round

• One program: 99% of time in single line of code
• New front-end compiler could improve dramatically

[Figure: SPEC Perf (0 to 800) by benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv.]


8/25/03

Performance Evaluation

• “For better or worse, benchmarks shape a field”
• Good products created when have:
– Good benchmarks
– Good ways to summarize performance
• Since sales are in part a function of performance relative to the competition, companies invest in improving the product as reported by the performance summary
• If benchmarks/summary inadequate, then choose between improving product for real programs vs. improving product to get more sales; Sales almost always wins!
• Execution time is the measure of computer performance!


8/25/03

Integrated Circuits Costs

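$$ \text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}} $$

$$ \text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}} $$

$$ \text{Dies per wafer} = \frac{\pi \times (\text{Wafer\_diam}/2)^2}{\text{Die\_Area}} - \frac{\pi \times \text{Wafer\_diam}}{\sqrt{2 \times \text{Die\_Area}}} - \text{Test\_Die} $$

$$ \text{Die Yield} = \text{Wafer\_yield} \times \left( 1 + \frac{\text{Defect\_Density} \times \text{Die\_area}}{\alpha} \right)^{-\alpha} $$

Die Cost goes roughly with (die area)^4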
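The cost model can be turned into a short calculator. This is a sketch under stated assumptions: the symbol α and its value of 3.0 follow the Hennessy and Patterson form of the yield model, the 15 cm wafer diameter is an assumed value, and the 386DX inputs (43 mm², 1.0 defects/cm², $900 wafer, 360 dies, 71% yield) come from the Real World Examples table.

```python
import math

def dies_per_wafer(wafer_diam_cm: float, die_area_cm2: float, test_dies: int = 0) -> float:
    """Wafer area over die area, minus dies lost at the edge, minus test dies."""
    wafer_area = math.pi * (wafer_diam_cm / 2) ** 2
    edge_loss = math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss - test_dies

def die_yield(defects_per_cm2: float, die_area_cm2: float,
              alpha: float = 3.0, wafer_yield: float = 1.0) -> float:
    """alpha = 3.0 is an assumed process parameter, not a value from the slide."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

def die_cost(wafer_cost: float, dies: float, yield_fraction: float) -> float:
    return wafer_cost / (dies * yield_fraction)

def ic_cost(die: float, testing: float, packaging: float, final_test_yield: float) -> float:
    return (die + testing + packaging) / final_test_yield

# 386DX row: the table's own dies/wafer and yield give a die cost near $4.
print(round(die_cost(900, 360, 0.71), 2))
# The formulas with the assumed wafer size and alpha land close to the table.
print(round(dies_per_wafer(15, 0.43)), round(die_yield(1.0, 0.43), 2))
```

Because yield falls as area grows while dies per wafer also fall, die cost rises much faster than linearly with area, which is the point behind the "(die area)^4" rule of thumb.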


8/25/03

Real World Examples

Chip          Metal    Line    Wafer    Defect   Area    Dies/   Yield   Die Cost
              layers   width   cost     /cm2     (mm2)   wafer
386DX         2        0.90    $900     1.0      43      360     71%     $4
486DX2        3        0.80    $1200    1.0      81      181     54%     $12
PowerPC 601   4        0.80    $1700    1.3      121     115     28%     $53
HP PA 7100    3        0.80    $1300    1.0      196     66      27%     $73
DEC Alpha     3        0.70    $1500    1.2      234     53      19%     $149
SuperSPARC    3        0.70    $1700    1.6      256     48      13%     $272
Pentium       3        0.80    $1500    1.5      296     40      9%      $417

– From "Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15


8/25/03

Summary, #1

• Designing to Last through Trends

        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
SPEC RATING: 2x in 1.5 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years

• 6 yrs to graduate => 16X CPU speed, DRAM/Disk size

• Time to run the task
– Execution time, response time, latency

• Tasks per day, hour, week, sec, ns, …
– Throughput, bandwidth

• “X is n times faster than Y” means

$$ n = \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} $$


8/25/03

Summary, #2

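• Amdahl’s Law:

$$ \text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} $$

• CPI Law:

$$ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} $$

• Execution time is the REAL measure of computer performance!

• Good products created when have:
– Good benchmarks, good ways to summarize performance

• Die Cost goes roughly with (die area)^4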




