
CSE 141 – Computer Architecture
Summer Session I, 2005

Lecture 3
Performance and Single Cycle CPU Part 1

Pramod V. Argade

Pramod Argade, UCSD CSE 141, Spring 2005

CSE141: Introduction to Computer Architecture

Instructor: Pramod V. Argade ([email protected])

Office Hour:
– Tue. 7:30 - 9:00 (Center 105)
– Wed. 5:00 - 6:00 PM (HSS 1330)
– By Appointment

TAs:
– Chengmo Yang: [email protected]
– Rao: [email protected]

Lecture: Mon/Wed. 6:00 - 8:50 PM, HSS 1330

Textbook: Computer Organization & Design: The Hardware/Software Interface, 3rd Edition. Authors: Patterson and Hennessy

Web-page: http://www.cse.ucsd.edu/classes/su05/cse141


Summer Session I, 2005: CSE141 Course Schedule

Lecture # | Date | Time | Room | Topic | Quiz topic | Homework Due
1 | Mon. 6/27 | 6 - 8:50 PM | HSS 1330 | Introduction, Ch. 1; ISA, Ch. 2 | - | -
2 | Wed. 6/29 | 6 - 8:50 PM | HSS 1330 | Arithmetic, Ch. 3 | ISA, Ch. 2 | #1
- | Mon. 7/4 | - | - | July 4th Holiday (No Class) | - | -
3 | Wed. 7/6 | 6 - 8:50 PM | HSS 1330 | Performance, Ch. 4; Single-cycle CPU, Ch. 5 | Arithmetic, Ch. 3 | #2
4 | Mon. 7/11 | 6 - 8:50 PM | HSS 1330 | Single-cycle CPU, Ch. 5 Cont.; Multi-cycle CPU, Ch. 5 | Performance, Ch. 4 | #3
5 | Tue. 7/12 | 7:30 - 8:50 PM | HSS 1330 | Multi-cycle CPU, Ch. 5 Cont. (July 4th make up class) | - | -
6 | Wed. 7/13 | 6 - 8:50 PM | HSS 1330 | Single and Multicycle CPU Examples and Review for Midterm | Single-cycle CPU, Ch. 5 | -
7 | Mon. 7/18 | 6 - 8:50 PM | HSS 1330 | Mid-term Exam; Exceptions | - | #4
8 | Tue. 7/19 | 7:30 - 8:50 PM | HSS 1330 | Pipelining, Ch. 6 (July 4th make up class) | - | -
9 | Wed. 7/20 | 6 - 8:50 PM | HSS 1330 | Hazards, Ch. 6 | - | -
10 | Mon. 7/25 | 6 - 8:50 PM | HSS 1330 | Memory Hierarchy & Caches, Ch. 7 | Hazards, Ch. 6 | #5
11 | Wed. 7/27 | 6 - 8:50 PM | HSS 1330 | Virtual Memory, Ch. 7; Course Review | Cache, Ch. 7 | #6
12 | Sat. 7/30 | TBD | TBD | Final Exam | - | -


Announcements

Discussion Sections for 141: Thursday, 7:30 - 8:30, WLH-2205

Office Hour:
– Tue. 7:30 - 9:00 (Center 105)
– Wed. 5:00 - 6:00 PM (HSS 1330)
– By Appointment

Reading Assignment
– Chapter 4. Assessing and Understanding Performance, Sections 4.1 - 4.6

Homework 3: Due Mon., July 11th in class
– 4.1, 4.2, 4.6, 4.7, 4.8, 4.9, 4.12, 4.19, 4.22, 4.45

Quiz
– When: Mon., July 11th, first 10 minutes of the class
– Topic: Performance, Chapter 4
– Need: Paper, pen


Performance Assessment

Example definitions of performance
– Passengers * MPH
– Cruising speed
– Passenger capacity
– (Passenger * MPH)/$ (need more data!)

Clearly formulate your definition of performance

Airplane | Passenger Capacity | Cruising Range (miles) | Cruising Speed (m.p.h.) | Passenger Throughput (passenger x m.p.h.)
Boeing 777 | 375 | 4630 | 610 | 228,750
Boeing 747 | 470 | 4150 | 610 | 286,700
Douglas DC-8-50 | 146 | 8720 | 544 | 79,424
Airbus A380 | 555 | 8000 | 645 | 357,975


Why worry about performance?

Learn to measure, report, and summarize
Make intelligent choices
See through the marketing hype

As a designer or purchaser, assess:
– which system has best performance?
– which system has lowest cost?
– which system has highest performance/cost?

Formulate metrics for performance measurement– How to report relative performance?

Understand impact of architectural choices on performance


Computer Performance: TIME, TIME, TIME

Response Time (latency)
– How long does it take for my job to run?
– How long does it take to execute a job?
– How long must I wait for the database query?

Throughput
– How many jobs can the machine run at once?
– What is the average execution rate?
– How much work is getting done?

If we upgrade a machine with a new processor what do we decrease?

If we add a new machine to the lab what do we increase?


Execution Time

% time program
... program results ...
90.7u 12.9s 2:39 65%

Elapsed Time (2:39)
– Counts everything (disk and memory accesses, I/O, etc.)
– A useful number, but often not good for comparison purposes

Total CPU time (103.6 s)
– Doesn't count I/O or time spent running other programs
– Composed of system time (12.9 s) and user time (90.7 s)

Our focus: user CPU time
– Time spent executing the lines of code that are "in" our program


A Definition of Performance

For some program running on machine X:
$$\text{Performance}_X = \frac{1}{\text{Execution time}_X}$$

“Machine X is n times faster than Y” means:
$$n = \frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X}$$

Problem: the same pair of execution times can be described in conflicting ways.
Execution time: Machine A: 12 seconds, B: 20 seconds
– A/B = 0.6, so A is 40% faster, or 1.4X faster, or B is 40% slower
– B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower

Need a precise definition: “X is n times faster than Y”, n > 1
– A is 1.67 times faster than B
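As a minimal sketch of the definition above, the ratio can be computed directly from the two execution times. The function name `times_faster` is an illustrative assumption, not something from the course material.

```python
def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that 'X is n times faster than Y'.

    n = Performance_X / Performance_Y = exec_time_y / exec_time_x
    """
    return exec_time_y / exec_time_x


# Machine A runs the program in 12 s, machine B in 20 s.
n = times_faster(12.0, 20.0)
print(f"A is {n:.2f} times faster than B")
```

Putting the slower machine's time in the numerator keeps n above 1, which is what the precise definition asks for.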


Clock Cycles

Operation of a conventional computer is controlled by clock “ticks”

Clock rate (frequency) = cycles per second (Hertz)
– MHz = million cycles per second
– GHz = giga (billion) cycles per second

Cycle time = time between ticks = seconds per cycle
– A 2 GHz clock => Cycle time = 1/(2 × 10^9) s = 0.5 nanoseconds

Instead of reporting execution time in seconds, we often use clock cycles:
$$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$

[Figure: clock ticks along a time axis]
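The relation between clock rate, cycle time, and execution time can be sketched as follows. The function names and the 10-billion-cycle program are made-up illustrations; only the 2 GHz clock comes from the slide.

```python
def cycle_time_seconds(clock_rate_hz: float) -> float:
    """Cycle time is the reciprocal of the clock rate."""
    return 1.0 / clock_rate_hz


def execution_time_seconds(cycles: float, clock_rate_hz: float) -> float:
    """seconds/program = cycles/program * seconds/cycle."""
    return cycles * cycle_time_seconds(clock_rate_hz)


two_ghz = 2e9
print(f"cycle time = {cycle_time_seconds(two_ghz) * 1e9:.1f} ns")

# Hypothetical program that needs 10 billion cycles on the 2 GHz clock.
print(f"execution time = {execution_time_seconds(10e9, two_ghz):.1f} s")
```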


How many cycles are required for a program?

Assumption: # of cycles = # of instructions

[Figure: time axis with the 1st, 2nd, 3rd, 4th, 5th, 6th ... instructions, each occupying one clock cycle]

Assumption is incorrect
– Typically, different instructions take different numbers of clock cycles
– Integer multiply instructions are multi-cycle
– Floating point instructions are multi-cycle

[Figure: time axis with the 1st through 5th instructions occupying differing numbers of clock cycles]


Now that we understand cycles

A given program will require
– some number of instructions (machine instructions)
– some number of cycles
– some number of seconds

We have a vocabulary that relates these quantities:
– cycle time (seconds per cycle)
– clock rate (cycles per second)
– CPI (cycles per instruction)



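$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$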



CPI Example 2.1

Suppose we have two implementations of the same instruction set architecture (ISA). For some program,
– Machine A has a clock cycle time of 10 ns and a CPI of 2.0
– Machine B has a clock cycle time of 20 ns and a CPI of 1.2

Which machine is faster for this program, and by how much?
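One way to work this example: both machines implement the same ISA, so the instruction count is identical and it is enough to compare the average time per instruction (CPI × cycle time). The sketch below uses hypothetical names; only the CPI and cycle-time values come from the slide.

```python
def time_per_instruction_ns(cpi: float, cycle_time_ns: float) -> float:
    """Average time per instruction = CPI * clock cycle time."""
    return cpi * cycle_time_ns


machine_a = time_per_instruction_ns(cpi=2.0, cycle_time_ns=10.0)
machine_b = time_per_instruction_ns(cpi=1.2, cycle_time_ns=20.0)

# Same instruction count on both machines, so the ratio of
# per-instruction times equals the ratio of execution times.
speedup = machine_b / machine_a
print(f"A: {machine_a:.0f} ns/instruction, B: {machine_b:.0f} ns/instruction")
print(f"A is {speedup:.1f} times faster than B")
```

Machine A needs 20 ns per instruction against 24 ns for machine B, so A is 1.2 times faster even though its CPI is higher.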


Compiler Optimization: Example 2.2

A compiler designer is trying to decide between two code sequences for a particular machine with 3 instruction classes.
– CPI: Class A = 1, Class B = 2, and Class C = 3
– Code sequence X has 5 instructions: 2 of A, 1 of B, and 2 of C
– Code sequence Y has 6 instructions: 4 of A, 1 of B, and 1 of C

Which sequence will be faster? By how much?
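A short sketch of how this example can be worked: count total cycles for each sequence, since both run on the same machine with the same clock. The dictionary layout and function name are assumptions made for illustration.

```python
# Cycles per instruction for each class, from the example.
CLASS_CPI = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for each code sequence.
SEQUENCE_X = {"A": 2, "B": 1, "C": 2}
SEQUENCE_Y = {"A": 4, "B": 1, "C": 1}


def total_cycles(sequence: dict) -> int:
    """Sum of (instruction count * CPI) over all classes."""
    return sum(count * CLASS_CPI[cls] for cls, count in sequence.items())


cycles_x = total_cycles(SEQUENCE_X)
cycles_y = total_cycles(SEQUENCE_Y)
cpi_x = cycles_x / sum(SEQUENCE_X.values())
cpi_y = cycles_y / sum(SEQUENCE_Y.values())

print(f"X: {cycles_x} cycles, CPI = {cpi_x:.2f}")
print(f"Y: {cycles_y} cycles, CPI = {cpi_y:.2f}")
print(f"Y is {cycles_x / cycles_y:.2f} times faster than X")
```

Sequence Y executes more instructions but needs 9 cycles against 10 for X, which illustrates why instruction count alone does not determine performance.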


Execution time: Contributing Factors

Improve performance => reduce execution time
– Reduce instruction count (Programmer, Compiler)
– Reduce cycles per instruction (ISA, Machine designer)
– Reduce clock cycle time (Hardware designer, Process engineer)

 | Instruction Count | CPI | Clock Rate
Program | X | |
Compiler | X | (X) |
ISA | X | X |
Organization | | X | X
Technology | | | X


Performance

Performance is determined by program execution time

Do any of the other variables equal performance?
– # of cycles to execute program?
– # of instructions in program?
– # of cycles per second?
– average # of cycles per instruction?
– average # of instructions per second?

Common pitfall: thinking one of the variables is indicative of performance when it really isn’t.


Performance Variation

 | Number of instructions | CPI | Clock Cycle Time
Same machine, different programs | different | similar | same
Same programs, different machines, same ISA | same | different | different
Same programs, different machines | somewhat different | different | different

$$\text{CPU Execution Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}$$


Amdahl’s Law

The impact of a performance improvement is limited by the percent of execution time affected by the improvement

$$\text{Execution time after improvement} = \frac{\text{Execution Time Affected}}{\text{Amount of Improvement}} + \text{Execution Time Unaffected}$$

Make the common case fast!

Amdahl’s law sets a limit on how much improvement can be made


Amdahl’s Law: Example 2.3

Example: A program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

Is it possible to get the program to run 5 times faster?
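A minimal sketch of how Amdahl's law answers both questions: solve the execution-time equation for the amount of improvement. The function name and the choice to return `None` when the target is unreachable are assumptions made for illustration.

```python
from typing import Optional


def required_improvement(total_s: float, affected_s: float,
                         target_speedup: float) -> Optional[float]:
    """Solve  total/target = affected/n + unaffected  for n.

    Returns None when no finite improvement n can reach the target.
    """
    unaffected_s = total_s - affected_s
    target_time_s = total_s / target_speedup
    time_left_for_affected = target_time_s - unaffected_s
    if time_left_for_affected <= 0:
        return None  # the unaffected part alone already uses the whole budget
    return affected_s / time_left_for_affected


print(required_improvement(100.0, 80.0, 4.0))  # multiply must be 16x faster
print(required_improvement(100.0, 80.0, 5.0))  # None: not achievable
```

For a 4× speedup the program must finish in 25 s, leaving 5 s for multiplication, so multiply must become 16 times faster. For a 5× speedup the budget is 20 s, which the 20 s of unaffected work already consumes, so no improvement to multiply is enough.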


Metrics of Performance

[Figure: levels of a computer system (Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors/Wires/Pins), annotated with the metrics used at each level]

Metrics shown:
– Answers per month
– Useful operations per second
– (Millions of) instructions per second: MIPS
– (Millions of) (F.P.) operations per second: MFLOP/s
– Megabytes per second
– Cycles per second (clock rate)

Each metric has a place and a purpose, and each can be misused


Benchmarks

Performance is best determined by running a real application
– Use programs typical of expected workload
– Or, typical of expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.

Small benchmarks
– nice for architects and designers
– easy to standardize
– can be abused

SPEC (System Performance Evaluation Cooperative)
– companies have agreed on a set of real programs and inputs
– can still be abused (Intel’s “other” bug)
– valuable indicator of performance (and compiler technology)


SPEC ‘89

Compiler “enhancements” and performance

[Figure: bar chart of SPEC performance ratio (0 to 800) per benchmark (gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv), comparing the compiler with the enhanced compiler]


SPEC ‘95*

Benchmark | Description
go | Artificial intelligence; plays the game of Go
m88ksim | Motorola 88k chip simulator; runs test program
gcc | The Gnu C compiler generating SPARC code
compress | Compresses and decompresses file in memory
li | Lisp interpreter
ijpeg | Graphic compression and decompression
perl | Manipulates strings and prime numbers in the special-purpose programming language Perl
vortex | A database program
tomcatv | A mesh generation program
swim | Shallow water model with 513 x 513 grid
su2cor | Quantum physics; Monte Carlo simulation
hydro2d | Astrophysics; hydrodynamic Navier-Stokes equations
mgrid | Multigrid solver in 3-D potential field
applu | Parabolic/elliptic partial differential equations
turb3d | Simulates isotropic, homogeneous turbulence in a cube
apsi | Solves problems regarding temperature, wind velocity, and distribution of pollutant
fpppp | Quantum chemistry
wave5 | Plasma physics; electromagnetic particle simulation

* SPEC CPU2000 consists of 12 integer and 14 floating point benchmarks


SPEC ‘95

Does doubling the clock rate double the performance?
Can a machine with a slower clock rate have better performance?

[Figure: two plots, SPECint and SPECfp (0 to 10), versus clock rate (50 to 250 MHz) for the Pentium and Pentium Pro]


MIPS, MFLOPS etc.

• Hard to relate MIPS/MFLOPS with execution time
• Highly program-dependent metrics
• Deceptive (See pages 78 - 79 in the text)
• MIPS would be higher for a program using simple instructions (e.g. a loop of NOPs!)

• MIPS - million instructions per second
$$\text{MIPS} = \frac{\text{number of instructions executed in program}}{\text{execution time in seconds} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$

• MFLOPS - million floating point operations per second
$$\text{MFLOPS} = \frac{\text{number of floating point operations executed in program}}{\text{execution time in seconds} \times 10^6}$$

$$\text{CPU Execution Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}$$
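The two forms of the MIPS formula agree because execution time = instruction count × CPI / clock rate. The sketch below checks this on made-up numbers (a 500 MHz clock, CPI of 2.0, one billion instructions); none of these values come from the lecture.

```python
def mips_from_time(instruction_count: float, exec_time_s: float) -> float:
    """MIPS = instructions executed / (execution time * 10^6)."""
    return instruction_count / (exec_time_s * 1e6)


def mips_from_clock(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# Hypothetical machine and program, for illustration only.
clock_rate_hz = 500e6
cpi = 2.0
instruction_count = 1e9

exec_time_s = instruction_count * cpi / clock_rate_hz
print(f"execution time = {exec_time_s:.1f} s")
print(f"MIPS (from time)  = {mips_from_time(instruction_count, exec_time_s):.1f}")
print(f"MIPS (from clock) = {mips_from_clock(clock_rate_hz, cpi):.1f}")
```

Note that the clock-rate form contains no instruction count at all, which is one reason a high MIPS rating says little about how long a particular program takes.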


RISC Processor: Example 2.4

Base Machine (Reg / Reg)

Op | Freq | Cycles | CPI(i) | % Time
ALU | 50% | 1 | |
Load | 20% | 5 | |
Store | 10% | 3 | |
Branch | 20% | 2 | |

• What is average CPI?
• What percentage of time is spent in each instruction class?
• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
• How does this compare with using branch prediction to shave a cycle off the branch time?
• What if two ALU instructions could be executed at once?
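The questions above can be explored with a short sketch. It assumes the instruction count and clock cycle time stay fixed, so speedup is the ratio of average CPIs, and it models "two ALU instructions at once" as an effective ALU cost of 0.5 cycles; both are assumptions of this sketch, and the helper names are hypothetical.

```python
# Instruction mix from the table: class -> (frequency, cycles)
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def average_cpi(mix: dict) -> float:
    """Weighted average: sum of frequency * cycles over all classes."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix: dict) -> dict:
    """Fraction of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}


def with_cycles(mix: dict, op: str, cycles: float) -> dict:
    """Copy of the mix with one class's cycle count replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed


base_cpi = average_cpi(BASE_MIX)
print(f"average CPI = {base_cpi:.2f}")
for op, share in time_share(BASE_MIX).items():
    print(f"  {op}: {share:.1%} of time")

scenarios = {
    "load in 2 cycles": with_cycles(BASE_MIX, "Load", 2),
    "branch in 1 cycle": with_cycles(BASE_MIX, "Branch", 1),
    "two ALU ops per cycle": with_cycles(BASE_MIX, "ALU", 0.5),
}
for label, mix in scenarios.items():
    print(f"{label}: CPI = {average_cpi(mix):.2f}, "
          f"speedup = {base_cpi / average_cpi(mix):.3f}")
```

Under these assumptions the load improvement helps most, because loads account for the largest share of execution time despite being only 20% of the instructions.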


Performance assessment is tricky

Performance is specific to particular program(s)
– Total execution time is a consistent summary of performance

For a given architecture, performance increases come from:
– increases in clock rate (without adverse CPI effects)
– improvements in processor organization that lower CPI
– compiler enhancements that lower CPI and/or instruction count

Pitfall: expecting improvement in one aspect of a machine’s performance to improve the total performance by a proportional amount

You should not always believe everything you read! Read carefully!








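Datapath and Control Design

The Five Classic Components of a Computer
– Control
– Datapath
– Memory
– Input
– Output

[Figure: block diagram in which Control and Datapath together form the Processor, connected to Memory, Input, and Output]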


Two Types of Logic Components

Combinational
– Elements that operate on data values
– Produces same output if given same inputs

State Elements
– Contains internal storage
– May produce different output if given same inputs
– Clock is used to determine when a state element should be written

[Figure: a Combinational Logic block with inputs A and B and output C = f(A,B); a State Element block with inputs A, B, and clk and output C = f(A,B,state)]


Clock

Clock is a free running signal
– Fixed cycle time (period)
– Frequency = 1/(cycle time)
– Duty Cycle: (% high)/(% low), e.g. 50/50 duty cycle in the figure
– Jitter: Uncertainty/wobble in rising or falling edge

[Figure: clock waveform marking one Clock Cycle (Period), a Rising Edge, and a Falling Edge]


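Storage Elements

D Latch
• Two inputs:
– the data value to be stored (D)
– the clock signal (C) indicating when to read & store D
• Two outputs:
– the value of the internal state (Q) and its complement (_Q)

[Figure: D latch gate diagram with inputs D and C and outputs Q and _Q, with a timing diagram of D, C, and Q]

Falling edge triggered D flip-flop
• Output changes only on the clock edge

[Figure: D flip-flop built from two D latches in series, with inputs D and C and outputs Q and _Q, with a timing diagram of D, C, and Q]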


Edge-triggered Clocking

Values stored in the machine are updated on a clock edge
– The clock edge can be either rising or falling

By default a state element is written every clock edge
– An explicit write control signal is required otherwise.

Edge triggered methodology allows, in the same clock cycle, to:
– read the contents of a register
– send the value through some combinational logic, and
– write the contents of the same or another register

Possible to have the same state element as input and output

[Figure: within one clock cycle, State element 1 feeds Combinational logic, which feeds State element 2; a second diagram shows State element 1 feeding Combinational logic that writes back to State element 1]


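CPU: Clocking

All storage elements are clocked by the same clock edge

[Figure: timing diagram of Clk showing Setup and Hold windows around each clock edge, with Don’t Care intervals between them, for storage elements driven by a common CLK]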


Register: A Storage Element

– Similar to the D Flip Flop except
• N-bit input and output
• Write Enable input
– Write Enable:
• 0: Data Out will not change
• 1: Data Out will become Data In (on the clock edge)

[Figure: register with N-bit Data In, N-bit Data Out, Write Enable, and Clk]


Register File

Register File consists of 32 registers:
– Two 32-bit output busses: busA and busB
– One 32-bit input bus: busW

Register is selected by:
– RA selects the register to put on busA
– RB selects the register to put on busB
– RW selects the register to be written via busW when Write Enable is 1

Clock input (CLK)

[Figure: register file of 32 32-bit registers with 5-bit select inputs RW, RA, and RB; 32-bit input busW; 32-bit outputs busA and busB; Clk and Write Enable inputs]
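The register file's behavior can be modeled in a few lines. This is a behavioral sketch under stated assumptions, not a hardware description: reads are treated as combinational, writes take effect only when `clock_edge` is called with write enable set, and the class and method names are hypothetical. It also does not model MIPS register 0 being hardwired to zero.

```python
class RegisterFile:
    """Behavioral model of a 32 x 32-bit register file."""

    def __init__(self, count: int = 32, width: int = 32):
        self.regs = [0] * count
        self.mask = (1 << width) - 1

    def read(self, ra: int, rb: int):
        """RA and RB select the registers driven onto busA and busB."""
        return self.regs[ra], self.regs[rb]

    def clock_edge(self, rw: int, bus_w: int, write_enable: bool) -> None:
        """On a clock edge, write busW into register RW if enabled."""
        if write_enable:
            self.regs[rw] = bus_w & self.mask


rf = RegisterFile()
rf.clock_edge(rw=3, bus_w=0x1234, write_enable=True)
rf.clock_edge(rw=4, bus_w=0x5678, write_enable=False)  # ignored
bus_a, bus_b = rf.read(ra=3, rb=4)
print(hex(bus_a), hex(bus_b))
```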


Memory

– One input bus: Data In
– One output bus: Data Out

Memory word is selected by:
– Address selects the word to put on Data Out
– Write Enable = 1: address selects the memory word to be written via the Data In bus

Clock input (CLK)
– The CLK input is a factor ONLY during write operation
– During read operation, behaves as a combinational logic block: Address valid => Data Out valid after “access time.”

[Figure: memory block with inputs Clk, Write Enable, Address, and 32-bit Data In, and a 32-bit Data Out]


Basic 4 x 2 Static RAM

[Figure: 4 x 2 static RAM built from eight D latches (each with D, C, Enable, and Q); a 2-to-4 decoder driven by Address selects one of rows 0 to 3; inputs Din[1], Din[0], and Write enable; outputs Dout[1] and Dout[0]]


Register Transfer Language (RTL)

Mechanism for describing the movement and manipulation of data between storage elements.

Example:
R[rd] ← R[rs] + R[rt]
PC ← PC + 4
R[rt] ← Mem[R[rs] + immediate]
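To make the RTL notation concrete, here is a sketch that treats each RTL statement as an update to a small machine state. The `MachineState` class, the function names, and the dictionary-backed memory are assumptions made for illustration; they are not part of the RTL notation itself.

```python
class MachineState:
    """Storage elements named in the RTL examples: R, Mem, and PC."""

    def __init__(self):
        self.R = [0] * 32   # register file
        self.Mem = {}       # memory, keyed by byte address
        self.PC = 0         # program counter


def add(state: MachineState, rd: int, rs: int, rt: int) -> None:
    # R[rd] <- R[rs] + R[rt]
    state.R[rd] = (state.R[rs] + state.R[rt]) & 0xFFFFFFFF


def next_pc(state: MachineState) -> None:
    # PC <- PC + 4
    state.PC = (state.PC + 4) & 0xFFFFFFFF


def load_word(state: MachineState, rt: int, rs: int, immediate: int) -> None:
    # R[rt] <- Mem[R[rs] + immediate]
    address = (state.R[rs] + immediate) & 0xFFFFFFFF
    state.R[rt] = state.Mem.get(address, 0)


s = MachineState()
s.R[1], s.R[2] = 5, 7
add(s, rd=3, rs=1, rt=2)
next_pc(s)
s.Mem[0x100] = 42
s.R[4] = 0x104
load_word(s, rt=5, rs=4, immediate=-4)
print(s.R[3], s.PC, s.R[5])
```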


A Simple Implementation of MIPS CPU

Simplified to contain only:
– Memory-reference instructions: lw, sw
– Arithmetic-logical instructions: add, sub, and, or, slt
– Control flow instructions: beq, j

Execution Time = Instructions * CPI * Cycle Time

Processor design (datapath and control) will determine:
– Clock cycle time
– Clock cycles per instruction

We will design a single cycle processor:
– Advantage: One clock cycle per instruction
– Disadvantage: long cycle time


Arithmetic Instructions (R-Type)

ADD, SUB, AND, OR, SLT

Example: add rd, rs, rt

e.g. add $t3, $s0, $s5
REG[$t3] = REG[$s0] + REG[$s5]

Field | op | rs | rt | rd | shamt | funct
Bits | 31-26 | 25-21 | 20-16 | 15-11 | 10-6 | 5-0
Width | 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits
Value | | s0 | s5 | t3 | |
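The field layout can be checked by packing the example instruction into a 32-bit word. The register numbers ($s0 = 16, $s5 = 21, $t3 = 11) and the opcode and funct values for `add` (0 and 0x20) follow the standard MIPS convention; they are supplied here as assumptions, since the slide shows only the field names. The function name is hypothetical.

```python
def encode_r_type(op: int, rs: int, rt: int, rd: int,
                  shamt: int, funct: int) -> int:
    """Pack R-type fields: op[31:26] rs[25:21] rt[20:16] rd[15:11]
    shamt[10:6] funct[5:0]."""
    return ((op & 0x3F) << 26 | (rs & 0x1F) << 21 | (rt & 0x1F) << 16
            | (rd & 0x1F) << 11 | (shamt & 0x1F) << 6 | (funct & 0x3F))


# Standard MIPS register numbers (assumed, not shown on the slide).
S0, S5, T3 = 16, 21, 11

# add $t3, $s0, $s5  ->  rd = $t3, rs = $s0, rt = $s5
word = encode_r_type(op=0, rs=S0, rt=S5, rd=T3, shamt=0, funct=0x20)
print(f"0x{word:08x}")
```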


Load/Store Instructions (I-Type)

LW, SW

Examples:
lw rt, rs, imm16
sw rt, rs, imm16

e.g. lw $s3, -4($s2)
REG[$s3] = D-MEM[ REG[$s2] - 4 ]

Field | op | rs | rt | immediate
Bits | 31-26 | 25-21 | 20-16 | 15-0
Width | 6 bits | 5 bits | 5 bits | 16 bits
Value | | s2 | s3 | -4


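Branch (I-Type)

Beq

Example: beq rs, rt, imm16

e.g. at address 0x4c: beq $s1, $t3, -12
if( REG[$s1] == REG[$t3] ) {
    new_PC = old_PC + 4 - 12   # new_PC = 0x44
}
else {
    new_PC = old_PC + 4        # new_PC = 0x50
}

Field | op | rs | rt | displacement
Bits | 31-26 | 25-21 | 20-16 | 15-0
Value | | s1 | t3 | -3

The displacement field holds the offset in words: -12 bytes / 4 = -3.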
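The branch example can be reproduced with a short sketch: sign-extend the 16-bit displacement, scale it by 4 to get bytes, and add it to PC + 4. The function names are hypothetical; the addresses and the displacement of -3 come from the example.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def next_pc_beq(pc: int, displacement16: int, taken: bool) -> int:
    """PC + 4 if not taken; PC + 4 + SignExt(displacement) * 4 if taken."""
    fall_through = pc + 4
    if not taken:
        return fall_through & 0xFFFFFFFF
    return (fall_through + sign_extend_16(displacement16) * 4) & 0xFFFFFFFF


# beq at 0x4c with displacement field -3 (stored as 0xFFFD).
print(hex(next_pc_beq(0x4C, 0xFFFD, taken=True)))
print(hex(next_pc_beq(0x4C, 0xFFFD, taken=False)))
```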


Jump (J-Type)

J

Example: j Label

e.g.
0x8000 0000   j Label
…
Label: 0x8444 4444   ADD $s3, $s5, $t1
…

new_PC = 0x8444 4444

Field | op | target address
Bits | 31-26 | 25-0
Width | 6 bits | 26 bits
Value | | 0x111 1111
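The relation between the 26-bit target field and the new PC follows the standard MIPS jump rule, which the slide's numbers are consistent with: the upper 4 bits come from PC + 4 and the lower 28 bits are the target field shifted left by 2. The sketch below assumes that rule; the function names are hypothetical.

```python
def jump_target(pc: int, target26: int) -> int:
    """new_PC = (PC + 4)[31:28] concatenated with target26 and two zero bits."""
    upper = (pc + 4) & 0xF0000000
    return upper | ((target26 & 0x03FFFFFF) << 2)


def target_field(label_address: int) -> int:
    """26-bit field that encodes a word-aligned label address."""
    return (label_address & 0x0FFFFFFF) >> 2


print(hex(target_field(0x84444444)))
print(hex(jump_target(0x80000000, 0x1111111)))
```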


Single Cycle Implementation Datapath and Control

Steps in executing an instruction:
– Instruction Fetch
– Instruction Decode
– Operand Fetch
– Execute
– Result Store
– Next Instruction

[Figure: one Clock Cycle spanning I. Fetch, Decode, Op. Fetch, Execute, Store, and Next PC: complete execution of a single instruction]


Components Required to implement the ISA

Next PC generation
– Add 4 or extended 16-bit immediate to PC

Memory
– Instruction read
– Data read/write

Registers (32 x 32-bit)
– Read register rs
– Read register rt
– Write register rt or rd

Sign extend immediate operand

ALU to operate on the operands


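Abstract / Simplified View: Datapath

[Figure: PC supplies the Address to Instruction memory; the Instruction supplies three Register # inputs to Registers; register Data feeds the ALU; the ALU output supplies the Address to Data memory, whose Data returns to Registers]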


CPU: Instruction Fetch

• RTL version of the instruction fetch step:
– Fetch the Instruction: mem[PC]
– Update the program counter:
• Sequential Code: PC ← PC + 4
• Branch and Jump: PC ← “something else”

[Figure: PC (clocked by Clk) supplies the Address to Instruction Memory, which outputs the 32-bit Instruction Word; an Adder adds 4 to the PC]


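CPU: Register-Register Operations (Add, Subtract etc.)

R[rd] ← R[rs] op R[rt]    Example: addU rd, rs, rt
– Ra, Rb, and Rw come from instruction’s rs, rt, and rd fields
– ALUctr and RegWr: control logic after decoding the instruction

Field | op | rs | rt | rd | shamt | funct
Bits | 31-26 | 25-21 | 20-16 | 15-11 | 10-6 | 5-0
Width | 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits

[Figure: register file of 32 32-bit registers with 5-bit inputs Rw, Ra, Rb fed by Rd, Rs, Rt; 32-bit busA and busB feed the ALU (controlled by ALUctr); the 32-bit Result returns on busW; RegWr and Clk control the write]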


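CPU: Load Operations

R[rt] ← Mem[R[rs] + SignExt[imm16]]    Example: lw rt, rs, imm16

[Figure: load datapath. A RegDst mux selects Rt or Rd as the write register of the register file (Rs and Rt select the read registers; RegWr, Clk); imm16 passes through the Extender (ExtOp) and an ALUSrc mux into the ALU (ALUctr) alongside busA; the ALU output drives Adr of Data Memory (Data In, WrEn driven by MemWr, Clk); a W_Src mux selects the value placed on busW]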


CPU: Store Operations

Mem[R[rs] + SignExt[imm16]] ← R[rt]    Example: sw rt, rs, imm16

[Figure: store datapath. Same structure as the load datapath: register file with a RegDst mux (Rt or Rd), Extender (ExtOp) and ALUSrc mux feeding the ALU (ALUctr), Data Memory (Adr, Data In, WrEn driven by MemWr, Clk), and a W_Src mux; busB supplies Data In to Data Memory]


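CPU: Datapath for Branching

beq rs, rt, imm16    Datapath generates condition (equal)

[Figure: branch datapath. The register file (Rs, Rt; busA, busB) feeds an Equal? comparator that produces Cond; imm16 passes through PC Ext (sign extend to 32 bits and left shift by 2); one Adder computes PC + 4 and a second Adder adds the extended immediate; a mux controlled by nPC_sel selects the next PC, which supplies the Instruction Address]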


CPU: Binary arithmetic for PC

• In theory, the PC is a 32-bit byte address into the instruction memory:
– Sequential operation: PC<31:0> = PC<31:0> + 4
– Branch operation: PC<31:0> = PC<31:0> + 4 + SignExt[Imm16] * 4

• The magic number “4” always comes up because:
– The 32-bit PC is a byte address
– And all our instructions are 4 bytes (32 bits) long

• In other words:
– The 2 LSBs of the 32-bit PC are always zeros
– There is no reason to have hardware to keep the 2 LSBs

• In practice, we can simplify the hardware by using a 30-bit PC<31:2>:
– Sequential operation: PC<31:2> = PC<31:2> + 1
– Branch operation: PC<31:2> = PC<31:2> + 1 + SignExt[Imm16]
– In either case: Instruction Memory Address = PC<31:2> concat “00”
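The equivalence of the 32-bit and 30-bit formulations can be checked numerically. The sketch below reuses the earlier beq example (PC = 0x4c, Imm16 = -3) as test data; the function names are hypothetical.

```python
MASK30 = (1 << 30) - 1
MASK32 = (1 << 32) - 1


def branch_pc32(pc: int, imm16: int) -> int:
    """32-bit form: PC + 4 + SignExt[Imm16] * 4 (imm16 given as a signed int)."""
    return (pc + 4 + imm16 * 4) & MASK32


def branch_pc30(pc30: int, imm16: int) -> int:
    """30-bit form: PC<31:2> + 1 + SignExt[Imm16]."""
    return (pc30 + 1 + imm16) & MASK30


def instruction_address(pc30: int) -> int:
    """Instruction Memory Address = PC<31:2> concat '00'."""
    return pc30 << 2


pc = 0x4C
imm16 = -3

wide = branch_pc32(pc, imm16)
narrow = instruction_address(branch_pc30(pc >> 2, imm16))
print(hex(wide), hex(narrow))
```

Both forms yield the same instruction address, which is why the two low bits of the PC need no hardware.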


Single Cycle Implementation

Putting it all together
– Supports Arithmetic, Load/Store and Branch instructions

[Figure: complete single-cycle datapath. PC supplies the Read address to Instruction memory (Instruction [31–0]); Registers (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2); Sign extend of Instruction [15–0] from 16 to 32 bits; ALU (ALU result, Zero) with ALU control; Data memory (Address, Write data, Read data); an Add for PC + 4 and an Add with Shift left 2 for the branch target; muxes controlled by RegDst, ALUSrc, MemtoReg, and PCSrc; other control signals RegWrite, MemRead, MemWrite, and ALUOp]




