100
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Lecture 8: “Processor Architecture and Design” John P. Shen & Zhiyi Yu September 26, 2016 9/26/2016 (©J.P. Shen) 18-600 Lecture #8 1 18 - 600 Foundations of Computer Systems CS: AAP CS: APP 18-600 Required Reading Assignment: Chapter 4 of CS:APP (3 rd edition) by Randy Bryant & Dave O’Hallaron. Recommended Reference: Chapters 1 and 2 of Shen and Lipasti (SnL).

Bryant and O’Hallaron , Computer Systems: A Programmer’s …ece600/fall16/lectures/... · 2016-09-26 · Processor Organization Design c. Y86-64 Sequential Processor. 3. Motivation

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

John P Shen amp Zhiyi YuSeptember 26 2016

9262016 (copyJP Shen) 18-600 Lecture 8 1

18-600 Foundations of Computer Systems

CS AAP

CS APP

18-600

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 2

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ComputerEngineering

Application Softwarebull Program developmentbull Program compilation

Hardware Technologybull Program executionbull Computer performance

Computer Instruction Set Architecture

9262016 (copyJP Shen) 18-600 Lecture 8 3

Instruction Set Architecture (ISA) SPECIFICATION

IMPLEMENTATION

SoftwareEngineering

Proc

esso

rDe

sign

Prog

ram

Deve

lopm

ent

CS APP

CS AAP

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Set Architecture Assembly Language View

bull Processor state (architecture state)bull Registers Memory hellip

bull Instructions (specify operations operands)bull addq pushq ret hellipbull How instructions are encoded as bytes

Layer of Abstractionbull Above how to program machine

bull Processor executes instructions in a sequencebull Below what needs to be implemented

bull Use variety of uarch techniques to make it run fastbull Eg execute multiple instructions simultaneously

ISA

Compiler OS

CPUDesign

CircuitDesign

ChipLayout

ApplicationProgram

9262016 (copyJP Shen) 18-600 Lecture 8 4

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC Instruction Sets Complex Instruction Set Computer IA32 is example

Stack-oriented instruction set Use stack to pass arguments save program counter Explicit push and pop instructions

Arithmetic instructions can access memory addq rax 12(rbxrcx8)

requires memory read and write Complex address calculation

Condition codes Set as side effect of arithmetic and logical instructions

Philosophy Add instructions to perform ldquotypicalrdquo programming tasks

9262016 (copyJP Shen) 18-600 Lecture 8 5

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

RISC Instruction Sets Reduced Instruction Set Computer Internal project at IBM later popularized by Hennessy (Stanford) and Patterson (Berkeley)

Fewer simpler instructions Might take more to get given task done Can execute them with small and fast hardware

Register-oriented instruction set Many more (typically 32) registers Use for arguments return pointer temporaries

Only load and store instructions can access memory Similar to Y86-64 mrmovq and rmmovq

No Condition codes Test instructions return 01 in register

9262016 (copyJP Shen) 18-600 Lecture 8 6

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Quantum Physics

Transistors amp Devices

Logic Gates amp Memory

Von Neumann Machine

x86 Machine Primitives

Visual C++

Firefox MS Excel

Windows

Computer Dynamic-Static Interface

9262016 (copyJP Shen) 18-600 Lecture 8 7

DynamicStatic Interface (DSI)=(ISA)

ldquosta

ticrdquo

ldquodyn

amic

rdquo

Architectural State

Microarchitecture State

MACHINE

PROGRAM

Exposed to SW

Hidden in HW

Buffering needed between arch and uarch statesbull Allow uarch state to deviate from arch statebull Able to undo speculative uarch state if needed

Architectural state requirementsbull Support sequential instruction execution semanticsbull Support precise servicing of exceptions amp interrupts

ARCHITECTURE

DSI = ISA = a contract between the program and the machine

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC vs RISCOriginal Debate

Strong opinions CISC proponents---easy for compiler fewer code bytes RISC proponents---better for optimizing compilers can make machine run fast with

simple chip design

Current Status For desktop processors choice of ISA not a technical issue

With enough hardware can make anything run fast Code compatibility more important

x86-64 adopted many RISC features More registers use them for argument passing

For embedded processors RISC makes sense Smaller cheaper less power Most mobile phones use ARM processors

9262016 (copyJP Shen) 18-600 Lecture 8 8

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Registers$0$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15

$0$at$v0$v1$a0$a1$a2$a3$t0$t1$t2$t3$t4$t5$t6$t7

Constant 0Reserved Temp

Return Values

Procedure arguments

Caller SaveTemporariesMay be overwritten bycalled procedures

$16$17$18$19$20$21$22$23$24$25$26$27$28$29$30$31

$s0$s1$s2$s3$s4$s5$s6$s7$t8$t9$k0$k1$gp$sp$s8$ra

Reserved forOperating Sys

Caller Save Temp

Global Pointer

Callee SaveTemporariesMay not beoverwritten bycalled procedures

Stack PointerCallee Save TempReturn Address

9262016 (copyJP Shen) 18-600 Lecture 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Instruction Examples

Op Ra Rb Offset

Op Ra Rb Rd Fn00000R-R

Op Ra Rb ImmediateR-I

LoadStore

addu $3$2$1 Register add $3 = $2+$1

addu $3$2 3145 Immediate add $3 = $2+3145

sll $3$22 Shift left $3 = $2 ltlt 2

lw $316($2) Load Word $3 = M[$2+16]

sw $316($2) Store Word M[$2+16] = $3

Op Ra Rb OffsetBranch

beq $3$2dest Branch when $3 = $2

9262016 (copyJP Shen) 18-600 Lecture 8 10

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Evolutions of MIPS and ARM ISAs

9262016 (copyJP Shen) 18-600 Lecture 8 11

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 2

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ComputerEngineering

Application Softwarebull Program developmentbull Program compilation

Hardware Technologybull Program executionbull Computer performance

Computer Instruction Set Architecture

9262016 (copyJP Shen) 18-600 Lecture 8 3

Instruction Set Architecture (ISA) SPECIFICATION

IMPLEMENTATION

SoftwareEngineering

Proc

esso

rDe

sign

Prog

ram

Deve

lopm

ent

CS APP

CS AAP

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Set Architecture Assembly Language View

bull Processor state (architecture state)bull Registers Memory hellip

bull Instructions (specify operations operands)bull addq pushq ret hellipbull How instructions are encoded as bytes

Layer of Abstractionbull Above how to program machine

bull Processor executes instructions in a sequencebull Below what needs to be implemented

bull Use variety of uarch techniques to make it run fastbull Eg execute multiple instructions simultaneously

ISA

Compiler OS

CPUDesign

CircuitDesign

ChipLayout

ApplicationProgram

9262016 (copyJP Shen) 18-600 Lecture 8 4

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC Instruction Sets Complex Instruction Set Computer IA32 is example

Stack-oriented instruction set Use stack to pass arguments save program counter Explicit push and pop instructions

Arithmetic instructions can access memory addq rax 12(rbxrcx8)

requires memory read and write Complex address calculation

Condition codes Set as side effect of arithmetic and logical instructions

Philosophy Add instructions to perform ldquotypicalrdquo programming tasks

9262016 (copyJP Shen) 18-600 Lecture 8 5

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

RISC Instruction Sets Reduced Instruction Set Computer Internal project at IBM later popularized by Hennessy (Stanford) and Patterson (Berkeley)

Fewer simpler instructions Might take more to get given task done Can execute them with small and fast hardware

Register-oriented instruction set Many more (typically 32) registers Use for arguments return pointer temporaries

Only load and store instructions can access memory Similar to Y86-64 mrmovq and rmmovq

No Condition codes Test instructions return 01 in register

9262016 (copyJP Shen) 18-600 Lecture 8 6

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Quantum Physics

Transistors amp Devices

Logic Gates amp Memory

Von Neumann Machine

x86 Machine Primitives

Visual C++

Firefox MS Excel

Windows

Computer Dynamic-Static Interface

9262016 (copyJP Shen) 18-600 Lecture 8 7

DynamicStatic Interface (DSI)=(ISA)

ldquosta

ticrdquo

ldquodyn

amic

rdquo

Architectural State

Microarchitecture State

MACHINE

PROGRAM

Exposed to SW

Hidden in HW

Buffering needed between arch and uarch statesbull Allow uarch state to deviate from arch statebull Able to undo speculative uarch state if needed

Architectural state requirementsbull Support sequential instruction execution semanticsbull Support precise servicing of exceptions amp interrupts

ARCHITECTURE

DSI = ISA = a contract between the program and the machine

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC vs RISCOriginal Debate

Strong opinions CISC proponents---easy for compiler fewer code bytes RISC proponents---better for optimizing compilers can make machine run fast with

simple chip design

Current Status For desktop processors choice of ISA not a technical issue

With enough hardware can make anything run fast Code compatibility more important

x86-64 adopted many RISC features More registers use them for argument passing

For embedded processors RISC makes sense Smaller cheaper less power Most mobile phones use ARM processors

9262016 (copyJP Shen) 18-600 Lecture 8 8

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Registers$0$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15

$0$at$v0$v1$a0$a1$a2$a3$t0$t1$t2$t3$t4$t5$t6$t7

Constant 0Reserved Temp

Return Values

Procedure arguments

Caller SaveTemporariesMay be overwritten bycalled procedures

$16$17$18$19$20$21$22$23$24$25$26$27$28$29$30$31

$s0$s1$s2$s3$s4$s5$s6$s7$t8$t9$k0$k1$gp$sp$s8$ra

Reserved forOperating Sys

Caller Save Temp

Global Pointer

Callee SaveTemporariesMay not beoverwritten bycalled procedures

Stack PointerCallee Save TempReturn Address

9262016 (copyJP Shen) 18-600 Lecture 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Instruction Examples

Op Ra Rb Offset

Op Ra Rb Rd Fn00000R-R

Op Ra Rb ImmediateR-I

LoadStore

addu $3$2$1 Register add $3 = $2+$1

addu $3$2 3145 Immediate add $3 = $2+3145

sll $3$22 Shift left $3 = $2 ltlt 2

lw $316($2) Load Word $3 = M[$2+16]

sw $316($2) Store Word M[$2+16] = $3

Op Ra Rb OffsetBranch

beq $3$2dest Branch when $3 = $2

9262016 (copyJP Shen) 18-600 Lecture 8 10

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Evolutions of MIPS and ARM ISAs

9262016 (copyJP Shen) 18-600 Lecture 8 11

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ComputerEngineering

Application Softwarebull Program developmentbull Program compilation

Hardware Technologybull Program executionbull Computer performance

Computer Instruction Set Architecture

9262016 (copyJP Shen) 18-600 Lecture 8 3

Instruction Set Architecture (ISA) SPECIFICATION

IMPLEMENTATION

SoftwareEngineering

Proc

esso

rDe

sign

Prog

ram

Deve

lopm

ent

CS APP

CS AAP

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Set Architecture Assembly Language View

bull Processor state (architecture state)bull Registers Memory hellip

bull Instructions (specify operations operands)bull addq pushq ret hellipbull How instructions are encoded as bytes

Layer of Abstractionbull Above how to program machine

bull Processor executes instructions in a sequencebull Below what needs to be implemented

bull Use variety of uarch techniques to make it run fastbull Eg execute multiple instructions simultaneously

ISA

Compiler OS

CPUDesign

CircuitDesign

ChipLayout

ApplicationProgram

9262016 (copyJP Shen) 18-600 Lecture 8 4

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC Instruction Sets Complex Instruction Set Computer IA32 is example

Stack-oriented instruction set Use stack to pass arguments save program counter Explicit push and pop instructions

Arithmetic instructions can access memory addq rax 12(rbxrcx8)

requires memory read and write Complex address calculation

Condition codes Set as side effect of arithmetic and logical instructions

Philosophy Add instructions to perform ldquotypicalrdquo programming tasks

9262016 (copyJP Shen) 18-600 Lecture 8 5

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

RISC Instruction Sets Reduced Instruction Set Computer Internal project at IBM later popularized by Hennessy (Stanford) and Patterson (Berkeley)

Fewer simpler instructions Might take more to get given task done Can execute them with small and fast hardware

Register-oriented instruction set Many more (typically 32) registers Use for arguments return pointer temporaries

Only load and store instructions can access memory Similar to Y86-64 mrmovq and rmmovq

No Condition codes Test instructions return 01 in register

9262016 (copyJP Shen) 18-600 Lecture 8 6

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Quantum Physics

Transistors amp Devices

Logic Gates amp Memory

Von Neumann Machine

x86 Machine Primitives

Visual C++

Firefox MS Excel

Windows

Computer Dynamic-Static Interface

9262016 (copyJP Shen) 18-600 Lecture 8 7

DynamicStatic Interface (DSI)=(ISA)

ldquosta

ticrdquo

ldquodyn

amic

rdquo

Architectural State

Microarchitecture State

MACHINE

PROGRAM

Exposed to SW

Hidden in HW

Buffering needed between arch and uarch statesbull Allow uarch state to deviate from arch statebull Able to undo speculative uarch state if needed

Architectural state requirementsbull Support sequential instruction execution semanticsbull Support precise servicing of exceptions amp interrupts

ARCHITECTURE

DSI = ISA = a contract between the program and the machine

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC vs RISCOriginal Debate

Strong opinions CISC proponents---easy for compiler fewer code bytes RISC proponents---better for optimizing compilers can make machine run fast with

simple chip design

Current Status For desktop processors choice of ISA not a technical issue

With enough hardware can make anything run fast Code compatibility more important

x86-64 adopted many RISC features More registers use them for argument passing

For embedded processors RISC makes sense Smaller cheaper less power Most mobile phones use ARM processors

9262016 (copyJP Shen) 18-600 Lecture 8 8

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Registers$0$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15

$0$at$v0$v1$a0$a1$a2$a3$t0$t1$t2$t3$t4$t5$t6$t7

Constant 0Reserved Temp

Return Values

Procedure arguments

Caller SaveTemporariesMay be overwritten bycalled procedures

$16$17$18$19$20$21$22$23$24$25$26$27$28$29$30$31

$s0$s1$s2$s3$s4$s5$s6$s7$t8$t9$k0$k1$gp$sp$s8$ra

Reserved forOperating Sys

Caller Save Temp

Global Pointer

Callee SaveTemporariesMay not beoverwritten bycalled procedures

Stack PointerCallee Save TempReturn Address

9262016 (copyJP Shen) 18-600 Lecture 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Instruction Examples

Op Ra Rb Offset

Op Ra Rb Rd Fn00000R-R

Op Ra Rb ImmediateR-I

LoadStore

addu $3$2$1 Register add $3 = $2+$1

addu $3$2 3145 Immediate add $3 = $2+3145

sll $3$22 Shift left $3 = $2 ltlt 2

lw $316($2) Load Word $3 = M[$2+16]

sw $316($2) Store Word M[$2+16] = $3

Op Ra Rb OffsetBranch

beq $3$2dest Branch when $3 = $2

9262016 (copyJP Shen) 18-600 Lecture 8 10

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Evolutions of MIPS and ARM ISAs

9262016 (copyJP Shen) 18-600 Lecture 8 11

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Set Architecture Assembly Language View

bull Processor state (architecture state)bull Registers Memory hellip

bull Instructions (specify operations operands)bull addq pushq ret hellipbull How instructions are encoded as bytes

Layer of Abstractionbull Above how to program machine

bull Processor executes instructions in a sequencebull Below what needs to be implemented

bull Use variety of uarch techniques to make it run fastbull Eg execute multiple instructions simultaneously

ISA

Compiler OS

CPUDesign

CircuitDesign

ChipLayout

ApplicationProgram

9262016 (copyJP Shen) 18-600 Lecture 8 4

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC Instruction Sets Complex Instruction Set Computer IA32 is example

Stack-oriented instruction set Use stack to pass arguments save program counter Explicit push and pop instructions

Arithmetic instructions can access memory addq rax 12(rbxrcx8)

requires memory read and write Complex address calculation

Condition codes Set as side effect of arithmetic and logical instructions

Philosophy Add instructions to perform ldquotypicalrdquo programming tasks

9262016 (copyJP Shen) 18-600 Lecture 8 5

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

RISC Instruction Sets Reduced Instruction Set Computer Internal project at IBM later popularized by Hennessy (Stanford) and Patterson (Berkeley)

Fewer simpler instructions Might take more to get given task done Can execute them with small and fast hardware

Register-oriented instruction set Many more (typically 32) registers Use for arguments return pointer temporaries

Only load and store instructions can access memory Similar to Y86-64 mrmovq and rmmovq

No Condition codes Test instructions return 01 in register

9262016 (copyJP Shen) 18-600 Lecture 8 6

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Quantum Physics

Transistors amp Devices

Logic Gates amp Memory

Von Neumann Machine

x86 Machine Primitives

Visual C++

Firefox MS Excel

Windows

Computer Dynamic-Static Interface

9262016 (copyJP Shen) 18-600 Lecture 8 7

DynamicStatic Interface (DSI)=(ISA)

ldquosta

ticrdquo

ldquodyn

amic

rdquo

Architectural State

Microarchitecture State

MACHINE

PROGRAM

Exposed to SW

Hidden in HW

Buffering needed between arch and uarch statesbull Allow uarch state to deviate from arch statebull Able to undo speculative uarch state if needed

Architectural state requirementsbull Support sequential instruction execution semanticsbull Support precise servicing of exceptions amp interrupts

ARCHITECTURE

DSI = ISA = a contract between the program and the machine

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC vs RISCOriginal Debate

Strong opinions CISC proponents---easy for compiler fewer code bytes RISC proponents---better for optimizing compilers can make machine run fast with

simple chip design

Current Status For desktop processors choice of ISA not a technical issue

With enough hardware can make anything run fast Code compatibility more important

x86-64 adopted many RISC features More registers use them for argument passing

For embedded processors RISC makes sense Smaller cheaper less power Most mobile phones use ARM processors

9262016 (copyJP Shen) 18-600 Lecture 8 8

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Registers$0$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15

$0$at$v0$v1$a0$a1$a2$a3$t0$t1$t2$t3$t4$t5$t6$t7

Constant 0Reserved Temp

Return Values

Procedure arguments

Caller SaveTemporariesMay be overwritten bycalled procedures

$16$17$18$19$20$21$22$23$24$25$26$27$28$29$30$31

$s0$s1$s2$s3$s4$s5$s6$s7$t8$t9$k0$k1$gp$sp$s8$ra

Reserved forOperating Sys

Caller Save Temp

Global Pointer

Callee SaveTemporariesMay not beoverwritten bycalled procedures

Stack PointerCallee Save TempReturn Address

9262016 (copyJP Shen) 18-600 Lecture 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Instruction Examples

Op Ra Rb Offset

Op Ra Rb Rd Fn00000R-R

Op Ra Rb ImmediateR-I

LoadStore

addu $3$2$1 Register add $3 = $2+$1

addu $3$2 3145 Immediate add $3 = $2+3145

sll $3$22 Shift left $3 = $2 ltlt 2

lw $316($2) Load Word $3 = M[$2+16]

sw $316($2) Store Word M[$2+16] = $3

Op Ra Rb OffsetBranch

beq $3$2dest Branch when $3 = $2

9262016 (copyJP Shen) 18-600 Lecture 8 10

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Evolutions of MIPS and ARM ISAs

9262016 (copyJP Shen) 18-600 Lecture 8 11

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC Instruction Sets Complex Instruction Set Computer IA32 is example

Stack-oriented instruction set Use stack to pass arguments save program counter Explicit push and pop instructions

Arithmetic instructions can access memory addq rax 12(rbxrcx8)

requires memory read and write Complex address calculation

Condition codes Set as side effect of arithmetic and logical instructions

Philosophy Add instructions to perform ldquotypicalrdquo programming tasks

9262016 (copyJP Shen) 18-600 Lecture 8 5

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

RISC Instruction Sets Reduced Instruction Set Computer Internal project at IBM later popularized by Hennessy (Stanford) and Patterson (Berkeley)

Fewer simpler instructions Might take more to get given task done Can execute them with small and fast hardware

Register-oriented instruction set Many more (typically 32) registers Use for arguments return pointer temporaries

Only load and store instructions can access memory Similar to Y86-64 mrmovq and rmmovq

No Condition codes Test instructions return 01 in register

9262016 (copyJP Shen) 18-600 Lecture 8 6

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Quantum Physics

Transistors amp Devices

Logic Gates amp Memory

Von Neumann Machine

x86 Machine Primitives

Visual C++

Firefox MS Excel

Windows

Computer Dynamic-Static Interface

9262016 (copyJP Shen) 18-600 Lecture 8 7

DynamicStatic Interface (DSI)=(ISA)

ldquosta

ticrdquo

ldquodyn

amic

rdquo

Architectural State

Microarchitecture State

MACHINE

PROGRAM

Exposed to SW

Hidden in HW

Buffering needed between arch and uarch statesbull Allow uarch state to deviate from arch statebull Able to undo speculative uarch state if needed

Architectural state requirementsbull Support sequential instruction execution semanticsbull Support precise servicing of exceptions amp interrupts

ARCHITECTURE

DSI = ISA = a contract between the program and the machine

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC vs RISCOriginal Debate

Strong opinions CISC proponents---easy for compiler fewer code bytes RISC proponents---better for optimizing compilers can make machine run fast with

simple chip design

Current Status For desktop processors choice of ISA not a technical issue

With enough hardware can make anything run fast Code compatibility more important

x86-64 adopted many RISC features More registers use them for argument passing

For embedded processors RISC makes sense Smaller cheaper less power Most mobile phones use ARM processors

9262016 (copyJP Shen) 18-600 Lecture 8 8

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Registers$0$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15

$0$at$v0$v1$a0$a1$a2$a3$t0$t1$t2$t3$t4$t5$t6$t7

Constant 0Reserved Temp

Return Values

Procedure arguments

Caller SaveTemporariesMay be overwritten bycalled procedures

$16$17$18$19$20$21$22$23$24$25$26$27$28$29$30$31

$s0$s1$s2$s3$s4$s5$s6$s7$t8$t9$k0$k1$gp$sp$s8$ra

Reserved forOperating Sys

Caller Save Temp

Global Pointer

Callee SaveTemporariesMay not beoverwritten bycalled procedures

Stack PointerCallee Save TempReturn Address

9262016 (copyJP Shen) 18-600 Lecture 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Instruction Examples

Op Ra Rb Offset

Op Ra Rb Rd Fn00000R-R

Op Ra Rb ImmediateR-I

LoadStore

addu $3$2$1 Register add $3 = $2+$1

addu $3$2 3145 Immediate add $3 = $2+3145

sll $3$22 Shift left $3 = $2 ltlt 2

lw $316($2) Load Word $3 = M[$2+16]

sw $316($2) Store Word M[$2+16] = $3

Op Ra Rb OffsetBranch

beq $3$2dest Branch when $3 = $2

9262016 (copyJP Shen) 18-600 Lecture 8 10

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Evolutions of MIPS and ARM ISAs

9262016 (copyJP Shen) 18-600 Lecture 8 11

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

RISC Instruction Sets Reduced Instruction Set Computer Internal project at IBM later popularized by Hennessy (Stanford) and Patterson (Berkeley)

Fewer simpler instructions Might take more to get given task done Can execute them with small and fast hardware

Register-oriented instruction set Many more (typically 32) registers Use for arguments return pointer temporaries

Only load and store instructions can access memory Similar to Y86-64 mrmovq and rmmovq

No Condition codes Test instructions return 01 in register

9262016 (copyJP Shen) 18-600 Lecture 8 6

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Quantum Physics

Transistors amp Devices

Logic Gates amp Memory

Von Neumann Machine

x86 Machine Primitives

Visual C++

Firefox MS Excel

Windows

Computer Dynamic-Static Interface

9262016 (copyJP Shen) 18-600 Lecture 8 7

DynamicStatic Interface (DSI)=(ISA)

ldquosta

ticrdquo

ldquodyn

amic

rdquo

Architectural State

Microarchitecture State

MACHINE

PROGRAM

Exposed to SW

Hidden in HW

Buffering needed between arch and uarch statesbull Allow uarch state to deviate from arch statebull Able to undo speculative uarch state if needed

Architectural state requirementsbull Support sequential instruction execution semanticsbull Support precise servicing of exceptions amp interrupts

ARCHITECTURE

DSI = ISA = a contract between the program and the machine

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC vs RISCOriginal Debate

Strong opinions CISC proponents---easy for compiler fewer code bytes RISC proponents---better for optimizing compilers can make machine run fast with

simple chip design

Current Status For desktop processors choice of ISA not a technical issue

With enough hardware can make anything run fast Code compatibility more important

x86-64 adopted many RISC features More registers use them for argument passing

For embedded processors RISC makes sense Smaller cheaper less power Most mobile phones use ARM processors

9262016 (copyJP Shen) 18-600 Lecture 8 8

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Registers$0$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15

$0$at$v0$v1$a0$a1$a2$a3$t0$t1$t2$t3$t4$t5$t6$t7

Constant 0Reserved Temp

Return Values

Procedure arguments

Caller SaveTemporariesMay be overwritten bycalled procedures

$16$17$18$19$20$21$22$23$24$25$26$27$28$29$30$31

$s0$s1$s2$s3$s4$s5$s6$s7$t8$t9$k0$k1$gp$sp$s8$ra

Reserved forOperating Sys

Caller Save Temp

Global Pointer

Callee SaveTemporariesMay not beoverwritten bycalled procedures

Stack PointerCallee Save TempReturn Address

9262016 (copyJP Shen) 18-600 Lecture 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Instruction Examples

Op Ra Rb Offset

Op Ra Rb Rd Fn00000R-R

Op Ra Rb ImmediateR-I

LoadStore

addu $3$2$1 Register add $3 = $2+$1

addu $3$2 3145 Immediate add $3 = $2+3145

sll $3$22 Shift left $3 = $2 ltlt 2

lw $316($2) Load Word $3 = M[$2+16]

sw $316($2) Store Word M[$2+16] = $3

Op Ra Rb OffsetBranch

beq $3$2dest Branch when $3 = $2

9262016 (copyJP Shen) 18-600 Lecture 8 10

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Evolutions of MIPS and ARM ISAs

9262016 (copyJP Shen) 18-600 Lecture 8 11

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Quantum Physics

Transistors amp Devices

Logic Gates amp Memory

Von Neumann Machine

x86 Machine Primitives

Visual C++

Firefox MS Excel

Windows

Computer Dynamic-Static Interface

9262016 (copyJP Shen) 18-600 Lecture 8 7

DynamicStatic Interface (DSI)=(ISA)

ldquosta

ticrdquo

ldquodyn

amic

rdquo

Architectural State

Microarchitecture State

MACHINE

PROGRAM

Exposed to SW

Hidden in HW

Buffering needed between arch and uarch statesbull Allow uarch state to deviate from arch statebull Able to undo speculative uarch state if needed

Architectural state requirementsbull Support sequential instruction execution semanticsbull Support precise servicing of exceptions amp interrupts

ARCHITECTURE

DSI = ISA = a contract between the program and the machine

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC vs RISCOriginal Debate

Strong opinions CISC proponents---easy for compiler fewer code bytes RISC proponents---better for optimizing compilers can make machine run fast with

simple chip design

Current Status For desktop processors choice of ISA not a technical issue

With enough hardware can make anything run fast Code compatibility more important

x86-64 adopted many RISC features More registers use them for argument passing

For embedded processors RISC makes sense Smaller cheaper less power Most mobile phones use ARM processors

9262016 (copyJP Shen) 18-600 Lecture 8 8

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Registers$0$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15

$0$at$v0$v1$a0$a1$a2$a3$t0$t1$t2$t3$t4$t5$t6$t7

Constant 0Reserved Temp

Return Values

Procedure arguments

Caller SaveTemporariesMay be overwritten bycalled procedures

$16$17$18$19$20$21$22$23$24$25$26$27$28$29$30$31

$s0$s1$s2$s3$s4$s5$s6$s7$t8$t9$k0$k1$gp$sp$s8$ra

Reserved forOperating Sys

Caller Save Temp

Global Pointer

Callee SaveTemporariesMay not beoverwritten bycalled procedures

Stack PointerCallee Save TempReturn Address

9262016 (copyJP Shen) 18-600 Lecture 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Instruction Examples

Op Ra Rb Offset

Op Ra Rb Rd Fn00000R-R

Op Ra Rb ImmediateR-I

LoadStore

addu $3$2$1 Register add $3 = $2+$1

addu $3$2 3145 Immediate add $3 = $2+3145

sll $3$22 Shift left $3 = $2 ltlt 2

lw $316($2) Load Word $3 = M[$2+16]

sw $316($2) Store Word M[$2+16] = $3

Op Ra Rb OffsetBranch

beq $3$2dest Branch when $3 = $2

9262016 (copyJP Shen) 18-600 Lecture 8 10

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Evolutions of MIPS and ARM ISAs

9262016 (copyJP Shen) 18-600 Lecture 8 11

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CISC vs RISCOriginal Debate

Strong opinions CISC proponents---easy for compiler fewer code bytes RISC proponents---better for optimizing compilers can make machine run fast with

simple chip design

Current Status For desktop processors choice of ISA not a technical issue

With enough hardware can make anything run fast Code compatibility more important

x86-64 adopted many RISC features More registers use them for argument passing

For embedded processors RISC makes sense Smaller cheaper less power Most mobile phones use ARM processors

9262016 (copyJP Shen) 18-600 Lecture 8 8

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Registers$0$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15

$0$at$v0$v1$a0$a1$a2$a3$t0$t1$t2$t3$t4$t5$t6$t7

Constant 0Reserved Temp

Return Values

Procedure arguments

Caller SaveTemporariesMay be overwritten bycalled procedures

$16$17$18$19$20$21$22$23$24$25$26$27$28$29$30$31

$s0$s1$s2$s3$s4$s5$s6$s7$t8$t9$k0$k1$gp$sp$s8$ra

Reserved forOperating Sys

Caller Save Temp

Global Pointer

Callee SaveTemporariesMay not beoverwritten bycalled procedures

Stack PointerCallee Save TempReturn Address

9262016 (copyJP Shen) 18-600 Lecture 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Instruction Examples

Op Ra Rb Offset

Op Ra Rb Rd Fn00000R-R

Op Ra Rb ImmediateR-I

LoadStore

addu $3$2$1 Register add $3 = $2+$1

addu $3$2 3145 Immediate add $3 = $2+3145

sll $3$22 Shift left $3 = $2 ltlt 2

lw $316($2) Load Word $3 = M[$2+16]

sw $316($2) Store Word M[$2+16] = $3

Op Ra Rb OffsetBranch

beq $3$2dest Branch when $3 = $2

9262016 (copyJP Shen) 18-600 Lecture 8 10

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Evolutions of MIPS and ARM ISAs

9262016 (copyJP Shen) 18-600 Lecture 8 11

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Registers$0$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15

$0$at$v0$v1$a0$a1$a2$a3$t0$t1$t2$t3$t4$t5$t6$t7

Constant 0Reserved Temp

Return Values

Procedure arguments

Caller SaveTemporariesMay be overwritten bycalled procedures

$16$17$18$19$20$21$22$23$24$25$26$27$28$29$30$31

$s0$s1$s2$s3$s4$s5$s6$s7$t8$t9$k0$k1$gp$sp$s8$ra

Reserved forOperating Sys

Caller Save Temp

Global Pointer

Callee SaveTemporariesMay not beoverwritten bycalled procedures

Stack PointerCallee Save TempReturn Address

9262016 (copyJP Shen) 18-600 Lecture 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Instruction Examples

Op Ra Rb Offset

Op Ra Rb Rd Fn00000R-R

Op Ra Rb ImmediateR-I

LoadStore

addu $3$2$1 Register add $3 = $2+$1

addu $3$2 3145 Immediate add $3 = $2+3145

sll $3$22 Shift left $3 = $2 ltlt 2

lw $316($2) Load Word $3 = M[$2+16]

sw $316($2) Store Word M[$2+16] = $3

Op Ra Rb OffsetBranch

beq $3$2dest Branch when $3 = $2

9262016 (copyJP Shen) 18-600 Lecture 8 10

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Evolutions of MIPS and ARM ISAs

9262016 (copyJP Shen) 18-600 Lecture 8 11

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

MIPS Instruction Examples

Op Ra Rb Offset

Op Ra Rb Rd Fn00000R-R

Op Ra Rb ImmediateR-I

LoadStore

addu $3$2$1 Register add $3 = $2+$1

addu $3$2 3145 Immediate add $3 = $2+3145

sll $3$22 Shift left $3 = $2 ltlt 2

lw $316($2) Load Word $3 = M[$2+16]

sw $316($2) Store Word M[$2+16] = $3

Op Ra Rb OffsetBranch

beq $3$2dest Branch when $3 = $2

9262016 (copyJP Shen) 18-600 Lecture 8 10

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Evolutions of MIPS and ARM ISAs

9262016 (copyJP Shen) 18-600 Lecture 8 11

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Evolutions of MIPS and ARM ISAs

9262016 (copyJP Shen) 18-600 Lecture 8 11

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ISA Design Summary

How Important is ISA Design Less now than before

With enough hardware can make almost anything go fast Installed based of legacy code and SW development tools dominate

Y86-64 Instruction Set Architecture Similar state and instructions as x86-64 Simpler encodings Somewhere between CISC and RISC

9262016 (copyJP Shen) 18-600 Lecture 8 12

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Processor Statebull Program Registers

bull 15 registers (omit r15) Each 64 bits

bull Condition Codesbull Single-bit flags set by arithmetic or logical instructions

bull ZF Zero SFNegative OF Overflow

bull Program Counterbull Indicates address of next instruction

bull Program Statusbull Indicates either normal operation or some error condition

bull Memorybull Byte-addressable storage arraybull Words stored in little-endian byte order

ZFSFOF

RF Program registers

CC Condition

codes

PC

DMEM Memory

Stat Program status

r8

r9

r10

r11

r12

r13

r14

rax

rcx

rdx

rbx

rsp

rbp

rsi

rdi

9262016 (copyJP Shen) 18-600 Lecture 8 13

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Set

18-600 Lecture 89262016 (copyJP Shen) 14

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Instruction Formats

Formatbull 1ndash10 bytes of information read from memory

bull Can determine instruction length from first bytebull Fewer instruction types and simpler encoding than x86-64

bull Each instruction accesses and modifies some part(s) of the program state

9262016 (copyJP Shen) 18-600 Lecture 8 15

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

18-600 Lecture 89262016 (copyJP Shen) 16

0 1 2 3 4 5 6 7 8 9

V

D

D

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB

rmmovq rA D(rB) 4 0 rA rB

mrmovq D(rB) rA 5 0 rA rB

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

rrmovq 2 0

cmovle 2 1

cmovl 2 2

cmove 2 3

cmovne 2 4

cmovge 2 5

cmovg 2 6

Y86-64 Instructions (moves)

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

18-600 Lecture 89262016 (copyJP Shen) 17

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9

addq 6 0

subq 6 1

andq 6 2

xorq 6 3

Y86-64 Instructions (ALUs)

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

18-600 Lecture 89262016 (copyJP Shen) 18

Byte

pushq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 F rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

0 1 2 3 4 5 6 7 8 9jmp 7 0

jle 7 1

jl 7 2

je 7 3

jne 7 4

jge 7 5

jg 7 6

Y86-64 Instructions (branches)

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Encoding Register Specifiers

Each register has 4-bit ID

bull Same encoding as in x86-64 Register ID 15 (0xF) indicates ldquono registerrdquo

bull Will use this in our hardware design in multiple places

rax

rcx

rdx

rbx

0

1

2

3

rsp

rbp

rsi

rdi

4

5

6

7

r8

r9

r10

r11

8

9

A

B

r12

r13

r14

No Register

C

D

E

F

9262016 (copyJP Shen) 18-600 Lecture 8 19

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Format ExampleAddition Instruction

Add value in register rA to that in register rB Store result in register rB Note that Y86-64 only allows addition to be applied to register data

Set condition codes based on result eg addq raxrsi Encoding 60 06

Two-byte encoding First indicates instruction type Second gives source and destination registers

addq rA rB 6 0 rA rB

Encoded Representation

Generic Form

9262016 (copyJP Shen) 18-600 Lecture 8 20

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Arithmetic and Logical Operations

Refer to generically as ldquoOPqrdquo Encodings differ only by ldquofunction coderdquo

Low-order 4 bits in first instruction byte

Set condition codes as side effect

addq rA rB 6 0 rA rB

subq rA rB 6 1 rA rB

andq rA rB 6 2 rA rB

xorq rA rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

9262016 (copyJP Shen) 18-600 Lecture 8 21

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Operations

Like the x86-64 movq instruction Simpler format for memory addresses Give different names to keep them distinct

rrmovq rA rB 2 0

Register Register

Immediate Register

irmovq V rB F rB3 0 V

Register Memory

rmmovq rA D(rB) 4 0 rA rB D

Memory Register

mrmovq D(rB) rA 5 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 22

rA rB

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Move Instruction Examples

irmovq $0xabcd rdxmovq $0xabcd rdx

30 82 cd ab 00 00 00 00 00 00

X86-64 Y86-64

Encoding

rrmovq rsp rbxmovq rsp rbx

20 43

mrmovq -12(rbp)rcxmovq -12(rbp)rcx

50 15 f4 ff ff ff ff ff ff ff

rmmovq rsi0x41c(rsp)movq rsi0x41c(rsp)

40 64 1c 04 00 00 00 00 00 00

Encoding

Encoding

Encoding

9262016 (copyJP Shen) 18-600 Lecture 8 23

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Conditional Move Instructions

Refer to generically as ldquocmovXXrdquo Encodings differ only by ldquofunction coderdquo Based on values of condition codes Variants of rrmovq instruction

(Conditionally) copy value from source to destination register

rrmovq rA rB

Move Unconditionally

cmovle rA rBMove When Less or Equal

cmovl rA rB

Move When Less

cmove rA rB

Move When Equal

cmovne rA rB

Move When Not Equal

cmovge rA rB

Move When Greater or Equal

cmovg rA rB

Move When Greater

2 0 rA rB

2 1 rA rB

2 2 rA rB

2 3 rA rB

2 4 rA rB

2 5 rA rB

2 6 rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 24

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

Refer to generically as ldquojXXrdquo Encodings differ only by ldquofunction coderdquo fn Based on values of condition codes Same as x86-64 counterparts Encode full destination address

Unlike PC-relative addressing seen in x86-64

jXX Dest 7 fn

Jump (Conditionally)

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 25

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Jump Instructions

jmp Dest 7 0

Jump Unconditionally

Dest

jle Dest 7 1

Jump When Less or Equal

Dest

jl Dest 7 2

Jump When Less

Dest

je Dest 7 3

Jump When Equal

Dest

jne Dest 7 4

Jump When Not Equal

Dest

jge Dest 7 5

Jump When Greater or Equal

Dest

jg Dest 7 6

Jump When Greater

Dest

9262016 (copyJP Shen) 18-600 Lecture 8 26

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Stack

Region of memory holding program data Used in Y86-64 (and x86-64) for supporting

procedure calls Stack top indicated by rsp

Address of top stack element

Stack grows toward lower addresses Bottom element is at highest address in the stackWhen pushing must first decrement stack pointer After popping increment stack pointer

rsp

bull

bull

bull

IncreasingAddresses

Stack ldquoToprdquo

Stack ldquoBottomrdquo

9262016 (copyJP Shen) 18-600 Lecture 8 27

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stack Operations

Decrement rsp by 8 Store word from rA to memory at rsp Like x86-64

Read word from memory at rsp Save in rA Increment rsp by 8 Like x86-64

pushq rA A 0 rA F

popq rA B 0 rA F

9262016 (copyJP Shen) 18-600 Lecture 8 28

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Subroutine Call and Return

Push address of next instruction onto stack Start executing instructions at Dest Like x86-64

Pop value from stack Use as address for next instruction Like x86-64

call Dest 8 0 Dest

ret 9 0

9262016 (copyJP Shen) 18-600 Lecture 8 29

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Miscellaneous Instructions

Donrsquot do anything

Stop executing instructions x86-64 has comparable instruction but canrsquot execute it in

user mode We will use it to stop the simulator Encoding ensures that program hitting memory initialized to

zero will halt

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 30

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Status Conditions

Mnemonic CodeADR 3

Mnemonic CodeINS 4

Mnemonic CodeHLT 2

Mnemonic CodeAOK 1

Normal operation

Halt instruction encountered

Bad address (either instruction or data) encountered

Invalid instruction encountered

Desired Behavior If AOK keep going Otherwise stop program execution

9262016 (copyJP Shen) 18-600 Lecture 8 31

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Writing Y86-64 CodeTry to Use C Compiler as Much as Possible

Write code in C Compile for x86-64 with gcc ndashOg ndashS

Transliterate into Y86-64 Modern compilers make this more difficult

Coding Example Find number of elements in null-terminated list

int len1(int a[])

5043

6125

7395

0

a

rArr 3

9262016 (copyJP Shen) 18-600 Lecture 8 32

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation ExampleFirst Try

Write typical array code

Compile with gcc -Og -S

Problem Hard to do array indexing on Y86-64

Since donrsquot have scaled addressing modes

Find number of elements innull-terminated list

long len(long a[])

long lenfor (len = 0 a[len] len++)

return len

L3addq $1raxcmpq $0 (rdirax8)jne L3

9262016 (copyJP Shen) 18-600 Lecture 8 33

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 2Second Try

Write C code that mimics expected Y86-64 code

Result Compiler generates exact same code as

before Compiler converts both versions into same

intermediate formlong len2(long a)

long ip = (long) along val = (long ) iplong len = 0while (val)

ip += sizeof(long)len++val = (long ) ip

return len

9262016 (copyJP Shen) 18-600 Lecture 8 34

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Code Generation Example 3

18-600 Lecture 89262016 (copyJP Shen) 35

lenirmovq $1 r8 Constant 1irmovq $8 r9 Constant 8irmovq $0 rax len = 0mrmovq (rdi) rdx val = aandq rdx rdx Test valje Done If zero goto Done

Loopaddq r8 rax len++addq r9 rdi a++mrmovq (rdi) rdx val = aandq rdx rdx Test valjne Loop If 0 goto Loop

Doneret

Register Userdi a

rax len

rdx val

r8 1

r9 8

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Sample Program Structure 1

Program starts at address 0 Must set up stack

Where located Pointer values Make sure donrsquot overwrite code

Must initialize data

init Initialization call Mainhalt

align 8 Program dataarray

Main Main function call len

len Length function

pos 0x100 Placement of stackStack

9262016 (copyJP Shen) 18-600 Lecture 8 36

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 2

Program starts at address 0 Must set up stack Must initialize data Can use symbolic names

init Set up stack pointerirmovq Stack rsp Execute main programcall Main Terminatehalt

Array of 4 elements + terminating 0align 8

Arrayquad 0x000d000d000d000dquad 0x00c000c000c000c0quad 0x0b000b000b000b00quad 0xa000a000a000a000quad 0

9262016 (copyJP Shen) 18-600 Lecture 8 37

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Y86-64 Program Structure 3

Set up call to len Follow x86-64 procedure conventions Push array address as argument

Main irmovq arrayrdi call len(array)call lenret

9262016 (copyJP Shen) 18-600 Lecture 8 38

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Assembling Y86-64 Program

Generates ldquoobject coderdquo file lenyo Actually looks like disassembler output

unixgt yas lenys

0x054 | len0x054 30f80100000000000000 | irmovq $1 r8 Constant 10x05e 30f90800000000000000 | irmovq $8 r9 Constant 80x068 30f00000000000000000 | irmovq $0 rax len = 00x072 50270000000000000000 | mrmovq (rdi) rdx val = a0x07c 6222 | andq rdx rdx Test val0x07e 73a000000000000000 | je Done If zero goto Done0x087 | Loop0x087 6080 | addq r8 rax len++0x089 6097 | addq r9 rdi a++0x08b 50270000000000000000 | mrmovq (rdi) rdx val = a0x095 6222 | andq rdx rdx Test val0x097 748700000000000000 | jne Loop If 0 goto Loop0x0a0 | Done0x0a0 90 | ret

9262016 (copyJP Shen) 18-600 Lecture 8 39

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Simulating Y86-64 Program

Instruction set simulator Computes effect of each instruction on processor state Prints changes in state from original

unixgt yis lenyo

Stopped in 33 steps at PC = 0x13 Status HLT CC Z=1 S=0 O=0Changes to registersrax 0x0000000000000000 0x0000000000000004rsp 0x0000000000000000 0x0000000000000100rdi 0x0000000000000000 0x0000000000000038r8 0x0000000000000000 0x0000000000000001r9 0x0000000000000000 0x0000000000000008

Changes to memory0x00f0 0x0000000000000000 0x00000000000000530x00f8 0x0000000000000000 0x0000000000000013

9262016 (copyJP Shen) 18-600 Lecture 8 40

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 41

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Building BlocksCombinational Logic

Compute Boolean functions of inputs

Continuously respond to input changes

Operate on data and implement control

Storage Elements Store bits Addressable memories Non-addressable registers Loaded only as clock rises

Registerfile

A

B

W dstW

srcA

valA

srcB

valB

valW

Clock

9262016 (copyJP Shen) 18-600 Lecture 8 42

Funct

A B

ALU

Deco

de 00 11MUXSel

1001

COMP

=

PriorityEncode

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

(Cache) Memory Implementation Optionskey idx key

tag data tag data

deco

der

Associative Memory(CAM)

no indexunlimited blocks

N-Way Set-Associative Memory

k-bit index2k bull N blocks

9262016 (copyJP Shen) 18-600 Lecture 8 43

indexde

code

r

Indexed Memory

k-bit index2k blocks

Indexed Memory(Multi-Ported)(2x) k-bit index(2x) 2k blocks

index

deco

der

index

deco

der

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Hardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation

Parts we want to explore and modify

Data Types bool Boolean

a b c hellip int words

A B C hellip Does not specify word size---bytes 32-bit words hellip

Statements bool a = bool-expr

int A = int-expr

9262016 (copyJP Shen) 18-600 Lecture 8 44

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

HCL Operations Classify by type of value returned

Boolean Expressions Logic Operations

a ampamp b a || b a Word Comparisons

A == B A = B A lt B A lt= B A gt= B A gt B Set Membership

A in B C D raquo Same as A == B || A == C || A == D

Word Expressions Case expressions

[ a A b B c C ] Evaluate test expressions a b c hellip in sequence Return word expression A B C hellip for first successful test

9262016 (copyJP Shen) 18-600 Lecture 8 45

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Hardware StructureState

Program counter register (PC) Condition code register (CC) Register File Memories

Access same memory space Data for readingwriting program

data Instruction for reading

instructions

Instruction Flow Read instruction at address

specified by PC Process through stages Update program counter

9262016 (copyJP Shen) 18-600 Lecture 8 46

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ StagesFetch

Read instruction from instruction memory

Decode Read program registers

Execute Compute value or address

Memory Read or write data

Write Back Write program registers

PC Update program counter

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode ifunrA rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

PC

valP

srcA srcBdstA dstB

valA valB

aluA aluB

Cnd

valE

Addr Data

valM

PCvalE valM

newPC

9262016 (copyJP Shen) 18-600 Lecture 8 47

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Decoding

Instruction Format Instruction byte icodeifun Optional register byte rArB Optional constant word valC

5 0 rA rB D

icodeifun

rArB

valC

Optional Optional

9262016 (copyJP Shen) 18-600 Lecture 8 48

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ArithmeticLogical Operation

Fetch Read 2 bytes

Decode Read operand registers

Execute Perform operation Set condition codes

Memory Do nothing

Write back Update register

PC Update Increment PC by 2

OPq rA rB 6 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 49

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ArithLog Ops

Formulate instruction execution as sequence of simple steps

Use same general form for all instructions

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSet condition code register

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing rmmovq

Fetch Read 10 bytes

Decode Read operand registers

Execute Compute effective address

Memory Write to memory

Write back Do nothing

PC Update Increment PC by 10

rmmovq rA D(rB) 4 0 rA rB D

9262016 (copyJP Shen) 18-600 Lecture 8 51

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation rmmovq

Use ALU for address computation

rmmovq rA D(rB)icodeifun larr M1[PC]rArB larr M1[PC+1]valC larr M8[PC+2]valP larr PC+10

Fetch

Read instruction byteRead register byteRead displacement DCompute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB + valCExecute

Compute effective address

M8[valE] larr valAMemory Write value to memory Writeback

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 52

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing popq

Fetch Read 2 bytes

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read from old stack pointer

Write back Update stack pointer Write result to register

PC Update Increment PC by 2

popq rA b 0 rA 8

9262016 (copyJP Shen) 18-600 Lecture 8 53

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation popq

Use ALU to increment stack pointer Must update two registers

Popped value New stack pointer

popq rAicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rsp]valB larr R[rsp]

DecodeRead stack pointerRead stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA]Memory Read from stack R[rsp] larr valER[rA] larr valM

Writeback

Update stack pointerWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 54

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Conditional Moves

Fetch Read 2 bytes

Decode Read operand registers

Execute If cnd then set destination

register to 0xF

Memory Do nothing

Write back Update register (or not)

PC Update Increment PC by 2

cmovXX rA rB 2 fn rA rB

9262016 (copyJP Shen) 18-600 Lecture 8 55

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Cond Move

Read register rA and pass through ALU Cancel move by setting destination register to 0xF

If condition codes amp move condition indicate no move

cmovXX rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte

Compute next PCvalA larr R[rA]valB larr 0

DecodeRead operand A

valE larr valB + valAIf Cond(CCifun) rB larr 0xF

ExecutePass valA through ALU(Disable register update)

MemoryR[rB] larr valEWrite

backWrite back result

PC larr valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 56

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing Jumps

Fetch Read 9 bytes Increment PC by 9

Decode Do nothing

Execute Determine whether to take

branch based on jump condition and condition codes

Memory Do nothing

Write back Do nothing

PC Update Set PC to Dest if branch

taken or to incremented PC if not branch

jXX Dest 7 fn Dest

XX XXfall thru

XX XXtarget

Not taken

Taken

9262016 (copyJP Shen) 18-600 Lecture 8 57

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation Jumps

Compute both addresses Choose based on setting of condition codes and

branch condition

jXX Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination addressFall through address

Decode

Cnd larr Cond(CCifun)Execute

Take branchMemoryWriteback

PC larr Cnd valC valPPC update Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 58

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing call

Fetch Read 9 bytes Increment PC by 9

Decode Read stack pointer

Execute Decrement stack pointer by 8

Memory Write incremented PC to

new value of stack pointer

Write back Update stack pointer

PC Update Set PC to Dest

call Dest 8 0 Dest

XX XXreturn

XX XXtarget

9262016 (copyJP Shen) 18-600 Lecture 8 59

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation call

Use ALU to decrement stack pointer Store incremented PC

call Desticodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

Fetch

Read instruction byte

Read destination address Compute return point

valB larr R[rsp]Decode

Read stack pointervalE larr valB + ndash8

ExecuteDecrement stack pointer

M8[valE] larr valPMemory Write return value on stack R[rsp] larr valEWrite

backUpdate stack pointer

PC larr valCPC update Set PC to destination

9262016 (copyJP Shen) 18-600 Lecture 8 60

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Executing ret

Fetch Read 1 byte

Decode Read stack pointer

Execute Increment stack pointer by 8

Memory Read return address from

old stack pointer

Write back Update stack pointer

PC Update Set PC to return address

ret 9 0

XX XXreturn

9262016 (copyJP Shen) 18-600 Lecture 8 61

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Stage Computation ret

Use ALU to increment stack pointer Read return address from memory

ret

icodeifun larr M1[PC]

Fetch

Read instruction byte

valA larr R[rsp]valB larr R[rsp]

DecodeRead operand stack pointerRead operand stack pointer

valE larr valB + 8Execute

Increment stack pointer

valM larr M8[valA] Memory Read return addressR[rsp] larr valEWrite

backUpdate stack pointer

PC larr valMPC update Set PC to return address

9262016 (copyJP Shen) 18-600 Lecture 8 62

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

OPq rA rBicodeifun larr M1[PC]rArB larr M1[PC+1]

valP larr PC+2

Fetch

Read instruction byteRead register byte[Read constant word]Compute next PC

valA larr R[rA]valB larr R[rB]

DecodeRead operand ARead operand B

valE larr valB OP valASet CC

ExecutePerform ALU operationSetuse cond code reg

Memory [Memory readwrite] R[rB] larr valEWrite

backWrite back ALU result[Write back memory result]

PC larr valPPC update Update PC

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

9262016 (copyJP Shen) 18-600 Lecture 8 63

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computation Steps

All instructions follow same general pattern Differ in what gets computed on each step

call Dest

Fetch

Decode

Execute

MemoryWritebackPC update

icodeifunrArBvalCvalPvalA srcAvalB srcBvalECond codevalMdstEdstMPC

icodeifun larr M1[PC]

valC larr M8[PC+1]valP larr PC+9

valB larr R[rsp]valE larr valB + ndash8

M8[valE] larr valPR[rsp] larr valE

PC larr valC

Read instruction byte[Read register byte]Read constant wordCompute next PC[Read operand A]Read operand BPerform ALU operation[Set use cond code reg]Memory readwrite Write back ALU result[Write back memory result]Update PC

9262016 (copyJP Shen) 18-600 Lecture 8 64

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computed ValuesFetch

icode Instruction codeifun Instruction functionrA Instr Register ArB Instr Register BvalC Instruction constantvalP Incremented PC

DecodesrcA Register ID AsrcB Register ID BdstE Destination Register EdstM Destination Register MvalA Register value AvalB Register value B

Execute valE ALU result Cnd Branchmove flag

Memory valM Value from

memory

9262016 (copyJP Shen) 18-600 Lecture 8 65

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ HardwareKey

Blue boxes predesigned hardware blocks Eg memories ALU

Gray boxes control logic Describe in HCL

White ovals labels for signals

Thick lines 64-bit word values

Thin lines 4-8 bit values

Dotted lines 1-bit values

9262016 (copyJP Shen) 18-600 Lecture 8 66

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Predefined Blocks PC Register containing PC Instruction memory Read 10 bytes

(PC to PC+9) Signal invalid address

Split Divide instruction byte into icode and ifun

Align Get fields for rA rB and valCInstructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 67

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Logic

Control Logic Instr Valid Is this instruction valid icode ifun Generate no-op if invalid

address Need regids Does this instruction

have a register byte Need valC Does this instruction have

a constant word Instructionmemory

PCincrement

rBicode ifun rA

PC

valC valP

Needregids

NeedvalC

Instrvalid

AlignSplitBytes 1-9Byte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 68

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

Determine instruction codeint icode = [

imem_error INOP1 imem_icode

]

Determine instruction functionint ifun = [

imem_error FNONE1 imem_ifun

]

Instructionmemory

PC

SplitByte 0

imem_error

icode ifun

9262016 (copyJP Shen) 18-600 Lecture 8 69

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Fetch Control Logic in HCL

bool need_regids =icode in IRRMOVQ IOPQ IPUSHQ IPOPQ

IIRMOVQ IRMMOVQ IMRMOVQ

bool instr_valid = icode in INOP IHALT IRRMOVQ IIRMOVQ IRMMOVQ IMRMOVQ

IOPQ IJXX ICALL IRET IPUSHQ IPOPQ

popq rA A 0 rA F

jXX Dest 7 fn Dest

popq rA B 0 rA F

call Dest 8 0 Dest

cmovXX rA rB 2 fn rA rB

irmovq V rB 3 0 8 rB V

rmmovq rA D(rB) 4 0 rA rB D

mrmovq D(rB) rA 5 0 rA rB D

OPq rA rB 6 fn rA rB

ret 9 0

nop 1 0

halt 0 0

9262016 (copyJP Shen) 18-600 Lecture 8 70

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Decode LogicRegister File

Read ports A B Write ports E M Addresses are register IDs or

15 (0xF) (no access)

Control Logic srcA srcB read port addresses dstE dstM write port addresses

rB

dstE dstM srcA srcB

Registerfile

A BM

EdstE dstM srcA srcB

icode rA

valBvalA valEvalMCnd

Signals Cnd Indicate whether or not to

perform conditional move Computed in Execute stage

9262016 (copyJP Shen) 18-600 Lecture 8 71

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

A Source

int srcA = [icode in IRRMOVQ IRMMOVQ IOPQ IPUSHQ rAicode in IPOPQ IRET RRSP1 RNONE Dont need register

]

cmovXX rA rBvalA larr R[rA]Decode Read operand A

rmmovq rA D(rB)valA larr R[rA]Decode Read operand A

popq rAvalA larr R[rsp]Decode Read stack pointer

jXX DestDecode No operand

call Dest

valA larr R[rsp]Decode Read stack pointerret

Decode No operand

OPq rA rBvalA larr R[rA]Decode Read operand A

9262016 (copyJP Shen) 18-600 Lecture 8 72

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

E Destination

int dstE = [icode in IRRMOVQ ampamp Cnd rBicode in IIRMOVQ IOPQ rBicode in IPUSHQ IPOPQ ICALL IRET RRSP1 RNONE Dont write any register

]

None

R[rsp] larr valE Update stack pointer

None

R[rB] larr valEcmovXX rA rB

Write-back

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Write-back

Write-back

Write-back

Write-back

Write-back

Conditionally write back result

R[rsp] larr valE Update stack pointer

R[rsp] larr valE Update stack pointer

R[rB] larr valEOPq rA rB

Write-back Write back result

9262016 (copyJP Shen) 18-600 Lecture 8 73

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Execute LogicUnits

ALU Implements 4 required functions Generates condition code values

CC Register with 3 condition code

bits cond

Computes conditional jumpmove flag

Control Logic Set CC Should condition code

register be loaded ALU A Input A to ALU ALU B Input B to ALU ALU fun What function should

ALU compute

CC ALU

ALUA

ALUB

ALUfun

Cnd

icode ifun valC valBvalA

valE

SetCC

cond

9262016 (copyJP Shen) 18-600 Lecture 8 74

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU A Input

int aluA = [icode in IRRMOVQ IOPQ valAicode in IIRMOVQ IRMMOVQ IMRMOVQ valCicode in ICALL IPUSHQ -8icode in IRET IPOPQ 8 Other instructions dont need ALU

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPq rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 75

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

ALU Operation

int alufun = [icode == IOPQ ifun1 ALUADD

]

valE larr valB + ndash8 Decrement stack pointer

No operation

valE larr valB + 8 Increment stack pointer

valE larr valB + valC Compute effective address

valE larr 0 + valA Pass valA through ALUcmovXX rA rB

Execute

rmmovl rA D(rB)

popq rA

jXX Dest

call Dest

ret

Execute

Execute

Execute

Execute

Execute valE larr valB + 8 Increment stack pointer

valE larr valB OP valA Perform ALU operationOPl rA rB

Execute

9262016 (copyJP Shen) 18-600 Lecture 8 76

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory Logic

Memory Reads or writes memory

word

Control Logic stat What is instruction

status Mem read should word be

read Mem write should word be

written Mem addr Select address Mem data Select data

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

9262016 (copyJP Shen) 18-600 Lecture 8 77

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Status

Control Logic stat What is instruction

status

Datamemory

Memread

Memaddr

read

write

data out

Memdata

valE

valM

valA valP

Memwrite

data in

icode

Stat

dmem_error

instr_valid

imem_error

stat

Determine instruction statusint Stat = [

imem_error || dmem_error SADRinstr_valid SINSicode == IHALT SHLT1 SAOK

]

9262016 (copyJP Shen) 18-600 Lecture 8 78

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory AddressOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

int mem_addr = [icode in IRMMOVQ IPUSHQ ICALL IMRMOVQ valEicode in IPOPQ IRET valA Other instructions dont need address

]

9262016 (copyJP Shen) 18-600 Lecture 8 79

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Memory ReadOPq rA rB

Memory

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

No operation

M8[valE] larr valAMemory Write value to memory

valM larr M8[valA]Memory Read from stack

M8[valE] larr valPMemory Write return value on stack

valM larr M8[valA] Memory Read return address

Memory No operation

bool mem_read = icode in IMRMOVQ IPOPQ IRET

9262016 (copyJP Shen) 18-600 Lecture 8 80

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

PC Update OPq rA rB

rmmovq rA D(rB)

popq rA

jXX Dest

call Dest

ret

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr valPPC update Update PC

PC larr Cnd valC valPPC update Update PC

PC larr valCPC update Set PC to destination

PC larr valMPC update Set PC to return address

int new_pc = [icode == ICALL valCicode == IJXX ampamp Cnd valCicode == IRET valM1 valP

]

9262016 (copyJP Shen) 18-600 Lecture 8 81

New PC Select next value of PC

NewPC

Cndicode valC valPvalM

PC

Recitation

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ OperationState

PC register Cond Code register Data memory Register fileUpdated as clock rises

Combinational Logic ALU Control logic Memory reads

Instruction memory Register file Data memory

Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 82

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 2

state set according to second irmovq instruction combinational logic starting to react to state changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4Combinationallogic

Datamemory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 83

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 3

state set according to second irmovq instruction combinational logic generates results for addq instruction

Combinationallogic Data

memory

Registerfile

rbx = 0x100

PC0x014

CC100

Readports

Writeports

0x016

000

rbxlt--

0x300

Read Write

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

9262016 (copyJP Shen) 18-600 Lecture 8 84

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 4

state set according to addq instruction combinational logic starting to react to state

changes

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 85

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

SEQ Operation 5

state set according to addq instruction combinational logic generates results for je

instruction

0x014 addq rdxrbx rbx lt-- 0x300 CC lt-- 000

0x016 je dest Not taken

0x01f rmmovq rbx0(rdx) M[0x200] lt-- 0x300

Cycle 3

Cycle 4

Cycle 5

0x00a irmovq $0x200rdx rdx lt-- 0x200Cycle 2

0x000 irmovq $0x100rbx rbx lt-- 0x100Cycle 1

ClockCycle 1

Cycle 2 Cycle 3 Cycle 4

Combinationallogic

Datamemory

Registerfile

rbx = 0x300

PC0x016

CC000

Readports

Writeports

0x01f

Read Write

9262016 (copyJP Shen) 18-600 Lecture 8 86

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 8ldquoProcessor Architecture and Designrdquo

1 Processor Architecturea Instruction Set Architecture (ISA)b Instruction Set Designsc Y86-64 Instruction Set

2 Processor Implementationa Instruction Set Processor (CPU)b Processor Organization Designc Y86-64 Sequential Processor

3 Motivation for Pipelining

9262016 (copyJP Shen) 18-600 Lecture 8 87

18-600 Foundations of Computer Systems

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Computational Example

Systembull Computation requires total of 300 picosecondsbull Additional 20 picoseconds to save result in registerbull Must have clock cycle of at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 312 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 88

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

3-Way Pipelined Version

Systembull Divide combinational logic into 3 blocks of 100 ps eachbull Can begin new operation as soon as previous one passes through stage A

bull Begin new operation every 120 psbull Overall latency increases

bull 360 ps from start to finish

Reg

Clock

Comblogic

A

Reg

Comblogic

B

Reg

Comblogic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Delay = 360 psThroughput = 833 GIPS

9262016 (copyJP Shen) 18-600 Lecture 8 89

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

16x16 combinational multiplier ISCAS-85 C6288 standard benchmark

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

[Source J Hayes Univ of Michigan]

9262016 (copyJP Shen) 18-600 Lecture 8 90

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Example Integer Multiplier

Configuration Delay MPS Area (FFwiring) Area IncreaseCombinational 352ns 284 7535 (--1759)2 Stages 187ns 534 (19x) 8725 (10781870) 164 Stages 117ns 855 (30x) 11276 (33882112) 508 Stages 080ns 1250 (44x) 17127 (89382612) 127

Pipeline efficiency 2-stage nearly double throughput marginal area cost 4-stage 75 efficiency area still reasonable 8-stage 55 efficiency area more than doubles

Tools Synopsys DCLSI Logic 110nm gflxp ASIC

9262016 (copyJP Shen) 18-600 Lecture 8 91

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining FundamentalsMotivation

bull Increase throughput with little increase in hardware

Bandwidth or Throughput = Performance

Bandwidth (BW) = no of tasksunit time For a system that operates on one task at a time

bull BW = 1delay (latency) BW can be increased by pipelining if many operands exist which need the

same operation ie many repetitions of the same task are to be performed Latency required for each task remains the same or may even increase slightly

9262016 (copyJP Shen) 18-600 Lecture 8 92

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Limitations Register Overhead

bull As we try to deepen pipeline overhead of loading registers becomes more significantbull Percentage of clock cycle spent loading register

bull 1-stage pipeline 625 bull 3-stage pipeline 1667 bull 6-stage pipeline 2857

bull High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps Throughput = 1429 GIPSClock

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

Reg

Comblogic

50 ps 20 ps

9262016 (copyJP Shen) 18-600 Lecture 8 93

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with propagation delay T and BW = 1T

Ppipelined=BWpipelined = 1 (T k +S )

whereS = delay through latch and overhead

TS

S

Tk

Tk

k-stage pipelinedunpipelined

Pipelining Performance Model

9262016 (copyJP Shen) 18-600 Lecture 8 94

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Starting from an un-pipelined version with hardware cost G

Costpipelined = kL + G

where L = cost of adding each latch andk = number of stages

GL

L

Gk

Gk

k-stage pipelinedunpipelined

Hardware Cost Model

9262016 (copyJP Shen) 18-600 Lecture 8 95

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

CostPerformance CP = [Lk + G] [1(Tk + S)] = (Lk + G) (Tk + S)

= LT + GS + LSk + GTk

Optimal CostPerformance find min CP wrt choice of k

CostPerformance Trade-off

k

CP

[Peter M Kogge 1981]

9262016 (copyJP Shen) 18-600 Lecture 8 96

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

0

1

2

3

4

5

6

7

0 10 20 30 40 50Pipeline Depth k

x104

Cos

tPer

form

ance

Rat

io (C

P)

G=175 L=41 T=400 S=22

G=175 L=21 T=400 S=11

9262016 (copyJP Shen) 18-600 Lecture 8 97

ldquoOptimalrdquo Pipeline Depth (kopt) Examples

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Chart3

100
100
91152
80556
57054
45787
46289333333
34351333333
41358
28749
3876
2548
37328666667
23377666667
36564
21942
36216
20923
36145777778
20181777778
3627
19635
36535636364
19229636364
36907333333
18930333333
37360615385
18712615385
37878
18559
38446666667
18456666667
39057
18396
39701647059
18369647059
40374888889
18371888889
41072210526
18398210526
4179
18445
42525333333
18509333333
43275818182
18588818182
44039478261
18681478261
44814666667
18785666667
456
189
46394307692
19023307692
47196592593
19154592593
48006
19293
48821793103
19437793103
49643333333
19588333333
50470064516
19744064516
513015
199045
52137212121
20069212121
52976823529
20237823529
5382
2041
54666444444
20585444444
55515891892
20763891892
56368105263
20945105263
57222871795
21128871795
5808
21315
58939317073
21503317073
59800666667
21693666667
60663906977
21885906977
61528909091
22079909091
62395555556
22275555556
6326373913
2247273913
64133361702
22671361702
65004333333
22871333333
65876571429
23072571429
6675
23275

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28
29 29
30 30
31 31
32 32
33 33
34 34
35 35
36 36
37 37
38 38
39 39
40 40
41 41
42 42
43 43
44 44
45 45
46 46
47 47
48 48
49 49
50 50

Sheet1

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo
0 100 100
1 91152 80556
2 57054 45787
3 46289333333 34351333333
4 41358 28749
5 3876 2548
6 37328666667 23377666667
7 36564 21942
8 36216 20923
9 36145777778 20181777778
10 3627 19635
11 36535636364 19229636364
12 36907333333 18930333333
13 37360615385 18712615385
14 37878 18559
15 38446666667 18456666667
16 39057 18396
17 39701647059 18369647059
18 40374888889 18371888889
19 41072210526 18398210526
20 4179 18445
21 42525333333 18509333333
22 43275818182 18588818182
23 44039478261 18681478261
24 44814666667 18785666667
25 456 189
26 46394307692 19023307692
27 47196592593 19154592593
28 48006 19293
29 48821793103 19437793103
30 49643333333 19588333333
31 50470064516 19744064516
32 513015 199045
33 52137212121 20069212121
34 52976823529 20237823529
35 5382 2041
36 54666444444 20585444444
37 55515891892 20763891892
38 56368105263 20945105263
39 57222871795 21128871795
40 5808 21315
41 58939317073 21503317073
42 59800666667 21693666667
43 60663906977 21885906977
44 61528909091 22079909091
45 62395555556 22275555556
46 6326373913 2247273913
47 64133361702 22671361702
48 65004333333 22871333333
49 65876571429 23072571429
50 6675 23275

Sheet1

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo

Sheet2

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo

Sheet3

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Pipelining Idealistic Assumptions Uniform Sub-computations

The computation to be pipelined can be evenly partitioned into uniform-latency sub-operations

Repetition of Identical ComputationsThe same computations are to be performed repeatedly on a large number

of different inputs

Repetition of Independent Computations All the repetitions of the same computation are mutually independent ie

no data dependency and no resource conflicts

9262016 (copyJP Shen) 18-600 Lecture 8 98

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Instruction Pipeline Design Uniform Sub-computations NOT

rArr balancing pipeline stages- stage quantization to yield balanced pipe stages- minimize internal fragmentation (some waiting stages)

Identical Computations NOT rArr unifying instruction types

- coalescing instruction types into one multi-function pipe- minimize external fragmentation (some idling stages)

Independent Computations NOTrArr resolving pipeline hazards

- inter-instruction dependency detection and resolution- minimize performance loss due to pipeline stalls

9262016 (copyJP Shen) 18-600 Lecture 8 99

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo

Bryant and OrsquoHallaron Computer Systems A Programmerrsquos Perspective Third Edition

Lecture 9ldquoPipelined Processor Designrdquo

John P Shen amp Zhiyi YuSeptember 28 2016

9262016 (copyJP Shen) 18-600 Lecture 8 100

18-600 Foundations of Computer Systems

Required Reading Assignmentbull Chapter 4 of CSAPP (3rd edition) by Randy Bryant amp Dave OrsquoHallaron

Recommended Reference Chapters 1 and 2 of Shen and Lipasti (SnL)

  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computer Instruction Set Architecture
  • Instruction Set Architecture
  • CISC Instruction Sets
  • RISC Instruction Sets
  • Computer Dynamic-Static Interface
  • CISC vs RISC
  • MIPS Registers
  • MIPS Instruction Examples
  • Evolutions of MIPS and ARM ISAs
  • ISA Design Summary
  • Y86-64 Processor State
  • Y86-64 Instruction Set
  • Y86-64 Instruction Formats
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Encoding Register Specifiers
  • Instruction Format Example
  • Arithmetic and Logical Operations
  • Move Operations
  • Move Instruction Examples
  • Conditional Move Instructions
  • Jump Instructions
  • Jump Instructions
  • Y86-64 Program Stack
  • Stack Operations
  • Subroutine Call and Return
  • Miscellaneous Instructions
  • Status Conditions
  • Writing Y86-64 Code
  • Y86-64 Code Generation Example
  • Y86-64 Code Generation Example 2
  • Y86-64 Code Generation Example 3
  • Y86-64 Sample Program Structure 1
  • Y86-64 Program Structure 2
  • Y86-64 Program Structure 3
  • Assembling Y86-64 Program
  • Simulating Y86-64 Program
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Building Blocks
  • (Cache) Memory Implementation Options
  • Hardware Control Language
  • HCL Operations
  • SEQ Hardware Structure
  • SEQ Stages
  • Instruction Decoding
  • Executing ArithmeticLogical Operation
  • Stage Computation ArithLog Ops
  • Executing rmmovq
  • Stage Computation rmmovq
  • Executing popq
  • Stage Computation popq
  • Executing Conditional Moves
  • Stage Computation Cond Move
  • Executing Jumps
  • Stage Computation Jumps
  • Executing call
  • Stage Computation call
  • Executing ret
  • Stage Computation ret
  • Computation Steps
  • Computation Steps
  • Computed Values
  • SEQ Hardware
  • Fetch Logic
  • Fetch Logic
  • Fetch Control Logic in HCL
  • Fetch Control Logic in HCL
  • Decode Logic
  • A Source
  • E Destination
  • Execute Logic
  • ALU A Input
  • ALU Operation
  • Memory Logic
  • Instruction Status
  • Memory Address
  • Memory Read
  • PC Update
  • SEQ Operation
  • SEQ Operation 2
  • SEQ Operation 3
  • SEQ Operation 4
  • SEQ Operation 5
  • Lecture 8ldquoProcessor Architecture and Designrdquo
  • Computational Example
  • 3-Way Pipelined Version
  • Example Integer Multiplier
  • Example Integer Multiplier
  • Pipelining Fundamentals
  • Limitations Register Overhead
  • Pipelining Performance Model
  • Hardware Cost Model
  • CostPerformance Trade-off
  • Slide Number 97
  • Pipelining Idealistic Assumptions
  • Instruction Pipeline Design
  • Lecture 9ldquoPipelined Processor Designrdquo