1 ECE 371 Microprocessors Chapter 2 Microprocessors: A High Level View Herbert G. Mayer, PSU Status 10/13/2015 For use at CCUT Fall 2015 Some Material

1

ECE 371Microprocessors

Chapter 2

Microprocessors: A High Level View

Herbert G. Mayer, PSUStatus 10/13/2015

For use at CCUT Fall 2015Some Material inherited from Eric Krause @ PSU MS ECE

2

Syllabus

Introduction Block Diagrams Simple µP Operation Arithmetic Logic Unit ALU BCD Arithmetic Instruction Set Architecture (ISA) Iron Law of µP Performance Amdahl’s Law Registers and Score Board Bibliographys

3

Introduction

A microprocessor µP is functionally similar to an old-fashioned mainframe (MF) computer, more even to a typical desktop (DT) computer

Yet there are notable differences! Like a Main Frame (MF) and desktop (DT), a

µP:Has a defined instruction set, named ISA: for Instruction Set Architecture

Has a defined data width, e.g. 8-bit bytes, 60-bit words, etc

Has a defined address range, typically 232 or 264

Includes an arithmetic unit for integer, floating-point, BCD, and bit operations

Includes a memory subsystem, plus several levels of caches

Includes an IO subsystem and has other peripheral devices

4

IntroductionDifferent from MF computer or DT, a typical

µP:Has a smaller power envelope, in order to be usable in laptops

Uses a lower main frequency, in order to generate less heat, also enabling laptop use with long battery life

Manages power consumption and heat generation through frequency variation, and module shut-down

Integrates some peripheral devices onto the same silicon, again saving electrical power, saving board space, plus gaining some speed

Provides a smaller number of duplicated arithmetic logical units (ALU) that compute data; as a result some computations, giving identical clock frequencies, run more slowly on a µP due to the lower degree of parallel execution of instructions

Must fit into a small volume of space, e.g. into a laptop; as a result, there exist fewer options for cooling the actual CPUs

5

Introduction

A µP CPU performs arithmetic-logical operations

Memory holds information, has a path to/from CPU via main bus

How does the µP interact with the real world?

IO ports connect external peripheral devices with HW registers ri, and allow ri to access memory

Buses connect components E.g. data bus, address bus,

specific enabling bits

6

Introduction IO ports connect a variety of IO devices Could be a simple latch to read/write Or a separate embedded system, e.g. an

ethernet controller Common: General Purpose IO, or GPIO, which

exposes raw bits IO controller and memory controller

together can become as complex as the CPU E.g. on a motherboard the CPU chip,

generally the largest component, is surrounded by other, good-sized chips

Trend is to relocate some of their functions onto the same die as the CPU

7

Block Diagram of Itanium µP

8

Actual Photo of Itanium2 µP

9

Block Diagram of AMD Athlon

10

Block Diagram of Compaq Alpha

11

Die Image of Compaq Alpha

12

Simple µP Operation We’ll analyze a

hypothetical, abstract µP operation

In effect the simplest generic operation is:

1. Read numeric input from a keyboard; here we read an integer from the keyboard

2. To perform an operation on the read operand: addition of constant integer 7

3. Followed by displaying the newly computed result

13

Simple µP Operation, Step 1

1. Read numeric (e.g. integer) input data from keyboard, port 5:

µP sends memory address for Port Input Instruction on the address bus

µP sends memory read control signal

Memory sends machine code for Port Read to µP

µP decodes instruction, sends out port 5 on address bus

µP sends control signal for Port Read

Input port sends µP one byte from port 5, by placing it on the data bus

14


2. Add integer constant 7 to the data read from port 5:

µP sends memory address for Add instruction on address bus

µP sends control signal for Read on control bus

Memory returns machine code for Add

µP adds integer constant 7 to the value previously read from input port 5

15


3. Display newly computed integer value:

µP sends memory address for Port Output instruction on address bus

µP sends memory read control signal

Memory returns opcode (machine code) for Port Output instruction

µP decodes instruction, sends output port address (0x7) on address bus

µP sends sum from register on data bus

µP sends port write signal on control bus

16

Arithmetic Logic Unit ALU Full adder produces COUT and SUM

using 3 1-bit inputs, A B and CIN

Just via logic gates

What does this do? See result below right:

17

Arithmetic Logic Unit ALU

18


This simple ALU performs any one of 16 arithmetic functions, on two 4-bit words and a carry in

And what determines the function it performs?

Selection plus MODE! We are going to program this device to

perform a function on the two 4-bit inputs and produce a result on the output pins

19


S3 S2 S1 S0

A= A3 A2 A1 A0 L L L L B= B3 B2 B1 B0 L L L H

Basic Format F= F3 F2 F1 F0 L L H L L L H H A= 0 0 1 1 L H L L B= 0 1 0 1 L H L H

Logic OR F = A + B 0 1 1 1 L H H L L H H H A= 0 0 1 1 H L L L B= 0 1 0 1 H L L H

Logic AND F= A * B 0 0 0 1 H L H L H L H H A= 0 0 1 1 H H L L B= 0 1 0 1 H H L H

Logic XOR F= A × B 0 1 1 0 H H H L H H H H

20

Arithmetic Logic Unit ALU Bitwise OR, meaning fi = ai + bi, e.g. f0 =

a0 + b0, etc, as if there were a dual-input OR gate for each bit of 2 inputs

ALU can perform any of these functions based on how it is programmed!

Thus we have a device usable in different ways based on program

That is a fundamental part of a microprocessor

How do we program it? With an operational code (SELECT lines). We call this the OpCode

We add 7; but how can this be done? Use the OpCode for Add, i.e. a binary

value applied to the select code, to select an operation

21

Binary Coded Decimal Arithmetic

Binary Coded Decimal is a machine-internal data type

BCD uses 4 bits per decimal digit But computes in binary, to re-use integer

ALU part of the processor Since only 10 of 16 numeric choices are

used, it will be necessary to add 6 to each decimal digit computed

I.e. it is necessary add 6 to each group of 4 bits

The raw binary addition would be correct only in binary, and incorrect in BCD

Adding 6 per BCD digit is that needed correction

22


With correction, BCD ALU part works like integer add part, with the additional step of adding 6, but the adder is already implemented in HW

Only a constant storage for decimal 6 is needed

Correction algorithm: Given 2 BCD digits A and B:

if sum( A, B ) >= 10102 or carry( A, B ) = 1 then

sum = add( sum( A, B ), 01102 );

end if;

23


24


25


26

BCD Correction Circuit

27


ALU is in the center of above diagram Register File is an ordered sequence of

flip-flops that can be accessed collectively

Core of algorithm is to do 2 adds The /4 means: this is a 4-bit line MuxB will either take the second operand

or add 0, or add 0x110 It has 3 4-input options on it Registers in front of mux hold values so

they are stable for the mux Output register holds value of first

addition, usable for second addition

28


To read the 2 operands from the reg file, send them into the A/B registers and do the first addition

Then use the second addition. It will take the result of the first addition, then MUX B will switch to add 6 or 0

how do we which? Inspecting the flag register

In this simple case, one bit stores carry information

Output register result goes into the correction detection logig:

It decides, to add 0 or add 6 Combinational logic: If Carry, or if one

of these 6 signals, assert the FIX signal that goes to the controller

29


1. Select desired operand registers in register file

2. Transfer operand A from register file to A operand register

3. Transfer operand B from register file to B operand register

4. Set MUX A to select A for ALU

5. Set MUX B to select B for ALU

6. Apply ADD code to ALU select inputs

7. Transfer first sum to output of ALU Output register

8. Transfer COUT and other flags to output of Status register

9. Set MUX A to select temporary sum for the ALU A input

10.Set MUX B to select correction 0110 or 0000, based on 1st result

11.Transfer final sum to ALU output register

12.Select register in register file to write final result

13.Write result to selected register in register file

30

Instruction Set Architecture (ISA) ISA is boundary between Software (SW) and

Hardware (HW) Specifies logical machine that is visible

to the programmer & compiler Is functional specification for processor

designers That boundary is sometimes a very low-

level piece of system software that handles exceptions, interrupts, and HW-specific services

Could fall into the domain of the OS

31

Instruction Set Architecture (ISA) Specified by an ISA: Operations: what to perform and in which

order Temporary Operand Storage in the CPU:

accumulator, stack, registers Note that stack can be word-sized, even

bit-sized (design of successor for NCR’s Century architecture of the 1970s)

Number of operands per instruction Operand location: where and how to

specify/locate the operands Type and size of operands Instruction Encoding in binary

32

Instruction Set Architecture

ISA: Dynamic Static Interface (DSI)

33

Iron Law of µP Performance

Clock-rate doesn’t count, bus width doesn’t count, the number of registers and operations executed in parallel doesn’t count!

What counts is how long it takes for my computational task to complete. That time is of the essence of computing!

If a MIPS-based solution runs at 1 GHz, completing a program X in 2 minutes, while an Intel Pentium® 4–based program runs at 3 GHz and completes that same program x in 2.5 minutes, programmers are more interested in the MIPS solution

34


If a solution on an Intel CPU can be expressed in an object program of size Y bytes, but on an IBM architecture of size 1.1 Y bytes, the Intel solution is generally more attractive

Assuming same execution, performance Meaning of this:

Wall-clock time (Time) is time the user has to wait for program completion

Program Size is indicator of overall physical complexity of computational task

35


36

Amdahl’s Law Articulated by Gene Amdahl During 1967 AFIPS conference; yes,

computers existed then already Stating that the maximum speedup of

a program P is dominated by its sequential portion S

I.e. if some part of P can be perfectly or infinitely accelerated due to numerous parallel processors, but some part S of P is inherently sequential, then the resulting performance is dominated by S

See Wikipedia sample:

37

Amdahl’s Law (From Wikipedia)The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. For example, if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20, no matter how many processors are used

n = element of N, N number of processors

B = element of { 0, 1 }

T(n) = time to execute with n processors

T(n) = T(1) ( B + (1-B) / n )

S(n) = Speedup T(1) / T(n)

S(n) = 1 / ( B + (1 – B ) / n )

38

Amdahl’s Law (From Wikipedia)

39

Registers,Dependences,Score Board

in a Microprocessor

40

Register Dependencies Inter-instruction dependencies, in

engineering parlance also known as dependences, arise between registers or memory locations being defined and used

One instruction computes a result into a register (or memory); another instruction needs that result from that same register, or from that same memory location

Or, one instruction uses a resource; and after its use the same resource is newly recomputed

Dependences require sequential execution, lest the result is unpredictable

41

Register DependenciesTrue-Dependence, AKA Data Dependence: <- synonymous!r3 ← r1 op r2 1: r5 ← r3 op r4 2: Read after Write, RAW

Anti-Dependence, not a true dependenceparallelize under right conditionr3 ← r1 op r2 1:r1 ← r5 op r4 2: Write after Read, WAR

Output Dependence, similar to Anti-Dependence: can do something

r3 ← r1 op r2 1: Writer5 ← r3 op r4 2: User3 ← r6 op r7 3: Write after Write, WAW, use

between

42

Register Dependencies

Control Dependence:

// ri, i = 1..4 come in “live”

if ( condition1 ) {

r3 = r1 op r2;

}else{ see the jump here?

r5 = r3 op r4;

} // end if

write( r3 );

43

Register Renaming Only data dependence is a real

dependence, hence called true dependence

Other dependences are artifacts of insufficient resources, generally insufficient registers

This means: if additional registers were available, then replacing some of these conflicting registers with new, other registers, could make the conflict disappear!

Anti- and Output-Dependences are indeed such false dependences

44

Register Renaming

Original Code:-- r2, r3, r5, r6, r7 come in “live” ... code before

-- r1, r4 are not “live”, don’t have values

L1: r1 ← r2 op r3

L2: r4 ← r1 op r5

L3: r1 ← r3 op r6

L4: r3 ← r1 op r7

Dependences before:

Lx: Ly: x, y = 1..4, which dependence?

45

Register RenamingOriginal Code:L1: r1 ← r2 op r3

L2: r4 ← r1 op r5

L3: r1 ← r3 op r6

L4: r3 ← r1 op r7

Initial Dependences, lots of dependences:

L1, L2 true-Dep with r1

L1, L3 output-Dep with r1

L1, L4 anti-Dep with r3



L3, L4 anti-Dep with r3s

46

Register RenamingWhat could we change, if we had some additional registers?

Compute and use other temporaries via other registers to reduce dependences!

That sometimes allows a higher degree of parallelism, due to a lower degree of dependences

More parallelism means faster execution!

47

Register RenamingOriginal Code: New Code, after adding

regs:

L1: r1 ← r2 op r3 r10 ← r2 op r30 –- r30 instead

L2: r4 ← r1 op r5 r4 ← r10 op r5 –- r10 instead

L3: r1 ← r3 op r6 r1 ← r30 op r6

L4: r3 ← r1 op r7 r3 ← r1 op r7

Dependences before: Dependences after:

L1, L2 true-Dep with r1 L1, L2 true-Dep with r10

L1, L3 output-Dep with r1 L3, L4 true-Dep with r1

L1, L4 anti-Dep with r3 // ri, i = 1..7 are “live”




48

Register Renaming• With these additional, renamed regs, the new

code could possibly run in half the time!

• First: Compute into r10 instead of r1, but you need to have such an additional register r10; no time penalty!

• Also: Compute in preceding code into r30 instead of r3, if r30 available; also no time penalty!

• Then the following regs are live afterwards: r1, r3, r4, plus the non-modified ones, i.e. r2! r2 came in live, must go out live!

• While r10 and r30 are don’t cares afterwards

49

Score BoardScore-board: array of single-bit HW programmable resources sb[]

Manages other HW resources, specifically registers

In this single-bit HW array sb[], every bit i in sb[i] is associated with a specific machine register, the one identified by i , i.e. ri

Association is by index, i.e. by name: sb[i] belongs to reg ri

Only if sb[i] = 0, does that register i have valid data, and can therefore be used

If sb[i] = 0 then register ri is NOT in process of being written

If bit i is set, i.e. if sb[i] = 1, then that register ri is reserved; i.e it cannot be used

Initially all sb[*] set to 0; hence all registers ri can be used

50

Score BoardExecution constraints:

rd ← rs op rt

if sb[s] or if sb[t] is set → RAW dependence, hence stall the computation; wait until both rs and rt are available

if sb[d] is set→ WAW dependence, hence stall the write; wait until rd has been used; SW can sometimes determine to use another register instead of rd

Else, if none of the 3 registers are in use, dispatch the instruction immediately

51

Score BoardOut of order (ooo) execution has long been common on microprocessors, where speed is critical

To allow ooo execution, upon computing the value of rd : update rd, and clear sb[d]

For uses (reads), HW may use any register i, whose sb[i] is 0

For definitions (writes), HW may set any register j, whose sb[j] is 0

Independent of original order, in which source program was written, i.e. possibly ooo

Ooo instructions retire in original order

52

Bibliography1. http://en.wikipedia.org/wiki/Flynn's_taxonomy

2. http://www.ajwm.net/amayer/papers/B5000.html

3. http://www.robelle.com/smugbook/classic.html

4. http://www.intel.com/design/itanium/manuals.htm

5. http://www.csupomona.edu/~hnriley/www/VonN.html

6. http://cva.stanford.edu/classes/ee482s/scribed/lect11.pdf

Documents

1 ECE 371 Microprocessors Chapter 2 Microprocessors: A High Level View Herbert G. Mayer, PSU Status 10/13/2015 For use at CCUT Fall 2015 Some Material