ALU Architecture and ISA Extensions

Lecture notes from MKP, H. H. Lee and S. Yalamanchili

(2)

Reading
• Sections 3.2-3.5 (only those elements covered in class)
• Sections 3.6-3.8
• Appendix B.5
• Practice Problems: 26, 27
• Goal: Understand the ISA view of the core microarchitecture
  - Organization of functional units and register files into basic data paths

(3)

Overview
• Instruction Set Architectures have a purpose
  - Applications dictate what we need
• We only have a fixed number of bits
  - Impact on accuracy
• More is not better
  - We cannot afford everything we want
• Basic Arithmetic Logic Unit (ALU) Design
  - Addition/subtraction, multiplication, division

(4)

Reminder: ISA
[Figure: processor and byte addressed memory. The processor holds the Arithmetic Logic Unit (ALU), the register file (programmer visible state), the program counter and instruction register (programmer invisible state), kernel registers, processor internal buses, and the memory interface. The memory map shows reserved, text segment, data segment (static), dynamic data, and stack regions, with address labels 0x00, 0x01, 0x02, 0x03, 0x1F, and 0xFFFFFFFF. Who sees what?]

(5)

Arithmetic for Computers
• Operations on integers
  - Addition and subtraction
  - Multiplication and division
  - Dealing with overflow
• Operations on floating-point real numbers
  - Representation and operations
• Let us first look at integers

(6)

Integer Addition (3.2)
• Example: 7 + 6
• Overflow if result out of range
  - Adding +ve and –ve operands, no overflow
  - Adding two +ve operands
    o Overflow if result sign is 1
  - Adding two –ve operands
    o Overflow if result sign is 0
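The sign rules above can be checked with a small simulation. This is a minimal sketch in Python, used here only for illustration; the names `MASK32` and `add32` are made up for this example and are not part of MIPS or the lecture material.

```python
MASK32 = 0xFFFFFFFF  # keep values to 32 bits, as a 32-bit register would


def add32(a, b):
    """Add two 32-bit patterns; return (result_pattern, overflow_flag)."""
    a &= MASK32
    b &= MASK32
    result = (a + b) & MASK32
    sign_a = a >> 31
    sign_b = b >> 31
    sign_r = result >> 31
    # Overflow only when both operands share a sign and the result's sign differs.
    overflow = (sign_a == sign_b) and (sign_r != sign_a)
    return result, overflow
```

For example, `add32(7, 6)` gives `(13, False)`, while adding 1 to the largest positive value `0x7FFFFFFF` produces a result with sign bit 1 and sets the overflow flag.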

(7)

Integer Subtraction
• Add negation of second operand
• Example: 7 – 6 = 7 + (–6), in 2’s complement representation
    +7: 0000 0000 … 0000 0111
    –6: 1111 1111 … 1111 1010
    +1: 0000 0000 … 0000 0001
• Overflow if result out of range
  - Subtracting two +ve or two –ve operands, no overflow
  - Subtracting +ve from –ve operand
    o Overflow if result sign is 0
  - Subtracting –ve from +ve operand
    o Overflow if result sign is 1

(8)

ISA Impact
• Some languages (e.g., C) ignore overflow
  - Use MIPS addu, addiu, subu instructions
• Other languages (e.g., Ada, Fortran) require raising an exception
  - Use MIPS add, addi, sub instructions
  - On overflow, invoke exception handler
    o Save PC in exception program counter (EPC) register
    o Jump to predefined handler address
    o mfc0 (move from coprocessor register) instruction can retrieve EPC value, to return after corrective action (more later)
• ALU Design leads to many solutions. We look at one simple example

(9)

Integer ALU (arithmetic logic unit) (B.5)
• Build a 1 bit ALU, and use 32 of them (bit-slice)
[Figure: ALU symbol with inputs a and b, an operation select, and output result; accompanying table with columns op, a, b, res]

(10)

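Single Bit ALU
[Figure: inputs A and B feed an AND gate and an OR gate; a multiplexor controlled by Operation selects input 0 or input 1 as Result]
Implements only AND and OR operations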

(11)

Adding Functionality
• We can add additional operators (to a point)
• How about addition?
• Review full adders from digital design
    cout = a·b + a·cin + b·cin
    sum = a ⊕ b ⊕ cin
[Figure: full adder with inputs a, b, CarryIn and outputs Sum, CarryOut]
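The full adder equations can be written directly as bit operations. The following is a minimal Python sketch; the function name `full_adder` is chosen for this example, and each argument is assumed to be a single bit (0 or 1).

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    total = a ^ b ^ cin                        # sum = a XOR b XOR cin
    cout = (a & b) | (a & cin) | (b & cin)     # carry when at least two inputs are 1
    return total, cout
```

A quick sanity check is that `sum + 2 * cout` equals `a + b + cin` for all eight input combinations.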

(12)

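Building a 32-bit ALU
[Figure, 1-bit slice: inputs a, b, and CarryIn; a multiplexor with inputs 0, 1, 2 selected by Operation produces Result; the adder produces CarryOut]
[Figure, 32-bit ALU: slices ALU0 to ALU31; slice i takes ai and bi and produces Resulti; the CarryOut of each slice drives the CarryIn of the next; Operation is shared by all slices]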

(13)

Subtraction (a – b)?
• Two's complement approach: just negate b and add 1.
• How do we negate?
• A clever solution:
[Figure, 1-bit slice: a second multiplexor, controlled by Binvert, selects b or its complement ahead of the adder; Operation still selects among inputs 0, 1, 2 for Result]
[Figure, 32-bit ALU: slices ALU0 to ALU31 with Binvert and Operation shared by all slices; a sub signal is shown alongside Binvert and the CarryIn of ALU0]
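A bit-level sketch of the idea: inverting every bit of b and feeding a 1 into the carry-in of the lowest slice computes a + (not b) + 1, which is a – b in two's complement. The Python below is illustrative only; `ripple_add_sub` and its `subtract` argument are names invented for this example, and the single `subtract` bit is assumed to drive both the invert select and the first carry-in.

```python
def ripple_add_sub(a, b, subtract, width=32):
    """Compute a + b, or a - b when subtract is 1, one bit slice at a time."""
    carry = subtract                 # carry-in of slice 0 supplies the "+1"
    result = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ subtract      # pass b, or its complement when subtracting
        sum_i = a_i ^ b_i ^ carry
        carry = (a_i & b_i) | (a_i & carry) | (b_i & carry)
        result |= sum_i << i
    return result                    # final carry-out is discarded
```

With 32-bit operands, `ripple_add_sub(7, 6, 1)` yields 1, and `ripple_add_sub(6, 7, 1)` yields `0xFFFFFFFF`, the two's complement pattern for –1.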

(14)

Tailoring the ALU to the MIPS
• Need to support the set-on-less-than instruction (slt)
  - remember: slt is an arithmetic instruction
  - produces a 1 if rs < rt and 0 otherwise
  - use subtraction: (a-b) < 0 implies a < b
• Need to support test for equality (beq $t5, $t6, Label)
  - use subtraction: (a-b) = 0 implies a = b

(15)

[Figure, 1-bit slice: the Result multiplexor gains a fourth input, Less (inputs 0 to 3), alongside Binvert, CarryIn, and CarryOut]
[Figure, 32-bit ALU: the Less input of ALU1 to ALU31 is tied to 0; ALU31 produces Set and Overflow in addition to Result31; ALU0 has a Less input]
What is Result31 when (a-b) < 0?
Unsigned vs. signed support

(16)

Test for equality
• Notice control lines:
    000 = and
    001 = or
    010 = add
    110 = subtract
    111 = slt
• Note: zero is a 1 when the result is zero!
[Figure: 32-bit ALU built from ALU0 to ALU31 with Bnegate, Operation, Set, Less, Overflow, and a Zero output derived from Result0 to Result31]
Note test for overflow!
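The control encoding can be exercised with a behavioral model. This Python sketch is an assumption-laden illustration, not the lecture's circuit: the function name `alu` and its return tuple are invented here, the high control bit is treated as Bnegate and the low two bits as the multiplexor select, and the slt result is corrected with the overflow bit so that it stays right when a – b overflows (the basic figure may use the raw sign bit instead).

```python
MASK32 = 0xFFFFFFFF


def alu(control, a, b):
    """Behavioral 32-bit ALU: returns (result, zero, overflow)."""
    a &= MASK32
    b &= MASK32
    bnegate = (control >> 2) & 1          # 1 for subtract and slt
    operation = control & 0b11            # 00 and, 01 or, 10 adder, 11 less
    b_eff = (b ^ MASK32) if bnegate else b
    total = (a + b_eff + bnegate) & MASK32
    sign_a, sign_b, sign_t = a >> 31, b_eff >> 31, total >> 31
    adder_overflow = sign_a == sign_b and sign_t != sign_a
    if operation == 0b00:
        result = a & b_eff
    elif operation == 0b01:
        result = a | b_eff
    elif operation == 0b10:
        result = total
    else:
        # slt: sign of (a - b), flipped when the subtraction overflowed
        result = sign_t ^ int(adder_overflow)
    overflow = adder_overflow and operation == 0b10
    zero = int(result == 0)               # zero is 1 when the result is zero
    return result, zero, overflow
```

A beq-style equality test then amounts to issuing control `0b110` (subtract) and reading the zero flag: `alu(0b110, 5, 5)` returns `(0, 1, False)`.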

(17)

ISA View
• Register-to-Register data path
• We want this to be as fast as possible
[Figure: CPU/Core with registers $0, $1, …, $31 connected to the ALU]

(18)

Multiplication (3.3)
• Long multiplication

        1000      multiplicand
      × 1001      multiplier
        ----
        1000
       0000
      0000
     1000
     -------
     1001000      product

• Length of product is the sum of operand lengths
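Long multiplication in binary reduces to shifting and adding: for each 1 bit in the multiplier, add a copy of the multiplicand shifted to that bit position. A minimal Python sketch follows; `shift_add_multiply` is a name made up for this example and it handles unsigned operands only.

```python
def shift_add_multiply(multiplicand, multiplier, width=4):
    """Multiply two unsigned width-bit values; the product needs up to 2*width bits."""
    product = 0
    for i in range(width):
        if (multiplier >> i) & 1:            # multiplier bit i is 1
            product += multiplicand << i     # add the shifted partial product
    return product
```

Running it on the slide's operands, `shift_add_multiply(0b1000, 0b1001)` returns `0b1001000`.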

(19)

A Multiplier
• Uses multiple adders
  - Cost/performance tradeoff
• Can be pipelined
  - Several multiplications performed in parallel

(20)

MIPS Multiplication
• Two 32-bit registers for product
  - HI: most-significant 32 bits
  - LO: least-significant 32 bits
• Instructions
  - mult rs, rt / multu rs, rt
    o 64-bit product in HI/LO
  - mfhi rd / mflo rd
    o Move from HI/LO to rd
    o Can test HI value to see if product overflows 32 bits
  - mul rd, rs, rt
    o Least-significant 32 bits of product –> rd

Study Exercise: Check out signed and unsigned multiplication with QtSPIM
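Before trying this in QtSPIM, it can help to predict what HI and LO should hold. The sketch below models the 64-bit product split in Python; the function names mirror the instruction mnemonics but are otherwise hypothetical helpers, and registers are represented as unsigned 32-bit patterns.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def to_signed32(word):
    """Interpret a 32-bit pattern as a two's complement integer."""
    word &= MASK32
    return word - (1 << 32) if word >> 31 else word


def multu(rs, rt):
    """Unsigned multiply: returns (HI, LO) as 32-bit patterns."""
    product = (rs & MASK32) * (rt & MASK32)
    return product >> 32, product & MASK32


def mult(rs, rt):
    """Signed multiply: returns (HI, LO) as 32-bit patterns."""
    product = (to_signed32(rs) * to_signed32(rt)) & MASK64
    return product >> 32, product & MASK32
```

The same bit pattern gives different HI values under the two interpretations: `multu(0xFFFFFFFF, 2)` returns `(1, 0xFFFFFFFE)`, whereas `mult(0xFFFFFFFF, 2)` treats the first operand as –1 and returns `(0xFFFFFFFF, 0xFFFFFFFE)`.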

(21)

Division (3.4)
• Check for 0 divisor
• Long division approach
  - If divisor ≤ dividend bits
    o 1 bit in quotient, subtract
  - Otherwise
    o 0 bit in quotient, bring down next dividend bit
• Restoring division
  - Do the subtract, and if remainder goes < 0, add divisor back
• Signed division
  - Divide using absolute values
  - Adjust sign of quotient and remainder as required

                      1001       quotient
    divisor 1000 ) 1001010       dividend
                  -1000
                      10
                      101
                      1010
                     -1000
                        10       remainder

• n-bit operands yield n-bit quotient and remainder
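The restoring scheme can be sketched as a loop that brings down one dividend bit per step, tries the subtraction, and adds the divisor back if the remainder went negative. This Python version is illustrative only: the function names are invented for the example, and the sign rule in `signed_divide` (remainder takes the sign of the dividend) is one common convention, stated here as an assumption.

```python
def restoring_divide(dividend, divisor, width=32):
    """Unsigned restoring division: returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor is zero")
    quotient = 0
    remainder = 0
    for i in range(width - 1, -1, -1):
        # Bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        remainder -= divisor                  # trial subtraction
        if remainder < 0:
            remainder += divisor              # restore; quotient bit is 0
            quotient <<= 1
        else:
            quotient = (quotient << 1) | 1    # subtraction stands; quotient bit is 1
    return quotient, remainder


def signed_divide(dividend, divisor, width=32):
    """Divide using absolute values, then adjust signs (assumed convention)."""
    quotient, remainder = restoring_divide(abs(dividend), abs(divisor), width)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    if dividend < 0:
        remainder = -remainder
    return quotient, remainder
```

On the slide's example, `restoring_divide(0b1001010, 0b1000)` returns `(0b1001, 0b10)`.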

(22)

Faster Division
• Can’t use parallel hardware as in multiplier
  - Subtraction is conditional on sign of remainder
• Faster dividers (e.g. SRT division) generate multiple quotient bits per step
  - Still require multiple steps
• Customized implementations for high performance, e.g., supercomputers

(23)

MIPS Division
• Use HI/LO registers for result
  - HI: 32-bit remainder
  - LO: 32-bit quotient
• Instructions
  - div rs, rt / divu rs, rt
  - No overflow or divide-by-0 checking
    o Software must perform checks if required
  - Use mfhi, mflo to access result

Study Exercise: Check out signed and unsigned division with QtSPIM

(24)

ISA View
• Additional function units and registers (Hi/Lo)
• Additional instructions to move data to/from these registers
  - mfhi, mflo
• What other instructions would you add? Cost?
[Figure: CPU/Core with registers $0, $1, …, $31, the ALU, and a Multiply/Divide unit with Hi and Lo registers]

(25)

Floating Point (3.5)
• Representation for non-integral numbers
  - Including very small and very large numbers
• Like scientific notation
  - –2.34 × 10^56 (normalized)
  - +0.002 × 10^–4 (not normalized)
  - +987.02 × 10^9 (not normalized)
• In binary
  - ±1.xxxxxxx₂ × 2^yyyy
• Types float and double in C

(26)

IEEE 754 Floating-point Representation

Single Precision (32-bit)
    bit 31: S (1 bit) | bits 30-23: exponent (8 bits) | bits 22-0: significand (23 bits)
    value = (–1)^sign × (1 + fraction) × 2^(exponent – 127)

Double Precision (64-bit)
    bit 63: S (1 bit) | bits 62-52: exponent (11 bits) | bits 51-32: significand (20 bits) | bits 31-0: significand, continued (32 bits)
    value = (–1)^sign × (1 + fraction) × 2^(exponent – 1023)
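The single precision formula can be applied by hand or with a few shifts and masks. The following Python sketch decodes a 32-bit pattern using only the formula above; `decode_single` and `float_to_bits` are names invented for this example, and the reserved exponent values 0 and 255 (zero, denormals, infinity, NaN) are deliberately rejected because the formula shown does not cover them.

```python
import struct


def decode_single(bits):
    """Decode a normalized single-precision pattern with the formula above."""
    sign = (bits >> 31) & 1
    exponent = (bits >> 23) & 0xFF
    fraction = (bits & 0x7FFFFF) / (1 << 23)     # 23 fraction bits as a value in [0, 1)
    if exponent in (0, 255):
        raise ValueError("reserved exponent: not covered by this sketch")
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)


def float_to_bits(value):
    """Return the 32-bit single-precision encoding of value as an integer."""
    return struct.unpack(">I", struct.pack(">f", value))[0]
```

For instance, `0xBF400000` has sign 1, exponent 126, and fraction 0.5, so `decode_single(0xBF400000)` evaluates to –0.75.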

(27)

Floating Point Standard
• Defined by IEEE Std 754-1985
• Developed in response to divergence of representations
  - Portability issues for scientific code
• Now almost universally adopted
• Two representations
  - Single precision (32-bit)
  - Double precision (64-bit)

(28)

FP Adder Hardware
• Much more complex than integer adder
• Doing it in one clock cycle would take too long
  - Much longer than integer operations
  - Slower clock would penalize all instructions
• FP adder usually takes several cycles
  - Can be pipelined
Example: FP Addition

(29)

FP Adder Hardware
[Figure: FP adder datapath, annotated Step 1 through Step 4]

(30)

FP Arithmetic Hardware
• FP multiplier is of similar complexity to FP adder
  - But uses a multiplier for significands instead of an adder
• FP arithmetic hardware usually does
  - Addition, subtraction, multiplication, division, reciprocal, square-root
  - FP ↔ integer conversion
• Operations usually take several cycles
  - Can be pipelined

(31)

ISA Impact
• FP hardware is coprocessor 1
  - Adjunct processor that extends the ISA
• Separate FP registers
  - 32 single-precision: $f0, $f1, … $f31
  - Paired for double-precision: $f0/$f1, $f2/$f3, …
    o Release 2 of MIPS ISA supports 32 × 64-bit FP reg’s
• FP instructions operate only on FP registers
  - Programs generally do not perform integer ops on FP data, or vice versa
  - More registers with minimal code-size impact

(32)

ISA View: The Co-Processor
• Floating point operations access a separate set of 32-bit registers
  - Pairs of 32-bit registers are used for double precision
[Figure: CPU/Core with registers $0 to $31, the ALU, and a Multiply/Divide unit with Hi and Lo; Co-Processor 1 with FP registers $0 to $31 and the FP ALU; Co-Processor 0 with BadVaddr, Status, Cause, and EPC registers (covered later)]

(33)

ISA View
• Distinct instructions operate on the floating point registers (pg. A-73)
  - Arithmetic instructions
    o add.d fd, fs, ft (double precision) and add.s fd, fs, ft (single precision)
• Data movement to/from floating point coprocessors
  - mfc1 rt, fs and mtc1 rd, fs
• Note that the ISA design implementation is extensible via co-processors
• FP load and store instructions
  - lwc1, ldc1, swc1, sdc1
    o e.g., ldc1 $f8, 32($sp)
Example: DP Mean

(34)

Associativity
• Floating point arithmetic is not associative
• Parallel programs may interleave operations in unexpected orders
  - Assumptions of associativity may fail

                    (x+y)+z       x+(y+z)
    x              -1.50E+38     -1.50E+38
    y               1.50E+38      1.50E+38
    z               1.0           1.0
    x+y or y+z      0.00E+00      1.50E+38
    result          1.00E+00      0.00E+00

Need to validate parallel programs under varying degrees of parallelism
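The table's values can be reproduced in a few lines. This Python sketch emulates single precision by rounding every operand and every sum through a 32-bit encoding; the helper names `f32` and `add_f32` are invented for the example, and the emulation (compute in double, then round to single) is an assumption that holds for addition of single-precision operands.

```python
import struct


def f32(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def add_f32(a, b):
    """Single-precision addition, emulated by rounding the double-precision sum."""
    return f32(f32(a) + f32(b))


x, y, z = f32(-1.5e38), f32(1.5e38), f32(1.0)
left = add_f32(add_f32(x, y), z)     # (x + y) + z
right = add_f32(x, add_f32(y, z))    # x + (y + z)
```

Here `left` is 1.0 and `right` is 0.0: in the second ordering, adding 1.0 to 1.5E+38 is lost to rounding before x cancels y.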

(35)

Performance Issues
• Latency of instructions
  - Integer instructions can take a single cycle
  - Floating point instructions can take multiple cycles
  - Some (FP Divide) can take hundreds of cycles
• What about energy (we will get to that shortly)
• What other instructions would you like in hardware?
  - Would some applications change your mind?
• How do you decide whether to add new instructions?

(36)

Characterizing Parallelism
• Characterization due to M. Flynn*

                                     Data Streams
                                  Single      Multiple
    Instruction      Single       SISD        SIMD
    Streams          Multiple     MISD        MIMD

  - SISD: today's serial computing cores (von Neumann model)
  - SIMD: single instruction multiple data stream computing, e.g., SSE
  - MIMD: today’s multicore

*M. Flynn (September 1972). "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers, C–21 (9): 948–960.

(37)

Parallelism Categories

From http://en.wikipedia.org/wiki/Flynn%27s_taxonomy

(38)

Multimedia (3.6, 3.7, 3.8)
• Lower dynamic range and precision requirements
  - Do not need 32-bits!
• Inherent parallelism in the operations

(39)

Vector Computation
• Operate on multiple data elements (vectors) at a time
• Flexible definition/use of registers
• Registers hold integers, floats (SP), doubles (DP)
[Figure: a 128-bit register viewed as 1 × 128-bit integer, 4 × 32-bit single precision, 2 × 64-bit double precision, or 8 × 16-bit short integers]

(40)

Processing Vectors
[Figure: data moving between memory and vector registers]
• When is this more efficient?
• When is this not efficient?
• Think of 3D graphics, linear algebra and media processing

(41)

Case Study: Intel Streaming SIMD Extensions
• Eight 128-bit XMM registers
  - x86-64 adds 8 more registers, XMM8-XMM15
• 8, 16, 32, 64 bit integers (SSE2)
• 32-bit (SP) and 64-bit (DP) floating point
• Signed/unsigned integer operations
• IEEE 754 floating point support
• Reading Assignment:
  - http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
  - http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I

(42)

Instruction Categories
• Floating point instructions
  - Arithmetic, movement
  - Comparison, shuffling
  - Type conversion, bit level
• Integer
• Other
  - e.g., cache management
• ISA extensions!
• Advanced Vector Extensions (AVX)
  - Successor to SSE
[Figure: instruction operand diagram with register and memory labels]

(43)

Arithmetic View
• Graphics and media processing operates on vectors of 8-bit and 16-bit data
  - Use 64-bit adder, with partitioned carry chain
    o Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors
  - SIMD (single-instruction, multiple-data)
• Saturating operations
  - On overflow, result is largest representable value
    o cf. 2s-complement modulo arithmetic
  - E.g., clipping in audio, saturation in video
[Figure: 64-bit word partitioned as 4x16-bit and as 2x32-bit]
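A partitioned carry chain can be modeled by adding each lane separately and refusing to let a lane's carry spill into its neighbor. The Python sketch below is a software illustration under stated assumptions: `partitioned_add` is a made-up name, lanes are treated as unsigned, and the `saturate` flag switches between wraparound (modulo) and clamping to the largest lane value.

```python
def partitioned_add(a, b, lane_bits, saturate=False, total_bits=64):
    """Add two packed words lane by lane; carries never cross lane boundaries."""
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, total_bits, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        if saturate:
            lane = min(lane_sum, lane_mask)     # clamp to largest representable value
        else:
            lane = lane_sum & lane_mask         # drop the carry-out: modulo arithmetic
        result |= lane << shift
    return result
```

With 8-bit lanes, `partitioned_add(0x00FF00FF, 0x00010001, 8)` wraps both 0xFF lanes to 0x00, while the same call with `saturate=True` leaves them at 0xFF. With 16-bit lanes the same operands give `0x01000100`, since the carry out of bit 7 now stays inside the lane.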

(44)

SSE Example

    // A 16byte = 128bit vector struct
    struct Vector4
    {
        float x, y, z, w;
    };

    // Add two constant vectors and return the resulting vector
    Vector4 SSE_Add ( const Vector4 &Op_A, const Vector4 &Op_B )
    {
        Vector4 Ret_Vector;

        __asm
        {
            MOV EAX, Op_A                // Load pointers into CPU regs
            MOV EBX, Op_B

            MOVUPS XMM0, [EAX]           // Move unaligned vectors to SSE regs
            MOVUPS XMM1, [EBX]

            ADDPS XMM0, XMM1             // Add vector elements
            MOVUPS [Ret_Vector], XMM0    // Save the return vector
        }
        return Ret_Vector;
    }

From http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I

More complex example (matrix multiply) in Section 3.8, using AVX

(45)

Intel Xeon Phi
[Figures from www.anandtech.com and www.techpowerup.com]

(46)

Data Parallel vs. Traditional Vector
[Figure, Vector Architecture: Vector Register A, Vector Register B, and Vector Register C connected through a pipelined functional unit]
[Figure, Data Parallel Architecture: registers; process each square in parallel (data parallel computation)]

(47)

ISA View
• Separate core data path
• Can be viewed as a co-processor with a distinct set of instructions
[Figure: CPU/Core with registers $0 to $31, the ALU, and a Multiply/Divide unit with Hi and Lo; SIMD registers XMM0, XMM1, …, XMM15 with a Vector ALU]

(48)

Domain Impact on the ISA: Example

    Scientific Computing        Embedded Systems
    • Floats                    • Integers
    • Double precision          • Lower precision
    • Massive data              • Streaming data
    • Power constrained         • Security support
                                • Energy constrained

(49)

Summary
• ISAs support operations required of application domains
  - Note the differences between embedded and supercomputers!
  - Signed, unsigned, FP, SIMD, etc.
• Bounded precision effects
  - Software must be careful how the hardware is used, e.g., associativity
  - Need standards to promote portability
• Avoid “kitchen sink” designs
  - There is no free lunch
  - Impact on speed and energy (we will get to this later)

(50)

Study Guide
• Perform 2’s complement addition and subtraction (review)
• Add a few more instructions to the simple ALU
  - Add an XOR instruction
  - Add an instruction that returns the max of its inputs
  - Make sure all control signals are accounted for
• Convert real numbers to single precision floating point (review) and extract the value from an encoded single precision number (review)
• Execute the SPIM programs (class website) that use floating point numbers. Study the memory/register contents via single step execution

(51)

Study Guide (cont.)
• Write a few simple SPIM programs for
  - Multiplication/division of signed and unsigned numbers
    o Use numbers that produce >32-bit results
    o Move to/from HI and LO registers (find the instructions for doing so)
  - Addition/subtraction of floating point numbers
• Try to write a simple SPIM program that demonstrates that floating point operations are not associative (this takes some thought and review of the range of floating point numbers)
• Look up additional SIMD instruction sets and compare
  - ARM NEON, AltiVec, AMD 3DNow!

(52)

Glossary
• Co-processor
• Data parallelism
• Data parallel computation vs. vector computation
• Instruction set extensions
• Overflow
• MIMD
• Precision
• SIMD
• Saturating arithmetic
• Signed arithmetic support
• Unsigned arithmetic support
• Vector processing
