
ALU Architecture and ISA Extensions

Lecture notes from MKP, H. H. Lee and S. Yalamanchili

Reading
• Sections 3.2-3.5 (only those elements covered in class)
• Sections 3.6-3.8
• Appendix B.5
• Practice Problems: 26, 27
• Goal: Understand the ISA view of the core microarchitecture
    Organization of functional units and register files into basic data paths

Overview
• Instruction Set Architectures have a purpose
    Applications dictate what we need
• We only have a fixed number of bits
    Impact on accuracy
• More is not better
    We cannot afford everything we want
• Basic Arithmetic Logic Unit (ALU) Design
    Addition/subtraction, multiplication, division

Reminder: ISA
[Figure: processor and memory organization. Labels: byte-addressed memory (up to 0xFFFFFFFF); Memory Map with Reserved, Text Segment, Data Segment (static), Dynamic Data, and Stack; Memory Interface; Processor Internal Buses; Arithmetic Logic Unit (ALU); Register File (Programmer Visible State), with index labels 0x00, 0x01, 0x02, 0x03, ..., 0x1F; Program Counter; Programmer Invisible State (kernel registers, instruction register).]
• Who sees what?

Arithmetic for Computers
• Operations on integers
    Addition and subtraction
    Multiplication and division
    Dealing with overflow
• Operations on floating-point real numbers
    Representation and operations
• Let us first look at integers

Integer Addition (3.2)
• Example: 7 + 6
• Overflow if result out of range
    Adding +ve and –ve operands, no overflow
    Adding two +ve operands
      o Overflow if result sign is 1
    Adding two –ve operands
      o Overflow if result sign is 0
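The sign rule above can be sketched in a few lines of C++. This is only an illustration, not part of the original slides: the names `AddResult` and `add_with_overflow` are made up for the example, and the addition is done on unsigned values so that wraparound is well defined.

```cpp
#include <cassert>   // handy when checking the example by hand
#include <cstdint>

// Result of a 32-bit two's complement addition (hypothetical helper type).
struct AddResult {
    int32_t sum;     // result wrapped modulo 2^32
    bool overflow;   // true if the exact sum does not fit in 32 bits
};

// Adds two signed 32-bit values as a two's complement adder would and
// applies the sign rule from the slide to detect overflow.
AddResult add_with_overflow(int32_t a, int32_t b) {
    // Work on unsigned copies: unsigned wraparound is well defined in C++.
    uint32_t ua = static_cast<uint32_t>(a);
    uint32_t ub = static_cast<uint32_t>(b);
    uint32_t us = ua + ub;

    bool sign_a = (ua >> 31) != 0;
    bool sign_b = (ub >> 31) != 0;
    bool sign_s = (us >> 31) != 0;

    // Mixed signs never overflow. Same signs overflow when the result's
    // sign differs from the operands' sign.
    bool overflow = (sign_a == sign_b) && (sign_s != sign_a);

    // Converting back reinterprets the bit pattern on two's complement
    // machines (guaranteed by the language only from C++20 on).
    return {static_cast<int32_t>(us), overflow};
}
```

For example, `add_with_overflow(7, 6)` reports no overflow, while adding 1 to the largest 32-bit positive value does.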

Integer Subtraction
• Add negation of second operand
• Example: 7 – 6 = 7 + (–6), in 2's complement representation
    +7: 0000 0000 … 0000 0111
    –6: 1111 1111 … 1111 1010
    +1: 0000 0000 … 0000 0001
• Overflow if result out of range
    Subtracting two +ve or two –ve operands, no overflow
    Subtracting +ve from –ve operand
      o Overflow if result sign is 0
    Subtracting –ve from +ve operand
      o Overflow if result sign is 1

ISA Impact
• Some languages (e.g., C) ignore overflow
    Use MIPS addu, addiu, subu instructions
• Other languages (e.g., Ada, Fortran) require raising an exception
    Use MIPS add, addi, sub instructions
    On overflow, invoke exception handler
      o Save PC in exception program counter (EPC) register
      o Jump to predefined handler address
      o mfc0 (move from coprocessor register) instruction can retrieve EPC value, to return after corrective action (more later)
• ALU design leads to many solutions. We look at one simple example

Integer ALU (arithmetic logic unit) (B.5)
• Build a 1 bit ALU, and use 32 of them (bit-slice)
[Figure: ALU block with inputs a and b, an operation select, and a result output, next to a table with columns op, a, b, res.]

Single Bit ALU
[Figure: 1-bit ALU with inputs A and B, an Operation select choosing between inputs 0 and 1, and output Result.]
• Implements only AND and OR operations

Adding Functionality
• We can add additional operators (to a point)
• How about addition?
• Review full adders from digital design
    $c_{out} = a b + a c_{in} + b c_{in}$
    $sum = a \oplus b \oplus c_{in}$
[Figure: full adder with inputs a, b, CarryIn and outputs Sum, CarryOut.]
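As a rough software model of the full adder, and of chaining 32 of them bit-slice style, one might write the following. The names `FullAdderOut`, `full_adder`, and `ripple_add` are invented for this sketch; real hardware evaluates all slices as combinational logic rather than in a loop.

```cpp
#include <cassert>   // handy when checking the example by hand
#include <cstdint>

// Outputs of a one-bit full adder (hypothetical helper type).
struct FullAdderOut {
    bool sum;
    bool carry_out;
};

// One-bit full adder written directly from the two equations above.
FullAdderOut full_adder(bool a, bool b, bool carry_in) {
    bool sum = (a != b) != carry_in;                                  // a xor b xor cin
    bool carry_out = (a && b) || (a && carry_in) || (b && carry_in);  // ab + a cin + b cin
    return {sum, carry_out};
}

// 32-bit ripple-carry adder: the CarryOut of slice i feeds the CarryIn of slice i+1.
uint32_t ripple_add(uint32_t a, uint32_t b, bool carry_in = false) {
    uint32_t result = 0;
    bool carry = carry_in;
    for (int i = 0; i < 32; ++i) {
        bool ai = ((a >> i) & 1u) != 0;
        bool bi = ((b >> i) & 1u) != 0;
        FullAdderOut out = full_adder(ai, bi, carry);
        if (out.sum) {
            result |= (1u << i);
        }
        carry = out.carry_out;
    }
    return result;
}
```

Calling `ripple_add(7u, ~6u, true)` computes 7 – 6 by inverting the second operand and setting the first carry-in to 1, which is the two's complement subtraction trick.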

Building a 32-bit ALU
[Figure: left, a 1-bit ALU with inputs a, b, and CarryIn, an Operation select (mux inputs 0, 1, 2), and outputs Result and CarryOut. Right, 32 such slices, ALU0 through ALU31, with inputs a0..a31 and b0..b31, outputs Result0..Result31, a shared Operation select, and the CarryOut of each slice connected to the CarryIn of the next.]

Subtraction (a – b)?
• Two's complement approach: just negate b and add 1.
• How do we negate?
• A clever solution:
[Figure: each 1-bit ALU gains a Binvert-controlled mux (inputs 0 and 1) on its b input, alongside the Operation mux (inputs 0, 1, 2). The slices ALU0 through ALU31 are chained through CarryIn/CarryOut, with a sub control label on the diagram.]

Tailoring the ALU to the MIPS
• Need to support the set-on-less-than instruction (slt)
    remember: slt is an arithmetic instruction
    produces a 1 if rs < rt and 0 otherwise
    use subtraction: (a-b) < 0 implies a < b
• Need to support test for equality (beq $t5, $t6, label)
    use subtraction: (a-b) = 0 implies a = b

[Figure: 32-bit ALU with slt support. Each 1-bit ALU has inputs a, b, Less, and CarryIn, controls Binvert and Operation (mux inputs 0, 1, 2, 3), and outputs Result and CarryOut. The Less inputs of ALU1 through ALU31 are tied to 0; ALU31 also produces Set and Overflow.]
• What is Result31 when (a-b) < 0?
• Unsigned vs. signed support

Test for equality
• Notice control lines:
    000 = and
    001 = or
    010 = add
    110 = subtract
    111 = slt
• Note: zero is a 1 when the result is zero!
• Note test for overflow!
[Figure: the 32-bit ALU (ALU0 through ALU31) with Bnegate and Operation control inputs, Less inputs, the Set output of ALU31, and outputs Result0..Result31, Zero, and Overflow.]
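A behavioral sketch of this ALU, driven by the 3-bit control codes listed above, might look as follows. It is an illustration under assumptions, not the slide's gate-level design: the names `AluOut` and `alu` are invented, the top control bit is treated as Bnegate (invert b and inject a carry-in of 1), and slt is computed as sign XOR overflow so that it stays correct when the subtraction overflows.

```cpp
#include <cassert>   // handy when checking the example by hand
#include <cstdint>

// Outputs of the sketched ALU (hypothetical helper type).
struct AluOut {
    uint32_t result;
    bool zero;       // 1 when the result is zero
    bool overflow;   // signed overflow on add or subtract
};

// control: 0 = and, 1 = or, 2 = add, 6 = subtract, 7 = slt.
// Other codes are not listed on the slide; this sketch just follows the datapath.
AluOut alu(unsigned control, uint32_t a, uint32_t b) {
    bool bnegate = (control & 0x4u) != 0;   // top bit: invert b, carry-in = 1
    unsigned op = control & 0x3u;           // low two bits: result mux select

    uint32_t b_eff = bnegate ? ~b : b;
    uint32_t sum = a + b_eff + (bnegate ? 1u : 0u);

    // Same-sign operands with a different-sign result means signed overflow.
    bool sa = (a >> 31) != 0;
    bool sb = (b_eff >> 31) != 0;
    bool ss = (sum >> 31) != 0;
    bool adder_overflow = (sa == sb) && (ss != sa);

    uint32_t result = 0;
    bool overflow = false;
    switch (op) {
        case 0: result = a & b_eff; break;
        case 1: result = a | b_eff; break;
        case 2: result = sum; overflow = adder_overflow; break;
        case 3: result = (ss != adder_overflow) ? 1u : 0u; break;  // slt
    }
    return {result, result == 0, overflow};
}
```

A `beq` would issue the subtract code and branch when `zero` is set; `slt` uses the same subtraction and keeps only the comparison bit.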

ISA View
• Register-to-Register data path
• We want this to be as fast as possible
[Figure: CPU/Core containing registers $0, $1, ..., $31 connected to the ALU.]

Multiplication (3.3)
• Long multiplication
         1000     (multiplicand)
       × 1001     (multiplier)
       ------
         1000
        0000
       0000
      1000
      -------
      1001000     (product)
• Length of product is the sum of operand lengths
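The long multiplication above is a sequence of shifts and conditional adds. A minimal C++ sketch follows; the name `shift_add_multiply` is invented here, and a 64-bit product is used because the product length is the sum of the two 32-bit operand lengths.

```cpp
#include <cassert>   // handy when checking the example by hand
#include <cstdint>

// Multiplies two 32-bit unsigned values one multiplier bit at a time.
uint64_t shift_add_multiply(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;
    uint64_t shifted = multiplicand;   // multiplicand, shifted left once per step
    for (int i = 0; i < 32; ++i) {
        if (multiplier & 1u) {
            product += shifted;        // this multiplier bit is 1: add the partial product
        }
        shifted <<= 1;                 // line up the next partial product
        multiplier >>= 1;              // move on to the next multiplier bit
    }
    return product;
}
```

With the slide's operands, `shift_add_multiply(8, 9)` adds the partial products for multiplier bits 0 and 3 and returns 72, which is 1001000 in binary.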

A Multiplier
• Uses multiple adders
    Cost/performance tradeoff
• Can be pipelined
    Several multiplications performed in parallel

MIPS Multiplication
• Two 32-bit registers for product
    HI: most-significant 32 bits
    LO: least-significant 32 bits
• Instructions
    mult rs, rt / multu rs, rt
      o 64-bit product in HI/LO
    mfhi rd / mflo rd
      o Move from HI/LO to rd
      o Can test HI value to see if product overflows 32 bits
    mul rd, rs, rt
      o Least-significant 32 bits of product –> rd
Study Exercise: Check out signed and unsigned multiplication with QtSPIM
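Before trying the exercise in QtSPIM, the HI/LO behavior can be modeled in C++. This is a sketch under assumptions: `HiLo`, `mult`, `multu`, and `signed_product_fits_32` are names chosen for the example (the first two borrow the instruction names), and the overflow test for signed products checks that HI is just the sign extension of LO.

```cpp
#include <cassert>   // handy when checking the example by hand
#include <cstdint>

// Model of the HI/LO register pair (hypothetical helper type).
struct HiLo {
    uint32_t hi;   // most-significant 32 bits of the product
    uint32_t lo;   // least-significant 32 bits of the product
};

// Signed 32 x 32 -> 64-bit multiply, as mult would leave it in HI/LO.
HiLo mult(int32_t rs, int32_t rt) {
    int64_t product = static_cast<int64_t>(rs) * static_cast<int64_t>(rt);
    uint64_t bits = static_cast<uint64_t>(product);
    return {static_cast<uint32_t>(bits >> 32), static_cast<uint32_t>(bits)};
}

// Unsigned 32 x 32 -> 64-bit multiply, as multu would leave it in HI/LO.
HiLo multu(uint32_t rs, uint32_t rt) {
    uint64_t bits = static_cast<uint64_t>(rs) * static_cast<uint64_t>(rt);
    return {static_cast<uint32_t>(bits >> 32), static_cast<uint32_t>(bits)};
}

// A signed product fits in 32 bits when HI only repeats the sign bit of LO.
bool signed_product_fits_32(HiLo p) {
    uint32_t sign_fill = (p.lo >> 31) != 0 ? 0xFFFFFFFFu : 0u;
    return p.hi == sign_fill;
}
```

The same bit patterns give different HI values under `mult` and `multu`, which is what the signed versus unsigned exercise is meant to expose.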

Division (3.4)
• Check for 0 divisor
• Long division approach
    If divisor ≤ dividend bits
      o 1 bit in quotient, subtract
    Otherwise
      o 0 bit in quotient, bring down next dividend bit
• Restoring division
    Do the subtract, and if remainder goes < 0, add divisor back
• Signed division
    Divide using absolute values
    Adjust sign of quotient and remainder as required
• Example (divisor 1000, dividend 1001010)
                1001     (quotient)
      1000 ) 1001010     (dividend)
            -1000
                10
                101
                1010
               -1000
                  10     (remainder)
• n-bit operands yield n-bit quotient and remainder
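The restoring scheme described above can be sketched for unsigned 32-bit operands as follows. The names `DivResult` and `restoring_divide` are invented for the example, and the divide-by-zero check is an `assert` here, whereas on MIPS it is left to software.

```cpp
#include <cassert>
#include <cstdint>

// Quotient and remainder of an unsigned division (hypothetical helper type).
struct DivResult {
    uint32_t quotient;
    uint32_t remainder;
};

// Restoring division: one quotient bit per step, most-significant bit first.
DivResult restoring_divide(uint32_t dividend, uint32_t divisor) {
    assert(divisor != 0);    // check for 0 divisor
    int64_t remainder = 0;   // signed and wide, so it can briefly go negative
    uint32_t quotient = 0;
    for (int i = 31; i >= 0; --i) {
        // Bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1u);
        // Do the subtract.
        remainder -= divisor;
        if (remainder < 0) {
            remainder += divisor;      // went negative: add divisor back, quotient bit is 0
        } else {
            quotient |= (1u << i);     // subtraction held: quotient bit is 1
        }
    }
    return {quotient, static_cast<uint32_t>(remainder)};
}
```

For the slide's operands, `restoring_divide(74, 8)` yields quotient 9 (1001) and remainder 2 (10).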

Faster Division
• Can't use parallel hardware as in multiplier
    Subtraction is conditional on sign of remainder
• Faster dividers (e.g. SRT division) generate multiple quotient bits per step
    Still require multiple steps
• Customized implementations for high performance, e.g., supercomputers

MIPS Division
• Use HI/LO registers for result
    HI: 32-bit remainder
    LO: 32-bit quotient
• Instructions
    div rs, rt / divu rs, rt
    No overflow or divide-by-0 checking
      o Software must perform checks if required
    Use mfhi, mflo to access result
Study Exercise: Check out signed and unsigned division with QtSPIM

ISA View
• Additional function units and registers (Hi/Lo)
• Additional instructions to move data to/from these registers
    mfhi, mflo
• What other instructions would you add? Cost?
[Figure: CPU/Core containing registers $0, $1, ..., $31, the ALU, and a Multiply/Divide unit with the Hi and Lo registers.]

Floating Point (3.5)
• Representation for non-integral numbers
    Including very small and very large numbers
• Like scientific notation
    $-2.34 \times 10^{56}$ (normalized)
    $+0.002 \times 10^{-4}$ (not normalized)
    $+987.02 \times 10^{9}$ (not normalized)
• In binary
    $\pm 1.xxxxxxx_2 \times 2^{yyyy}$
• Types float and double in C

IEEE 754 Floating-point Representation

Single Precision (32-bit)
    bit 31       bits 30-23           bits 22-0
    S (1 bit)    exponent (8 bits)    significand (23 bits)
    value = $(-1)^{sign} \times (1 + fraction) \times 2^{exponent - 127}$

Double Precision (64-bit)
    bit 63       bits 62-52            bits 51-32               bits 31-0
    S (1 bit)    exponent (11 bits)    significand (20 bits)    significand, continued (32 bits)
    value = $(-1)^{sign} \times (1 + fraction) \times 2^{exponent - 1023}$
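To make the single-precision layout concrete, the sketch below pulls the three fields out of a C++ `float` and rebuilds the value with the formula above. It assumes `float` is a 32-bit IEEE 754 type on the target machine, handles only normalized numbers (biased exponent 1 through 254), and uses invented names (`Fields`, `split_single`, `value_from_fields`).

```cpp
#include <cassert>   // handy when checking the example by hand
#include <cmath>
#include <cstdint>
#include <cstring>

// The three fields of a single-precision number (hypothetical helper type).
struct Fields {
    uint32_t sign;       // 1 bit
    uint32_t exponent;   // 8 bits, biased by 127
    uint32_t fraction;   // 23 bits
};

// Extracts the sign, exponent, and fraction fields from a float's bit pattern.
Fields split_single(float value) {
    static_assert(sizeof(float) == sizeof(uint32_t), "float is assumed to be 32 bits");
    uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));   // copy the raw bits without aliasing issues
    return {bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Rebuilds (-1)^sign * (1 + fraction) * 2^(exponent - 127) for a normalized number.
double value_from_fields(Fields f) {
    double fraction = static_cast<double>(f.fraction) / 8388608.0;   // divide by 2^23
    double magnitude = (1.0 + fraction) * std::ldexp(1.0, static_cast<int>(f.exponent) - 127);
    return f.sign != 0 ? -magnitude : magnitude;
}
```

For instance, –0.75 is –1.5 × 2^–1, so its fields are sign 1, exponent 126, and a fraction whose top bit is set.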

Floating Point Standard
• Defined by IEEE Std 754-1985
• Developed in response to divergence of representations
    Portability issues for scientific code
• Now almost universally adopted
• Two representations
    Single precision (32-bit)
    Double precision (64-bit)

FP Adder Hardware
• Much more complex than integer adder
• Doing it in one clock cycle would take too long
    Much longer than integer operations
    Slower clock would penalize all instructions
• FP adder usually takes several cycles
    Can be pipelined
Example: FP Addition

FP Adder Hardware
[Figure: FP adder datapath annotated with Step 1, Step 2, Step 3, and Step 4.]

FP Arithmetic Hardware
• FP multiplier is of similar complexity to FP adder
    But uses a multiplier for significands instead of an adder
• FP arithmetic hardware usually does
    Addition, subtraction, multiplication, division, reciprocal, square-root
    FP ↔ integer conversion
• Operations usually take several cycles
    Can be pipelined

ISA Impact
• FP hardware is coprocessor 1
    Adjunct processor that extends the ISA
• Separate FP registers
    32 single-precision: $f0, $f1, … $f31
    Paired for double-precision: $f0/$f1, $f2/$f3, …
      o Release 2 of MIPS ISA supports 32 × 64-bit FP reg's
• FP instructions operate only on FP registers
    Programs generally do not perform integer ops on FP data, or vice versa
    More registers with minimal code-size impact

ISA View: The Co-Processor
• Floating point operations access a separate set of 32-bit registers
    Pairs of 32-bit registers are used for double precision
[Figure: CPU/Core (registers $0, $1, ..., $31, ALU, Multiply/Divide unit with Hi and Lo); Co-Processor 1 (registers $0, $1, ..., $31 and the FP ALU); Co-Processor 0 (BadVaddr, Status, Cause, EPC; covered later).]

ISA View
• Distinct instructions operate on the floating point registers (pg. A-73)
    Arithmetic instructions
      o add.d fd, fs, ft (double precision), and add.s fd, fs, ft (single precision)
• Data movement to/from floating point coprocessors
    mfc1 rt, fs and mtc1 rd, fs
• Note that the ISA design implementation is extensible via co-processors
• FP load and store instructions
    lwc1, ldc1, swc1, sdc1
      o e.g., ldc1 $f8, 32($sp)
Example: DP Mean

Associativity
• Floating point arithmetic is not associative
• Parallel programs may interleave operations in unexpected orders
    Assumptions of associativity may fail

    x = -1.50E+38, y = 1.50E+38, z = 1.0

                      (x+y)+z           x+(y+z)
    intermediate      x+y = 0.00E+00    y+z = 1.50E+38
    result            1.00E+00          0.00E+00

• Need to validate parallel programs under varying degrees of parallelism
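The table's values can be reproduced with single-precision arithmetic in C++. This is a small sketch with invented names (`sum_left`, `sum_right`); it assumes IEEE 754 `float` arithmetic and a compiler that is not allowed to reorder floating point operations (for example, no fast-math style flags).

```cpp
#include <cassert>   // handy when checking the example by hand

// Computes (x + y) + z, storing the intermediate sum as a float.
float sum_left(float x, float y, float z) {
    float t = x + y;
    return t + z;
}

// Computes x + (y + z), storing the intermediate sum as a float.
float sum_right(float x, float y, float z) {
    float t = y + z;
    return x + t;
}
```

With x = -1.5e38f, y = 1.5e38f, and z = 1.0f, `sum_left` gives 1.0 because x and y cancel first, while `sum_right` gives 0.0 because adding 1.0 to 1.5e38 is lost to rounding before x is added.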

Performance Issues
• Latency of instructions
    Integer instructions can take a single cycle
    Floating point instructions can take multiple cycles
    Some (FP Divide) can take hundreds of cycles
• What about energy? (we will get to that shortly)
• What other instructions would you like in hardware?
    Would some applications change your mind?
• How do you decide whether to add new instructions?

Characterizing Parallelism
• Characterization due to M. Flynn*

                                 Data Streams
                                 Single    Multiple
    Instruction     Single       SISD      SIMD
    Streams         Multiple     MISD      MIMD

    SISD: Today's serial computing cores (von Neumann model)
    SIMD: Single instruction multiple data stream computing, e.g., SSE
    MIMD: Today's Multicore

*M. Flynn, (September 1972). "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers, C–21 (9): 948–960


Parallelism Categories

From http://en.wikipedia.org/wiki/Flynn%27s_taxonomy

Multimedia (3.6, 3.7, 3.8)
• Lower dynamic range and precision requirements
    Do not need 32-bits!
• Inherent parallelism in the operations

Vector Computation
• Operate on multiple data elements (vectors) at a time
• Flexible definition/use of registers
• Registers hold integers, floats (SP), doubles (DP)
[Figure: a 128-bit register viewed as 1 × 128-bit integer, 2 × 64-bit double precision, 4 × 32-bit single precision, or 8 × 16-bit short integers.]

Processing Vectors
[Figure: memory and vector registers.]
• When is this more efficient?
• When is this not efficient?
• Think of 3D graphics, linear algebra and media processing

Case Study: Intel Streaming SIMD Extensions
• 8 × 128-bit XMM registers
    X86-64 adds 8 more registers XMM8-XMM15
• 8, 16, 32, 64 bit integers (SSE2)
• 32-bit (SP) and 64-bit (DP) floating point
• Signed/unsigned integer operations
• IEEE 754 floating point support
• Reading Assignment:
    http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
    http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I

Instruction Categories
• Floating point instructions
    Arithmetic, movement
    Comparison, shuffling
    Type conversion, bit level
• Integer
• Other
    e.g., cache management
• ISA extensions!
• Advanced Vector Extensions (AVX)
    Successor to SSE
[Figure labels: register, register, memory.]

Arithmetic View
• Graphics and media processing operates on vectors of 8-bit and 16-bit data
    Use 64-bit adder, with partitioned carry chain
      o Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors
    SIMD (single-instruction, multiple-data)
• Saturating operations
    On overflow, result is largest representable value
      o c.f. 2s-complement modulo arithmetic
    E.g., clipping in audio, saturation in video
[Figure: a 64-bit word partitioned as 4x16-bit and as 2x32-bit.]
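Both ideas on this slide, saturating arithmetic and a partitioned carry chain, can be illustrated in scalar C++. This is a sketch with invented names (`saturating_add16`, `wrapping_add16`, `add_4x16`); it clamps negative overflow to the most negative value as well, which is the usual companion to the slide's "largest representable value" rule, and it models the partitioned adder with a loop over lanes rather than real SIMD hardware.

```cpp
#include <cassert>   // handy when checking the example by hand
#include <cstdint>

// Saturating 16-bit add: clamp to the representable range instead of wrapping.
int16_t saturating_add16(int16_t a, int16_t b) {
    int32_t wide = static_cast<int32_t>(a) + static_cast<int32_t>(b);
    if (wide > INT16_MAX) return INT16_MAX;
    if (wide < INT16_MIN) return INT16_MIN;
    return static_cast<int16_t>(wide);
}

// Ordinary 2s-complement modulo add, for comparison.
int16_t wrapping_add16(int16_t a, int16_t b) {
    uint16_t sum = static_cast<uint16_t>(static_cast<uint16_t>(a) + static_cast<uint16_t>(b));
    // Reinterprets the bit pattern on two's complement machines
    // (guaranteed by the language only from C++20 on).
    return static_cast<int16_t>(sum);
}

// Partitioned add: four independent 16-bit lanes packed in one 64-bit word.
// The carry out of each lane is dropped instead of rippling into the next lane.
uint64_t add_4x16(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        int shift = 16 * lane;
        uint64_t sum = ((a >> shift) & 0xFFFFu) + ((b >> shift) & 0xFFFFu);
        result |= (sum & 0xFFFFu) << shift;
    }
    return result;
}
```

Adding 30000 and 10000 shows the difference: the saturating version clips at 32767, while the modulo version wraps around to a negative number.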

SSE Example

// A 16-byte = 128-bit vector struct
struct Vector4
{
    float x, y, z, w;
};

// Add two constant vectors and return the resulting vector
// (MSVC-style inline assembly for 32-bit x86)
Vector4 SSE_Add(const Vector4 &Op_A, const Vector4 &Op_B)
{
    Vector4 Ret_Vector;

    __asm
    {
        MOV EAX, Op_A                  // Load pointers into CPU regs
        MOV EBX, Op_B

        MOVUPS XMM0, [EAX]             // Move unaligned vectors to SSE regs
        MOVUPS XMM1, [EBX]

        ADDPS XMM0, XMM1               // Add vector elements
        MOVUPS [Ret_Vector], XMM0      // Save the return vector
    }
    return Ret_Vector;
}

From http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I

More complex example (matrix multiply) in Section 3.8, using AVX

Intel Xeon Phi
[Figures from www.anandtech.com and www.techpowerup.com]

Data Parallel vs. Traditional Vector
[Figure: Vector Architecture, with Vector Register A, Vector Register B, a pipelined functional unit, and Vector Register C. Data Parallel Architecture, with registers and a grid of squares.]
• Process each square in parallel: data parallel computation

ISA View
• Separate core data path
• Can be viewed as a co-processor with a distinct set of instructions
[Figure: CPU/Core (registers $0, $1, ..., $31, ALU, Multiply/Divide unit with Hi and Lo) alongside the SIMD Registers XMM0, XMM1, ..., XMM15 and a Vector ALU.]

Domain Impact on the ISA: Example

Scientific Computing
• Floats
• Double precision
• Massive data
• Power constrained

Embedded Systems
• Integers
• Lower precision
• Streaming data
• Security support
• Energy constrained

Summary
• ISAs support operations required of application domains
    Note the differences between embedded and supercomputers!
    Signed, unsigned, FP, SIMD, etc.
• Bounded precision effects
    Software must be careful how hardware is used, e.g., associativity
    Need standards to promote portability
• Avoid "kitchen sink" designs
    There is no free lunch
    Impact on speed and energy (we will get to this later)

Study Guide
• Perform 2's complement addition and subtraction (review)
• Add a few more instructions to the simple ALU
    Add an XOR instruction
    Add an instruction that returns the max of its inputs
    Make sure all control signals are accounted for
• Convert real numbers to single precision floating point (review) and extract the value from an encoded single precision number (review)
• Execute the SPIM programs (class website) that use floating point numbers. Study the memory/register contents via single step execution

Study Guide (cont.)
• Write a few simple SPIM programs for
    Multiplication/division of signed and unsigned numbers
      o Use numbers that produce >32-bit results
      o Move to/from HI and LO registers (find the instructions for doing so)
    Addition/subtraction of floating point numbers
• Try to write a simple SPIM program that demonstrates that floating point operations are not associative (this takes some thought and review of the range of floating point numbers)
• Look up additional SIMD instruction sets and compare
    ARM NEON, AltiVec, AMD 3DNow!

Glossary
• Co-processor
• Data parallelism
• Data parallel computation vs. vector computation
• Instruction set extensions
• Overflow
• MIMD
• Precision
• SIMD
• Saturating arithmetic
• Signed arithmetic support
• Unsigned arithmetic support
• Vector processing