
Institute of Applied Microelectronics & Computer Engineering

Selected Topics of VLSI Design

Prof. Dr.-Ing. Dirk [email protected]

Please note this name change

3/31/2019 Selected Topics of VLSI Design 2

Module        Advanced VLSI Design     Selected Topics of VLSI Design

Until 2016
Short name    "Chip project"           "HW-Alg."
Semester      summer                   winter
SWS           1                        1/1/1
Content       hardware algorithms      VLSI chip project

Starting 2017
Short name    "Chip project"           "HW-Alg."
Semester      winter                   summer
SWS           1                        1/1/1
ECTS          6                        6
Content       VLSI chip project        hardware algorithms

Organization

● Lecture: Hardware oriented arithmetic algorithms and cryptography
● Exercise: Algorithms, building blocks, VHDL coding
● Lab: during project week
● Schedule:
  o lecture, Monday 15:xx – 16:yy
  o exercise, replaces lectures beginning with xx.y.
  o mandatory lab with attendance list: 11.6.-12.6. 9:00
● Location: Warnemuende, building 1, R 1226

Literature

Textbooks
● Parhami, B.: Computer Arithmetic, Algorithms and Hardware Designs, 2nd edition, Oxford University Press, New York, 2010
● Koren, I.: Computer Arithmetic Algorithms, 2002
● Muller, J.M.: Elementary Functions, Algorithms and Implementation, 2nd ed., 2006
● Klar, H., Noll, T.: Integrierte Digitale Schaltungen, Springer, 2015, free access from URO network
● Pirsch, P.: Architekturen der digitalen Signalverarbeitung, B.G. Teubner, Stuttgart, 1996

Courses and Websites
● Koren, I.: Computer Arithmetic Simulator
● Ercegovac, M.: Course Digital Arithmetic
● Guyot, A.: Educational Applets
● Strey, A.: Course Computer-Arithmetik



Part 1: Number Systems


Outline

● 1.1 Positional / Place-Value Notation of Numbers
  o Representation of Integer Numbers, Real Numbers and Radix Selection
● 1.2 Signed Number Representations
  o Sign Magnitude, (r-1)-Complement, r-Complement and Redundant Binary
● 1.3 Rounding
  o via Truncation, Round-to-Nearest and Round-to-Nearest-Even
● 1.4 Overflows
  o in (r-1)-Complement, Carry-Save and Signed Redundant Binary Numbers
  o Overflow Detection and Handling
● 1.5 Basic Operations
● 1.6 Cost/Performance Estimation Basics


1.1 Positional / Place-Value Notation of Numbers

● The number A is represented by n digits $a_i$ and a defined base/radix r: $A = (a_{n-1} a_{n-2} \dots a_1 a_0)_r$
  o Binary → r = 2
  o Ternary → r = 3
  o Octal → r = 8
  o Decimal → r = 10
  o Hexadecimal → r = 16
● The value V(A) of the number A is given by the sum of the n partial products $p_i$, one for each of its positions
● The partial product $p_i = a_i \cdot r^i$ results from multiplying the digit $a_i$ by its weight $r^i$, which is a power of the radix r determined by the position index i
● Integer number A with n digits → value $V(A) = \sum_{i=0}^{n-1} a_i \cdot r^i$
● A positive integer number A has a range of: $0 \le V(A) \le r^n - 1$

● Real numbers contain n digits for the integer part and m digits for the fractional part
● A positive real number A has a range of: $0 \le V(A) \le r^n - r^{-m}$
● Real number A with n+m digits → value $V(A) = \sum_{i=-m}^{n-1} a_i \cdot r^i$

● In computation, two formats exist for the approximation of real numbers
● Fixed point numbers → the number of significant digits before and after the decimal point is fixed (as seen above)
  o the decimal point is fixed and never explicitly represented in hardware
  o its position is defined during design and must be known to interpret the number
● Floating point numbers → the number of significant digits before and after the decimal point depends on the exponent: $V(A) = (-1)^{S} \cdot \text{mantissa} \cdot 2^{\text{exponent}}$
● Floating point numbers in modern computers → IEEE 754 standard
  o Binary half precision → 16 bit data words (= 1 sign + 10 fraction + 5 exponent bit)
  o Binary single precision → 32 bit data words (= 1 + 23 + 8 bit)
  o Binary double precision → 64 bit data words (= 1 + 52 + 11 bit)
  o …

● Radix Selection (cont'd from fixed point numbers)
  o Computations are performed in circuits
  o Binary representation (r = 2) is the best representation for physical signal levels in most logic
    Voltage U: { 0 , 1 } → { VSS , VDD }
    Current I: { 0 , 1 } → { IMIN , IMAX }
● Efficiency: How many bits do we need in a bit-oriented (r = 2) memory to store a positive number V?
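The efficiency question can be answered from the integer range $0 \le V \le 2^n - 1$ given above: the smallest sufficient word length is $n = \lceil \log_2(V + 1) \rceil$. The following Python sketch illustrates this; the function name `bits_needed` is hypothetical, and reserving at least one bit for V = 0 is an assumption of this sketch.

```python
import math


def bits_needed(v: int) -> int:
    """Smallest n with 0 <= v <= 2**n - 1 (assumption: at least one bit, so 0 is storable)."""
    if v < 0:
        raise ValueError("only non-negative values are covered here")
    # int.bit_length() equals ceil(log2(v + 1)) for v >= 0, without floating point error.
    return max(1, v.bit_length())


for v in (1, 255, 256, 1000):
    # Compare the exact integer result with the closed-form expression.
    print(v, bits_needed(v), math.ceil(math.log2(v + 1)))
```

The integer method avoids rounding problems that `math.log2` can show for very large values.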



1.2 Signed Number Representations

● Positional notation as discussed above only covers positive numbers
● For negative numbers, different signed number representation (SNR) options exist
● SNR #1: Sign Magnitude (SM)
  o Insert sign bit at $a_{n-1}$ before the magnitude of the number
  o Positive number → $a_{n-1} = 0$ and negative number → $a_{n-1} = 1$
  $+A = (0\; a_{n-2} \dots a_1 a_0)_r$
  $-A = ((r-1)\; a_{n-2} \dots a_1 a_0)_r$ (sign digit r − 1 = 1 for r = 2)
● A signed integer number A has a range of: $-(r^{n-1} - 1) \le V(A) \le r^{n-1} - 1$
● + Symmetrical range
● - Double representation of zero, requires different treatment of positive and negative numbers in arithmetic circuits

● SNR #2: (r−1)-complement → 1's complement for r = 2
  o A negative number results from complementing each digit $a_i$ according to: $\bar{a}_i = (r-1) - a_i$
  o In a binary representation (r = 2) this procedure equals a bitwise inversion ("bit flipping"): 01010101 → 10101010
  $+A = (0\; a_{n-2} \dots a_1 a_0)_r$
  $-A = \bar{A} = ((r-1)\; \bar{a}_{n-2} \dots \bar{a}_1 \bar{a}_0)_r$
● A signed integer number A has a range of: $-(r^{n-1} - 1) \le V(A) \le r^{n-1} - 1$
● Same pros and cons as SM

● SNR #3: r-complement → 2's complement for r = 2
  o Start with the (r−1)-complement and add 1 to the Least Significant Digit (LSD)
  o The binary format (r = 2) is most commonly used in digital circuits
  $+A = (0\; a_{n-2} \dots a_1 a_0)_r$
  $-A = \bar{A} + 1 = ((r-1)\; \bar{a}_{n-2} \dots \bar{a}_1 \bar{a}_0)_r + 1$
● A signed integer number A has a range of: $-r^{n-1} \le V(A) \le r^{n-1} - 1$
● + Identical treatment of positive and negative numbers in arithmetic circuits, e.g., adders; unique representation of zero
● - Asymmetrical range

● SNR #4: Redundant Representations (RR)
  o Allow multiple (redundant) representations for the same number value V(A)
  o Also true for SM and (r−1)-complement due to the double zero representation, but typically RR means the following:
● RR #1: Signed Digit Representation (SD)
  o In SD numbers each digit has its own sign → one extra bit per digit
  o Digit set: $a_i \in \{-\alpha, \dots, -1, 0, 1, \dots, \beta\}$
  o α and β must cover at least half of the interval defined by the radix
● SD number system is symmetrical for α = β, else asymmetrical
● Maximum or minimum redundancy for symmetrical SD number system:
  o Maximum redundancy → $\alpha = r - 1$
  o Minimum redundancy → smallest α that still satisfies the bound above (α = 2 for r = 4, see examples)

● Examples:



Radix r   Digit values a_i
2         {-1, 0, 1}
3         {-2, -1, 0, 1}
          {-1, 0, 1, 2}
          {-2, -1, 0, 1, 2}
4         {-2, -1, 0, 1, 2}            minimum redundancy
          {-3, -2, -1, 0, 1}           not allowed! α, β bounds violated
          {-3, -2, -1, 0, 1, 2, 3}     maximum redundancy

● Only SD numbers with r = 2 (redundant binary (RB) numbers) are considered in the following sections → $a_i \in \{-1, 0, 1\}$, where $\bar{1}$ denotes the digit −1
● An SD number with n digits $a_i \in \{-1, 0, 1\}$ has $3^n$ different representations, but only $2^{n+1} - 1$ different values can be represented
  o Example: $3_{10} = 011 = 10\bar{1} = 1\bar{1}1$

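● Question: Which RB representation contains the smallest number of non-zeros ('1' or '-1')?
  o Answer: use arithmetic conversions of non-zero bit-strings
    Example: $\dots 001111 \dots 111000 \dots \;\to\; \dots 010000 \dots 00\bar{1}000 \dots$ ($\bar{1}$ = −1)
  For a string of ones from position j up to position k−1:
  $V(A) = 2^{k-1} + 2^{k-2} + \dots + 2^{j+1} + 2^{j} = 2^{k} - 2^{j}$
  and accordingly for a string of −1 digits:
  $V(A) = -2^{k-1} - 2^{k-2} - \dots - 2^{j} = -2^{k} + 2^{j}$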

● Such RBRs are called Canonical Signed Digits (CSD) and the conversion strategy is called CSD recoding
● Definition: A CSD recoded number is an n digit SD number that has a minimum number of non-zeros ('1' and '-1') and no adjacent non-zero digits
  $\sum_{i=0}^{n-1} |a_i| \overset{!}{=} \min \quad \text{with} \quad a_i \cdot a_{i-1} \overset{!}{=} 0 \ \text{for} \ 1 \le i \le n-1$
● CSD recoding operates as an iterative, sequential algorithm. Step by step the number is parsed from the least to the most significant digit/bit ("right to left") to detect strings of adjacent non-zeros, which are converted immediately. The algorithm terminates when the condition $a_i \cdot a_{i-1} = 0$ is met for all digits.

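Example 1 ($\bar{1}$ = −1):
  $366_{10} = 0001\;0110\;1110$
  $= 0001\;0111\;00\bar{1}0$
  $= 0001\;100\bar{1}\;00\bar{1}0$
  CSD $= 0010\;\bar{1}00\bar{1}\;00\bar{1}0$

Example 2 (12 bit 2's complement input):
  $-213_{10} = 1111\;0010\;1011$
  $= 1111\;0010\;110\bar{1}$
  $= 1111\;0011\;0\bar{1}0\bar{1}$
  $= 1111\;010\bar{1}\;0\bar{1}0\bar{1}$
  CSD $= 000\bar{1}\;010\bar{1}\;0\bar{1}0\bar{1}$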
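The right-to-left recoding above can be sketched in Python. This is a minimal illustration, not the circuit-level procedure of the lecture: it uses the arithmetic rule "emit +1 if the remaining value is 1 mod 4, emit −1 if it is 3 mod 4", which produces the same canonical digits. The names `csd_recode` and `sd_value` are hypothetical.

```python
def csd_recode(value: int, n: int) -> list[int]:
    """Return n signed digits (MSB first, each -1/0/+1) of the CSD form of value."""
    digits_lsb_first = []
    x = value
    for _ in range(n):
        if x & 1:
            # x mod 4 == 1: isolated one, emit +1.
            # x mod 4 == 3: start of a string of ones, emit -1 and carry upwards.
            d = 2 - (x & 3)
            x -= d
        else:
            d = 0
        digits_lsb_first.append(d)
        x >>= 1
    if x != 0:
        raise ValueError("value does not fit into n signed digits")
    return digits_lsb_first[::-1]


def sd_value(digits: list[int]) -> int:
    """Evaluate a radix-2 signed digit string given MSB first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


print(csd_recode(366, 12))
print(csd_recode(-213, 12))
```

Both calls reproduce the CSD strings of the two examples above (with −1 printed instead of the overlined digit).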

● Lookup-table for CSD recoding
  o possible carries from lower positions must be considered ($c_{i+1} \to c_i$ in the next step)

  Binary number            CSD recoded SD
  a_{i+1}  a_i  c_i        a_i*  c_{i+1}    Comment
  0        0    0           0    0          String of zeros
  0        1    0           1    0          Singular non-zero
  1        0    0           0    0          String of zeros
  1        1    0          -1    1          Begin of non-zero string
  0        0    1           1    0          End of non-zero string
  0        1    1           0    1          String of non-zeros
  1        0    1          -1    1          Singular zero
  1        1    1           0    1          String of non-zeros

  o CSD recoding yields minimum | average | maximum # of non-zeros: 0 (trivial!) | ~n/3 | (n+1)/2

● The number-dependent, variable timing and the sequential nature of CSD recoding prohibit its efficient application at run-time. But it is excellent for recoding constant values or coefficients at design-time.
  o Each eliminated non-zero saves hardware and speeds up specific arithmetic circuits (e.g. multipliers)
● Alternatively, the parallel algorithms Booth and modified Booth work faster for non-zero recoding at run-time.
  o However, the Booth algorithm does not find the minimal form in each case (see example)
  o Isolated non-zeros "010" are not considered by this version

  Binary number       Booth recoded SD (a_{-1} = 0)
  a_i  a_{i-1}        a_i*     Comment
  0    0               0       String of zeros
  1    1               0       String of non-zeros
  1    0              -1       Begin of non-zero string
  0    1               1       End of non-zero string

● Modified Booth improves on the standard Booth algorithm by overlapped scanning of 3 bit strings
  o By considering isolated non-zeros "010" the maximum number of non-zeros after conversion is n/2 (for even n)

  Binary number              Modified Booth recoded SD ( i = 1,3,5,… )
                             r = 2               r = 4
  a_i  a_{i-1}  a_{i-2}      a_i*  a_{i-1}*      digit    Comment
  0    0        0             0     0             0       String of zeros
  0    1        0             0     1             1       Single non-zero
  1    0        0            -1     0            -2       Begin of non-zero string
  1    1        0             0    -1            -1       Begin of non-zero string
  0    0        1             0     1             1       End of non-zero string
  0    1        1             1     0             2       End of non-zero string
  1    0        1             0    -1            -1       Single zero
  1    1        1             0     0             0       String of non-zeros
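Both Booth tables can be written as one-line digit formulas: standard Booth uses $a_i^* = a_{i-1} - a_i$, and the radix-4 digit of modified Booth is $-2a_i + a_{i-1} + a_{i-2}$ for i = 1, 3, 5, …. The Python sketch below assumes the input is given as an n-bit 2's complement bit pattern (a non-negative integer) with n even; the function names are hypothetical.

```python
def bit(pattern: int, i: int) -> int:
    """Bit i of an unsigned bit pattern; positions below 0 read as 0 (a_{-1} = 0)."""
    return 0 if i < 0 else (pattern >> i) & 1


def booth_radix2(pattern: int, n: int) -> list[int]:
    """Standard Booth recoding, digits in {-1, 0, 1}, MSB first."""
    return [bit(pattern, i - 1) - bit(pattern, i) for i in range(n - 1, -1, -1)]


def booth_radix4(pattern: int, n: int) -> list[int]:
    """Modified Booth recoding, n/2 digits in {-2, ..., 2}, MSB first (n even)."""
    return [
        -2 * bit(pattern, i) + bit(pattern, i - 1) + bit(pattern, i - 2)
        for i in range(n - 1, 0, -2)
    ]


def digits_value(digits: list[int], radix: int) -> int:
    """Evaluate a signed digit string given MSB first."""
    value = 0
    for d in digits:
        value = radix * value + d
    return value


print(booth_radix2(0b0101, 4))
print(booth_radix4(0b000101101110, 12))
```

For negative inputs the same functions apply to the 2's complement pattern, for example `(-213) & 0xFFF` for a 12 bit word.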

● Example: Modified Booth recoding for a 12 digit number (n = 12)
  o Result is not a CSD, but acceptable!

  Binary:  0 0 0 1 0 1 1 0 1 1 1 0 (0)  = 366_10
  r = 2:   0 0 0 1 1 0 0 -1 0 0 -1 0
  r = 4:   0 1 2 -1 0 -2

● Comparison of recoding methods

  Method           Algorithm    # of non-zeros (Min | Average | Max)    Comment
  CSD              Sequential   0 | ~n/3 | (n+1)/2                      Yields minimum # of non-zeros for constant values at design-time
  Booth            Parallel     0 | ? | ~n                              1's Complement to Signed Digit
  Modified Booth   Parallel     0 | ? | ~(n+1)/2                        for run-time recoding (multiplier), potential to save half of the chip area

● Conversion from 2's complement to SD numbers ($A_r \to A_{SD}$)
  $A_r = (a_{n-1}\, a_{n-2} \dots a_1 a_0) \;\to\; A_{SD} = ((-a_{n-1})\, a_{n-2} \dots a_1 a_0)$
  o Fast: can be done in parallel within one gate delay
  o $5_{10} = 0101_2 \to 0101_{SD2}$
  o $-5_{10} = 1011_2 \to \bar{1}011_{SD2}$ ($\bar{1}$ = −1)
● Conversion from SD numbers to 2's complement ($A_{SD} \to A_r$)
  o Split the SD number into a positive and a negative fraction: $A_{SD} \to D^+$ and $D^-$
  o $-13_{10} = 0\bar{1}01\bar{1}1_{SD} \to D^+ = 000101$ and $D^- = 010010$
  o 2's complement number $A_r = D^+ - D^-$
  o Slow: requires one n-bit addition, i.e. the run-time of an adder circuit
● Conversion from 2's complement to SD numbers ($A_r \to A_{SD}$) via recoding?
  o For positive numbers → $A_{SD} = A_r$
  o For negative numbers → not this easy!
  o A general method → Booth recoding! $A_{SD} = f_{Booth}(A_r)$
  o Example: $5_{10} = 0101(0)$ in 2's complement → $1\bar{1}1\bar{1}$ in signed digits
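The two conversion directions can be illustrated with a short Python sketch. It assumes the 2's complement operand is passed as an n-bit pattern and the SD number as a list of digits (MSB first); both function names are hypothetical.

```python
def twos_to_sd(pattern: int, n: int) -> list[int]:
    """A_r -> A_SD: only the sign digit changes, it becomes -a_{n-1} (MSB first)."""
    bits = [(pattern >> i) & 1 for i in range(n - 1, -1, -1)]
    return [-bits[0]] + bits[1:]


def sd_to_twos(digits: list[int]) -> tuple[int, int, int]:
    """A_SD -> A_r: split into D+ and D-, then perform one subtraction."""
    n = len(digits)
    d_plus = sum(1 << (n - 1 - k) for k, d in enumerate(digits) if d == 1)
    d_minus = sum(1 << (n - 1 - k) for k, d in enumerate(digits) if d == -1)
    # The subtraction is the slow part in hardware (one n-bit carry-propagate addition).
    return d_plus, d_minus, d_plus - d_minus


print(twos_to_sd(0b1011, 4))                 # SD digits of -5
print(sd_to_twos([0, -1, 0, 1, -1, 1]))      # D+, D- and value of the -13 example
```

The first direction needs no arithmetic at all, which mirrors the "one gate delay" remark above; the second one contains the only carry-propagating step.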

● SD numbers in binary hardware
  o Three possible values per digit $a_i \in \{-1, 0, 1\}$ require two bits for each digit → hardware costs (wires, registers, ALUs) double
  o Two bits per digit allow four different encodings, but two of them are typically used → Sign Value & Negative Positive

  a_i     Sign Value (SV)     Negative Positive (NP)
          S  V                N  P
  -1      1  1                1  0
   0      0  0                0  0
   1      0  1                0  1

  o SV: intuitive because of its Sign Magnitude representation
  o NP: easier $A_{SD} \to A_r$ conversion as $D^+ = P_{n-1} \dots P_0$ and $D^- = N_{n-1} \dots N_0$

● RR #2: Carry-Save Representation (CS)
  o Carry-Save numbers originate from hardware structures of full adders (FAs) and half adders (HAs)
  o Digit $a_i$ represents a tuple $a_i = (c_i, s_i)$ of a carry bit and a sum bit, with the adder outputs weighted as $2 \cdot c + s$
  o CS numbers are stored as a combination of a carry vector and an intermediate sum vector
● Additions with CS numbers only have a critical path of one half adder, but require 2 bits per digit for storage and communication (wires)
● $V(A) = V(c_n c_{n-1} c_{n-2} \dots c_1 c_0) + V(s_{n-1} s_{n-2} \dots s_1 s_0)$
● In general, there is no difference between CS and SD numbers
  o CS numbers result from the outputs of a half adder (HA)
  o SD numbers have their origin in the theory of number representations

● Why should RRs be applied, or when is it worth using them?

  Pros
  - Carry-free and thus faster addition / subtraction (see adder section)
  - Arithmetic algorithms based on adders (nearly all) benefit from this

  Cons
  - More resources
  - Comparison operations (≥, ≤, <, =, >) are slow due to ASD → Ar conversion
  - ASD → Ar conversion is slow due to the adder

  Figure: processing chain Ar → ASD (T ~ O(1)), Operation 1 … Operation k (carry-free operations, T_op,i ≠ f(n)), ASD → Ar (T_add ~ O(log2(n)))

1.3 Rounding

● Rounding trims numbers into formats with fewer digits
  o Examples
    Two n bit numbers are multiplied and the result will be a number with 2n bits, but hardware only captures m < 2n bits
    Rounding after right shift by one digit of an integer
● Rounding methods can be classified as follows:
  o Accuracy of the final results (or information loss by rounding)
  o Numerical error characteristics of the rounding method
  o Cost/effort/delay to perform the rounding
● Assume
  o Given: $A = a_{n-1} a_{n-2} \dots a_1 a_0 \,.\, a_{-1} \dots a_{-d}$ → cut d bits
  o Rounded: $B = b_{n-1} b_{n-2} \dots b_1 b_0 = A + \varepsilon \Rightarrow \varepsilon = B - A$
  o Goal: minimize rounding error $\varepsilon$


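● Rounding Method #1: Truncation
  o Step 1: the d least significant bits are cut off from A
  o Rounding result $B = a_{n-1} a_{n-2} \dots a_1 a_0$
  o Minimum error $|\varepsilon| = 0.00000_2$
  o Maximum error $|\varepsilon| = 1 - 2^{-d} = 0.11111_2$ (for d = 5)
  o Average error $|\varepsilon| \approx 0.1_2$
  o Asymmetrical bias

  Figure: truncation, rounded value B (ticks 1 to 4) over input A (ticks 1 to 5), position −(d+1) marked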

● Rounding Method #2: Round-to-Nearest
  o Step 1: Addition of $0.5_{10}$ to A ⇒ $A' = A + 0.5_{10} = A + 0.1_2$
  o Step 2: the d least significant bits are cut off from $A'$ to fit B
  o The resulting effect is an alternate rounding to higher and lower numbers
  o Rounding result $B = a'_{n-1} a'_{n-2} \dots a'_1 a'_0$
  o Minimum error $|\varepsilon| = 0.00000$ (for A = 0.0 → B = 0.0)
  o Maximum error $|\varepsilon| = 2^{-1} = 0.1_2$ (for A = 0.1 → B = 1.0)
  o Average error $|\varepsilon| = 2^{-2} = 0.01_2$
  o Smaller asymmetrical bias (due to always rounding up A = 0.1)
  o Can often be incorporated effortlessly into the previous operation

  Figure: round-to-nearest, rounded value B (ticks 1 to 4) over input A (ticks 1 to 5)

● Rounding Method #3: Round-to-Nearest-Even
  o Idea: alternate rounding up and down to the nearest even number
  o Step 1: Addition of $0.5_{10}$ to A ⇒ $A' = A + 0.5_{10} = A + 0.1_2$
  o Step 2: if the d least significant bits of $A'$ are zero, cut them off from $A'$ to fit B and set $a'_0$ to zero, otherwise proceed with Round-to-Nearest
  o Yields an average bias of zero!
  o $B = B_{RN}$ if $a'_{-1} \dots a'_{-d} \ne 0.000\dots$, else $B = a'_{n-1} a'_{n-2} \dots a'_1 0$
  o bias = 0
  o Symmetrical error and bias-free, mandatory in IEEE Floating Point

  Figure: round-to-nearest-even, rounded value B (ticks 1 to 4) over input A (ticks 1 to 5)
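The three rounding methods can be compared with a small Python sketch on fixed point values. It assumes that A is stored as an integer scaled by $2^d$ (so the d lowest bits are the fractional bits to cut) and that d ≥ 1; the function names are hypothetical.

```python
def truncate(a: int, d: int) -> int:
    """Method #1: cut off the d least significant bits."""
    return a >> d


def round_nearest(a: int, d: int) -> int:
    """Method #2: add 0.5 (one half LSB of the result), then cut off d bits."""
    return (a + (1 << (d - 1))) >> d


def round_nearest_even(a: int, d: int) -> int:
    """Method #3: like method #2, but on a tie force the result LSB to zero."""
    a_prime = a + (1 << (d - 1))
    b = a_prime >> d
    if a_prime & ((1 << d) - 1) == 0:
        # All cut-off bits of A' are zero: A was exactly halfway between two results.
        b &= ~1
    return b


d = 1
for a in (0b001, 0b011, 0b101, 0b111):  # 0.5, 1.5, 2.5, 3.5 with one fractional bit
    print(a / 2, truncate(a, d), round_nearest(a, d), round_nearest_even(a, d))
```

On the tie cases printed here, round-to-nearest always rounds up, while round-to-nearest-even alternates between rounding down and up, which is why its average bias vanishes.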

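1.4 Overflow

● Overflow occurs if numbers exceed the available word length in datapaths

  Figure: number ranges of the unsigned, 2's complement, 1's complement and sign magnitude formats on a number line from −2^(n−1) over 0 and 2^(n−1) to 2^n (bit patterns 000...0, 011...1, 100...0 and 111...1 marked)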

● Overflow in 2's complement numbers
  o Range: $-2^{n-1} \le V(A) \le 2^{n-1} - 1$
  o Overflow in addition of two numbers
    Reason: the carry out from the sign digit is discarded
    Case 1: two positive summands A and B → negative sum S
      $\bar{a}_{n-1} \wedge \bar{b}_{n-1} \wedge s_{n-1} \Rightarrow c_{in} = 1,\ c_{out} = 0$
    Case 2: two negative summands A and B → positive sum S
      $a_{n-1} \wedge b_{n-1} \wedge \bar{s}_{n-1} \Rightarrow c_{in} = 0,\ c_{out} = 1$
    In general, overflow occurs for $c_{in} \ne c_{out}$ at the sign digit (for add & sub)

  Figure: full adder (FA) at the sign digit with inputs a_{n-1}, b_{n-1}, c_in and outputs s_{n-1}, c_out, followed by saturation logic producing the overflow flag and s*_{n-1}

● In non-redundant number systems overflows are definitely detectable
● Possible actions after overflow detection
  o Emergency stop
  o Error handling
  o Saturation to the maximum (01111) or minimum (10000) number
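The detection rule "carry into the sign digit differs from carry out of the sign digit" and the saturation option can be sketched in Python as follows. The function names are hypothetical, and operands are assumed to be Python integers inside the n-bit 2's complement range.

```python
def add_with_overflow(a: int, b: int, n: int) -> tuple[int, bool]:
    """n-bit 2's complement addition; returns (wrapped sum, overflow flag)."""
    mask = (1 << n) - 1
    low = (1 << (n - 1)) - 1
    a_u, b_u = a & mask, b & mask
    c_in = ((a_u & low) + (b_u & low)) >> (n - 1)   # carry into the sign digit
    total = a_u + b_u
    c_out = total >> n                              # carry out of the sign digit
    s_u = total & mask
    s = s_u - (1 << n) if s_u >> (n - 1) else s_u   # reinterpret as signed value
    return s, c_in != c_out


def saturating_add(a: int, b: int, n: int) -> int:
    """On overflow, clamp to the maximum (011...1) or minimum (100...0) number."""
    s, overflow = add_with_overflow(a, b, n)
    if not overflow:
        return s
    return (1 << (n - 1)) - 1 if a >= 0 else -(1 << (n - 1))


print(add_with_overflow(15, 1, 5), saturating_add(15, 1, 5))
print(add_with_overflow(-16, -1, 5), saturating_add(-16, -1, 5))
```

In hardware the comparison of the two carries is a single XOR gate at the sign position.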

● Overflow in Carry-Save representations
  o In redundant number systems two types of overflow exist → true and pseudo overflow
  o Example: 0.5_10 + (−0.5_10) + 0 = 0 !!!

    −2^0   2^−1   2^−2
     0      1      0       0.5_10
     1      1      0       −0.5_10
     0      0      0       0
     0      1      0       carry bits per column; shifted one position to the left they form the carry vector 1.00 = −1_10
     1      0      0       sum vector = −1_10

  o The wrong intermediate result −2_10 in CS representation would yield the correct value 0 if converted to 2's complement via vector merging addition (VMA) of carry and sum vector
  o Test: 1.00 + 1.00 = 10.00 (leading 1 dropped) → result = 0.00 = 0_10
  o Wrong results are possible if other operations are executed on the intermediate carry and sum vector
  o Example: (0.5_10 + (−0.5_10) + 0) ∙ 0.5_10 = 0
    Multiplication with 0.5 equals a right shift with sign extension of carry and sum vector
    Carry: 1.00 : 2 → 1.10 = −0.5_10
    Sum:   1.00 : 2 → 1.10 = −0.5_10
    VMA:   11.00 = −1_10 ≠ 0_10
  o The error becomes obvious after conversion to a non-redundant number. However, the correct result 0 would fit into the given word length
  o Those pseudo overflows are detectable and correctable as follows


o Pseudo overflow correction for CS numbers: Given: 𝑐 𝑐 . 𝑐 𝑐 …

𝑠 . 𝑠 𝑠 … 𝑠 Modify to: 𝒄𝟎. 𝑐 𝑐 …

𝒔𝟎. 𝑠 𝑠 …

using 𝑐 𝑐 and 𝑠 𝑐 𝑖𝑓 𝑐 𝑐𝑠 𝑒𝑙𝑠𝑒 𝑠 𝑠 ⨂𝑐 ⨂𝑐

XOR gates can be easily integrated as part of the MSD/MSB adder circuit at low hardware overhead without speed penalty!

o Method works as long as the converted 2‘s complement result fits into the given word length

o Example with pseudo overflow correction:



● In general, a reduction of leading digits of CS numbers can be achievedo provided that CS number fits into corresponding 2’s complement

number according to condition -1 ≤ C + S ≤ (1-2-(n-1))● as follows:

Given: 𝑐 … . 𝑐1𝑐 . 𝑐 𝑐 … 𝑠𝑛 … 𝑠 𝑠 . 𝑠 𝑠 …

Modify to: 𝒄𝟎. 𝑐 𝑐 …𝒔𝟎. 𝑠 𝑠 …

using 𝑠 𝑠 𝑖𝑓 𝑠 𝑐𝑠 𝑒𝑙𝑠𝑒 𝑐 𝑐 𝑖𝑓 𝑠 𝑐

𝑐 𝑒𝑙𝑠𝑒

● Pseudo overflow correction needs fewer digits and less chip area than uncorrected formats


● Overflow in SD numbers
  o Similar to the CS case → pseudo and real overflows
  o The overflow behavior depends on the MSD sum digit $s_{n-1}$ and the intermediate carry digit $d_n$
  o Analysis for a possible correction of $s_{n-1}$ as follows:

    d_n    s_{n-1}    Overflow type    s'_{n-1}
     1     -1         pseudo            1
     1      0         potential
     1      1         real
    -1     -1         real
    -1      0         potential
    -1      1         pseudo           -1
     0      X         none             s_{n-1}

  o Pseudo overflow is correctable at the MSD without performance impact
  o Real overflow must be avoided through modification at the system or algorithm level
  o Potential overflow would require an inspection of all lower digits → hardware costs increase
  o Potential overflow is avoidable via range limitation to magnitudes below $2^{n-2}$

● General options/mechanisms for handling of real overflow
  o Analytical analysis to identify minimum/maximum intermediate and final values
  o Corner case simulation of the system to check for sufficient word lengths for any occurring values
  o Thus estimate a lower bound on the word length
  o For insufficient word lengths or if too expensive:
    Reduce accuracy → fewer bits after the decimal point
    Test whether the application allows saturation
    Detect real overflow and handle it


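1.5 Basic Operations

● Wrap-up of some basic operations on data and numbers

  Operation   Number format            Direction   Result
  Shift       unsigned                 left        $a_{n-2} \dots a_1 a_0 0$
                                       right       $0 a_{n-1} a_{n-2} \dots a_1$
              signed 2's complement    left        $a_{n-1} a_{n-3} \dots a_0 0$
                                       right       $a_{n-1} a_{n-1} \dots a_1$
  Rotate                               left        $a_{n-2} \dots a_1 a_0 a_{n-1}$
                                       right       $a_0 a_{n-1} a_{n-2} \dots a_1$
  Extend      unsigned                 left        $0 a_{n-1} a_{n-2} \dots a_1 a_0$
                                       right       $a_{n-1} a_{n-2} \dots a_1 a_0 0$
              signed 2's complement    left        $a_{n-1} a_{n-1} a_{n-2} \dots a_1 a_0$
                                       right       $a_{n-1} a_{n-2} \dots a_1 a_0 0$
  Saturate    unsigned                             $a_{n-1} \dots a_{n-1} a_{n-1}$
              signed 2's complement                $a_{n-1} \bar{a}_{n-1} \dots \bar{a}_{n-1}$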

1.6 Cost/Performance Estimation Basics

● Some "rule of thumb" estimations for delay and area of typical functions and algorithm structures in arithmetic circuits
  o Naming conventions:
    A → Area
    T → Cycle time/delay
    L → Latency
    # → Number of cycles
  o Basic assumptions for gates:
    Inverter, Buffer → A = 0, T = 0 (negligible)
    Simple 2-input gate → A = 1, T = 1 (AND, NAND, OR, NOR)
    Special 2-input gate → A = 2, T = 2 (XOR, XNOR)
    Complex m-input gate → $A = m - 1$, $T = \lceil \log_2 m \rceil$ (gate tree)
    Wiring costs as well as wiring area are not considered (high abstraction)
  o Basic assumptions for circuit functions:
    Up to n inputs $a = (a_{n-1}, a_{n-2}, \dots, a_1, a_0)$
    Up to n outputs $z = (z_{n-1}, z_{n-2}, \dots, z_1, z_0)$
    Blue dots represent functions that generate outputs: $z_i = f(a_{n-1}, a_{n-2}, \dots, a_1, a_0)$


  o Non-recursive functions: $z_i = f(a_i, x)$ with $i = 0, 1, \dots, n-1$ and x constant
    Output $z_i$ only depends on input $a_i$
    Can be implemented as a fully parallel hardware structure
    $A = O(n)$ and $T = O(1)$
  o Recursive functions with single output
    Output depends on all inputs: $z = f(a_{n-1}, a_{n-2}, \dots, a_1, a_0)$
    Case 1: f non-associative → $A = O(n)$ and $T = O(n)$ (serial structure)
    Case 2: f associative → $A = O(n)$ and $T = O(\log n)$ (tree structure)

  Figure: Case 1 (non-associative), serial chain over a_{n-1}, a_{n-2}, ..., a_1, a_0 with output z_{n-1}; Case 2 (associative), tree over a_3, a_2, a_1, a_0 with output z_3

  o Recursive functions with multiple outputs → prefix problem: $z_i = f(a_i, z_{i-1})$
    Case 1: f non-associative → $A = O(n)$ and $T = O(n)$ (serial)
    Case 2: f associative → $A = O(n^2)$ and $T = O(\log n)$ (multi tree, one tree per output)
    Case 3: f associative → $A = O(n \cdot \log n)$ and $T = O(\log n)$ (shared, trees computed in parallel)

  Figure: Case 1 (non-associative), serial chain a_{n-1} ... a_0 with outputs z_{n-1} ... z_0; Case 2 (associative), separate trees for z_3, z_2, z_1, z_0; Case 3 (associative), shared tree for z_3 ... z_0



Part 2: Adders


Outline

● 2.1 Fundamentals
  o Half Adder, Full Adder, (m,k)-Counter
● 2.2 Carry Propagate Adders
  o Ripple Carry, Carry Skip, Carry Select, Conditional Sum, Carry Lookahead, Asynchronous
● 2.3 Non-Carry Adders
  o Carry Save, Redundant Binary
● 2.4 Multi-Operand Adders
  o Matrix Adder, (m:2)-compressor, Adder Trees
● 2.5 Sequential Adders
  o LSB-first, MSB-first, Accumulator
● 2.6 Add-based Operations


2.1 Fundamentals of Adders

● 1 Bit Adder or (m,k)-counter
  o Counting m 1-bit numbers of same magnitude
  o Result: k-bit sum, $k = \lceil \log_2 (m + 1) \rceil$
  o Example: 1 + 1 + 1 = 11_2, 1 + 1 + 1 + 1 = 100_2
● Half Adder or (2,2)-counter
  $a + b = 2 \cdot c_{out} + s$
  Sum: $s = a \oplus b$
  Carry: $c_{out} = a \wedge b$
  Metric: A = 3, T_carry = 1, T_sum = 2

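● Full Adder or (3,2)-counter
  $a + b + c_{in} = 2 \cdot c_{out} + s$
  o Composed of half adders
  o Popular variables:
    $g = a \wedge b$ ; generate $c_{out}$
    $p = a \oplus b$ ; propagate $c_{in}$
    $c^0 = a \wedge b$ ; carry out for $c_{in} = 0$
    $c^1 = a \vee b$ ; carry out for $c_{in} = 1$
    $s = p \oplus c_{in}$
● Full Adder: different ways to calculate s and c_out, the optimal structure depends on technology
  $c_{out} = (a \wedge b) \vee (a \wedge c_{in}) \vee (b \wedge c_{in})$
  $c_{out} = g \vee (p \wedge c_{in})$
  $s = p \oplus c_{in}$
  Figure: full adder block (inputs a, b, c_in; outputs s, c_out) and its realization with g and p signals
● Full Adder with multiplexers
  $s = p \oplus c_{in} = (c^1 \wedge \overline{c^0}) \oplus c_{in}$
  $c_{out} = (\bar{c}_{in} \wedge c^0) \vee (c_{in} \wedge c^1)$
  $c_{out} = (c_{in} \wedge p) \vee (a \wedge \bar{p})$
  o Mux: 2 transmission gates
  Figure: multiplexer-based full adders (select signal c_in or p, data inputs c^0/c^1 or a/c_in)
  Metric: A = 7, T_carry = 2 (c_in → c_out), T_sum = 4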
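The different full adder formulations can be checked against each other with a short Python sketch. It models the three variants named above (majority form, generate/propagate form, multiplexer form with $c^0$ and $c^1$) as bit-level functions; the function names are hypothetical.

```python
from itertools import product


def fa_majority(a: int, b: int, cin: int) -> tuple[int, int]:
    """c_out as majority of the three inputs, s as XOR of all inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def fa_generate_propagate(a: int, b: int, cin: int) -> tuple[int, int]:
    """c_out = g OR (p AND c_in), s = p XOR c_in."""
    g, p = a & b, a ^ b
    return p ^ cin, g | (p & cin)


def fa_mux(a: int, b: int, cin: int) -> tuple[int, int]:
    """c_in selects between the precomputed carries c0 (c_in = 0) and c1 (c_in = 1)."""
    c0, c1 = a & b, a | b
    p = c1 & (c0 ^ 1)          # p = c1 AND NOT c0 = a XOR b
    cout = c1 if cin else c0   # multiplexer
    return p ^ cin, cout


for a, b, cin in product((0, 1), repeat=3):
    print(a, b, cin, fa_majority(a, b, cin), fa_generate_propagate(a, b, cin), fa_mux(a, b, cin))
```

All three variants produce the same truth table; which one is cheapest depends on the gate library, as noted above.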

● (m,k)-counter
  o Addition of m bits
  o Composed of full adders
  o Addition is associative: linear structure → tree structure
  o Reduced critical path
  Figure: (m,k)-counter block with inputs a_0, a_1, ..., a_{m-1} and outputs s_{k-1}, ..., s_0
● Example: (7,3)-counter
  o Linear structure → Metric: A = 28, T = 14
  o Tree structure → Metric: A = 28, T = 10

2.2 Carry Propagate Adders (CPA)

● Addition of 2 n-bit operands A, B and optional c_in using carry propagation
● Sum: non-redundant (n+1)-bit number
  $A + B + c_{in} = S + 2^n \cdot c_{out}$
  $a_i + b_i + c_i = s_i + 2 \cdot c_{i+1}$ for $i = 0, 1, \dots, n-1$
  $c_0 = c_{in}$, $c_n = c_{out}$
● Different methods of carry propagation:
  o Ripple Carry Adder (RCA)
  o Carry Skip Adder (CSkA)
  o Carry Select Adder (CSel)
  o Conditional Sum Adder (CSum)
  o Carry Lookahead Adder (CLA)

  Figure: CPA block with inputs A, B, c_in and outputs S, c_out

2.2.1 Ripple Carry Adder (RCA)

● Serial arrangement of full adders (FA)
● Simplest, smallest and slowest CPA
  Metric: A = 7n, T = 2n
● Carry speed-up strategies for CPAs:
  o Type A: Partitioning in groups of shorter CPAs (with fast c_in → c_out)
  o Type B: Parallelization using tree structure

  Figure: chain of n full adders, bit 0 with c_in up to bit n−1 with c_out
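A bit-level Python sketch of the ripple carry adder follows. It chains n full adders exactly as described by $a_i + b_i + c_i = s_i + 2 \cdot c_{i+1}$; the function names are hypothetical and operands are assumed to be unsigned n-bit integers.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum bit, carry out) using the generate/propagate form."""
    p = a ^ b
    return p ^ c, (a & b) | (p & c)


def ripple_carry_add(a: int, b: int, n: int, cin: int = 0) -> tuple[int, int]:
    """Add two unsigned n-bit numbers; returns (n-bit sum, carry out)."""
    carry = cin
    s = 0
    for i in range(n):
        # The carry ripples through the chain: stage i needs the carry of stage i-1.
        s_i, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        s |= s_i << i
    return s, carry


print(ripple_carry_add(0b1011, 0b0110, 4))
```

The sequential loop mirrors the critical path: the delay grows linearly with n because each stage waits for the carry of the previous one.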

2.2.2 Carry Skip Adder (CSkA)

● Type A
● Idea: determine for each group in parallel whether
  a) c_in generates c_out -or-
  b) c_in can skip this group ("skip + propagate")

  Figure: chain of k-bit CPA groups (bits k−1:0, i−1:k, n−1:j) with group propagate signal P_{i−1:k} controlling a multiplexer (inputs c'_i and c_k, output c_i) between neighboring groups

  $c_i = (\bar{P}_{i-1:k} \wedge c'_i) \vee (P_{i-1:k} \wedge c_k)$
  Group propagate: $P_{i-1:k} = p_{i-1} \wedge p_{i-2} \wedge \dots \wedge p_k$ (requires a k-input AND gate for each group)
  Bit propagate: $p_i = a_i \oplus b_i$
● Critical path in a given group = k bits
● Function of $P_{i-1:k}$:
  o $P_{i-1:k} = 0 \Rightarrow c_k$ doesn't affect $c_i$ → propagate $c'_i$ to mux output
  o $P_{i-1:k} = 1 \Rightarrow c_k$ determines $c_i$ → $c_k$ skips the group and is propagated to the mux input

● Question: Which group size is optimal w.r.t. delay?
● Assumptions:
  o fixed k for all groups → n/k groups of same size
  o $T_{Mux} = 1$, $T_{Carry} = 2$

  $T_{CSkA} = k \cdot T_{Carry} + (n/k - 2) \cdot T_{Mux} + k \cdot T_{Carry}$
  $= 2 \cdot k \cdot T_{Carry} + (n \cdot k^{-1} - 2) \cdot T_{Mux}$
  $= 4 \cdot k + n \cdot k^{-1} - 2$

  $\frac{dT_{CSkA}}{dk} = 4 - n \cdot k^{-2} = 0 \Rightarrow k_{opt} = \frac{1}{2}\sqrt{n}$

  $T_{CSkA,opt} = 2\sqrt{n} + \frac{n}{\frac{1}{2}\sqrt{n}} - 2 = 2\sqrt{n} + 2\sqrt{n} - 2 = 4\sqrt{n} - 2 = O(\sqrt{n})$

● Further improvements
  o Faster CPAs, e.g. multi-staged CSkA
  o Variable group size, overlap T_CPA + T_Mux with T_CPA of the next group → larger middle groups. Note that the sum time in the last group depends on its group size and can only start after all preceding operations have finished → for n = 32 bit choose 1,2,3,4,5,6,5,3,2,1
  o Cost compared to RCA: 1 XOR/bit, (1 AND + 1 Mux)/group

  Metric: A = 8n, T ≈ 4√n
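The group size optimization can be reproduced numerically. The Python sketch below assumes the delay model $T(k) = 2k \cdot T_{Carry} + (n/k - 2) \cdot T_{Mux}$ with $T_{Carry} = 2$ and $T_{Mux} = 1$ and, as an additional simplification, only considers group sizes k that divide n. The function names are hypothetical.

```python
import math


def cska_delay(n: int, k: int, t_carry: int = 2, t_mux: int = 1) -> float:
    """Delay model: ripple through first and last group, skip the n/k - 2 groups between."""
    return 2 * k * t_carry + (n / k - 2) * t_mux


def best_group_size(n: int) -> int:
    """Exhaustive search over all divisors of n (assumption: equal-sized groups)."""
    candidates = [k for k in range(1, n + 1) if n % k == 0]
    return min(candidates, key=lambda k: cska_delay(n, k))


for n in (16, 64, 256):
    k = best_group_size(n)
    print(n, k, cska_delay(n, k), 0.5 * math.sqrt(n), 4 * math.sqrt(n) - 2)
```

For word lengths where $\frac{1}{2}\sqrt{n}$ is an integer divisor of n, the search returns exactly that group size and the closed-form delay $4\sqrt{n} - 2$.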

2.2.3 Carry Select Adder (CSel)

● Type A
● Idea:
  a) Compute c_out and s_out for both possible c_in in each k-bit group
  b) The actual c_in selects the corresponding output and propagates the result

  Figure: k-bit groups, each with two adders (c_in = 0 and c_in = 1) and multiplexers selected by c_k for sum S_{i−1:k} and carry c_i; lowest group with inputs a_{k−1:0}, b_{k−1:0}, c_in and output S_{k−1:0}

  $s_{i-1:k} = (\bar{c}_k \wedge s^0_{i-1:k}) \vee (c_k \wedge s^1_{i-1:k})$
  $c_i = (\bar{c}_k \wedge c^0_i) \vee (c_k \wedge c^1_i)$

● Optimal group size (like CSkA): $k = \frac{1}{2}\sqrt{n}$, $T_{CSel} = O(\sqrt{n})$
● Further improvements:
  o Faster CPA, e.g. multi-staged CSel
  o Variable group size k
    Overlapping T_CPA and T_Mux with T_CPA of the next group
    Increase group size by one bit per group from LSB to MSB, e.g., for 28 bit: 7,6,5,4,3,2,1
● Cost compared to RCA
  o 1 "sum-Mux"/bit + (CPA + "carry-Mux")/group
  o Note: no duplication of the whole CPA
    A ⊕ B can be reused -or-
    Use a Binary-to-Excess-1 Code (BEC) converter for a simpler block with c_in = 1

  Metric: A = 14n, T ≈ 3√n

2.2.3 Carry Select Adder with BEC (extra stuff)

● Binary-to-Excess-1 Code (BEC), e.g. for 4 bit, realizes an increment by one
● Simple generation
● Replaces the block with c_in = 1, requires less area than the standard structure
● Input: sum output of the block with c_in = 0

2.2.4 Conditional Sum Adder (CSum)

● Hybrid of Type A (similar to CSel but with 1-bit groups) and Type B (tree)
● Parallel propagation of 1-bit groups using a tree structure (instead of sequential propagation of carries of k-bit groups)
● Fastest and most costly CPA exploiting maximum parallelism
  o n summand bits are propagated by a mux tree → depth: $\log_2 n$
  o Cost: $2 \cdot RCA + 2 \cdot \log_2 n$ Mux/bit

  Metric: $A = 3n \cdot \log_2 n = O(n \cdot \log n)$, $T = 2 \cdot \log_2 n = O(\log n)$

2.2.5 Carry Lookahead Adder (CLA)

● Type B:
  o parallel tree structure
  o all carries are pre-computed
  o if too expensive for large n → partitioning into k-bit groups
  o hierarchical arrangement in ½ log n stages
● Implementations
  o Kogge-Stone (1973): fast, long wires, irregular layout
  o Brent-Kung (1982): much more regular, a bit slower
  o Han-Carlson (1987): compromise between KS and BK
  o Ling/Sklansky (1981): large fanout to compute higher bits
  o Ladner-Fischer (1980): compromise between Ling and BK

  Metric: $A = O(n \cdot \log n)$, $T = O(\log n)$

● Carry Lookahead Block (CLB): inputs $c'_0$ and $(g_0, p_0) \dots (g_{n-1}, p_{n-1})$, outputs $c_0 \dots c_{n-1}$ and block generate & propagate $(g'_{n-1}, p'_{n-1})$
  $c_0 = c'_0$
  $c_1 = g_0 \vee (p_0 \wedge c'_0)$
  $c_2 = g_1 \vee (p_1 \wedge c_1) = g_1 \vee (p_1 \wedge g_0) \vee (p_1 \wedge p_0 \wedge c'_0)$
  $c_3 = g_2 \vee (p_2 \wedge g_1) \vee (p_2 \wedge p_1 \wedge g_0) \vee (p_2 \wedge p_1 \wedge p_0 \wedge c'_0)$
  $g'_3 = g_3 \vee (p_3 \wedge g_2) \vee (p_3 \wedge p_2 \wedge g_1) \vee (p_3 \wedge p_2 \wedge p_1 \wedge g_0)$
  $p'_3 = p_3 \wedge p_2 \wedge p_1 \wedge p_0$
  …
● Example for 16b addition
  o Kogge-Stone
  o Han-Carlson
  o Brent-Kung
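As an illustration of the parallel-prefix idea behind these implementations, the following Python sketch models a Kogge-Stone style carry computation. It is a behavioural model under simple assumptions (unsigned n-bit operands, generate/propagate pairs combined with the operator $(G, P) \circ (G', P') = (G \vee (P \wedge G'),\ P \wedge P')$), not a gate-accurate netlist; the function name is hypothetical.

```python
def kogge_stone_add(a: int, b: int, n: int, cin: int = 0) -> tuple[int, int]:
    """Add two unsigned n-bit numbers with a log-depth prefix computation of the carries."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]     # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # bit propagate
    G, P = g[:], p[:]
    dist = 1
    while dist < n:
        # One prefix level: every position i >= dist combines with position i - dist.
        G_new, P_new = G[:], P[:]
        for i in range(dist, n):
            G_new[i] = G[i] | (P[i] & G[i - dist])
            P_new[i] = P[i] & P[i - dist]
        G, P = G_new, P_new
        dist *= 2
    # Now (G[i], P[i]) cover bits i..0, so c_{i+1} = G[i] OR (P[i] AND c_in).
    carries = [cin] + [G[i] | (P[i] & cin) for i in range(n)]
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return s, carries[n]


print(kogge_stone_add(0xFFFF, 0x0001, 16))
```

The `while` loop runs about $\log_2 n$ times, which corresponds to the logarithmic delay of the tree, while every level touches up to n positions, which corresponds to the $O(n \cdot \log n)$ area.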



2.2.6 Asynchronous Adder

● a.k.a. Carry Completion Adders
● detects the end of carry propagation and generates a carry-completion signal (indicates validity of sum bits)
● $T_{carry,mean} \approx \log_2 n$ stages for an n-bit adder
● Pros:
  o simple RCA with $O(\log n)$ average delay
  o well suited for resource limited architectures/cascaded additions, e.g. crypto hardware on smartcards
● Cons:
  o only for asynchronous (self-timed) systems
  o extra hardware for carry completion logic

  Metric: A = 8n, $T_{mean} = 2 \log_2 n$, $T_{max} = 2n$

2.3 Non-Carry-Propagate Adders

2.3.1 Carry Save Adder (CSA)

● delay is independent of width
● adds 3 n-bit numbers without carry propagation
● carry is saved in Carry-Save representation
● Operands can be
  o Three 2's complement (TC) numbers or
  o One 2's complement number + one CS number

  $a_{0,i} + a_{1,i} + a_{2,i} = 2 c_{i+1} + s_i \ ; \ i = 0 \dots n-1$
  $A_0 + A_1 + A_2 = C + S = (C, S)$

● Built out of n full adders
● 3 input vectors are merged into 2 output vectors
● also called: (3,2)-compressor

  Figure: row of n full adders, bit i with inputs a_{0,i}, a_{1,i}, a_{2,i} and outputs s_i, c_{i+1}

  Metric: A = 7n, T = 4 (constant!)
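The carry save addition can be modelled bitwise in Python. The sketch assumes three unsigned n-bit operands; each bit position acts as an independent full adder, so there is no carry chain. The function name is hypothetical.

```python
def carry_save_add(a0: int, a1: int, a2: int, n: int) -> tuple[int, int]:
    """Compress three n-bit numbers into a carry vector C and a sum vector S."""
    mask = (1 << n) - 1
    s = (a0 ^ a1 ^ a2) & mask                            # sum bit of each full adder
    majority = ((a0 & a1) | (a0 & a2) | (a1 & a2)) & mask
    c = majority << 1                                    # carry of bit i has weight 2^(i+1)
    return c, s


c, s = carry_save_add(200, 100, 55, 8)
print(c, s, c + s)   # c + s is the vector merging addition done by a final CPA
```

Only the final `c + s` needs a carry propagate adder; the compression itself has constant depth regardless of n.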

2.3.2 Redundant Binary Adder (RBA)

● Summation of numbers in Signed Digit Representation (RBA: base r = 2)
  $S = A + B$ with $a_i, b_i, s_i, d_i, z_i \in \{-1, 0, 1\}$
  $a_i + b_i = 2 d_{i+1} + z_i$
  $z_i + d_i = s_i$
  $d_i$ … intermediate carry
  $z_i$ … intermediate sum
● Similar to carry save
● $s_i = z_i + d_i$ is carry-free iff [Takagi, 1987]:

  a_i  b_i              a_{i-1}  b_{i-1}     d_{i+1}   z_i
   1    1               X        X            1         0
   1    0  or  0  1     both ≥ 0              1        -1
                        else                  0         1
   a_i + b_i = 0        X        X            0         0
   0   -1  or -1  0     both ≥ 0              0        -1
                        else                 -1         1
  -1   -1               X        X           -1         0

  X → don't care, both ≥ 0 → $a_{i-1} \ge 0$ and $b_{i-1} \ge 0$
● Digit i only affects digits i+1 and i+2 → no carry propagation

  Metric: A = 7n, T ≅ 2 FA

● Examples:



01111 𝑎00001 𝑏

011111 𝑑01110 𝑧10000 𝑠

01111 𝑎00111 𝑏

000011 𝑑01000 𝑧01010 𝑠


00111 𝑎01111 𝑏

010011 𝑑01000 𝑧

11010 𝑠

● Comparison: CSA vs. RBA

                  Carry Save Adder                              Redundant Binary Adder
  From 2's C      direct conversion, no HW needed               direct conversion, no HW needed
  To 2's C        add carry and sum vectors (C+S)               split SD number in positive (1, 0) and negative (-1, 0) part → subtraction
  Functionality   CS-cell = FA = (3,2)-compressor,              RB-cell = (4,2)-cell,
                  adds CS+2C or 2C+2C+2C,                       adds RB+RB
                  cascaded cell → (4:2)-compressor for CS+CS
  Complexity      ~equal at same functionality
                  22 transistors (3:2),                         42 transistors (4:2),
                  available in libraries                        availability depends on library

2.3.2.1 Overflow in Redundant Binary Adders

● There are real overflow situations, but
● also pseudo overflow
● Overflow depends on the MSD of the sum $s_{n-1}$ and the intermediate carry $d_n$
● Example:



111 𝑎 1111 𝑏 1

1111 𝑑 20000 𝑧

110 𝑠 6

𝑑 𝑑 𝑑 … 𝑑𝑠 𝑠 … 𝑠 𝑠

● Pseudo overflow is prevented by a correction rule for $s_{n-1}$:

  d_n    s_{n-1}    Overflow     s'_{n-1}
   1     -1         Pseudo        1
   1      0         Potential
   1      1         Real         avoid or use saturation
  -1     -1         Real         avoid or use saturation
  -1      0         Potential
  -1      1         Pseudo       -1
   0      X         no           s_{n-1}

● Can be implemented without speed loss in the MSD of the RBA
● Real overflow needs to be handled on system or algorithmic level
● Potential overflow
  o needs detection on lower bits, or
  o limit the magnitude of all numbers to < $2^{n-2}$, or
  o increase the word length by one

2.4 Multi-Operand Adder

● Summation of 3 or more ($m \ge 3$) n-bit operands
● Result requires $n + \lceil \log_2 m \rceil$ non-redundant bits

2.4.1 Multi-operand addition using adder array

a) Linear array of CPAs (example: 4-operand RCA)

  Figure: (m−1) CPAs (CPA 1, CPA 2, CPA 3) built from rows of FAs and HAs over the bit columns n−1, ..., 2, 1, 0, adding the operands a_0, a_1, a_2, a_3 to the outputs s_n, s_{n−1}, ..., s_0

b) Linear array of CSAs and final CPA (example: RCA)





● Evaluation:
  o same delay for a) and b), but
  o Type a): a fast final CPA (e.g. CSum) has to wait for operand arrival, delay is $O(n + m)$
  o Type b): delay $O(m + \log n)$
  o for high performance always use b), type a) is expensive/useless
● Generic scheme for b):

  Figure: operands A0 ... A4 (2's complement) fed into a chain CSA1 → CSA2 → CSA3 → CPA

  $A = (m - 2) \cdot A_{CSA} + A_{CPA}$
  $T = (m - 2) \cdot T_{CSA} + T_{CPA}$
  For a logarithmic CPA:
  $A = O(m \cdot n + n \cdot \log n)$
  $T = O(m + \log n)$

2.4.2 (m:2)-compressors

● Idea:
  o one column of 2.4.1 b) without the terminating CPA
    compresses m input bits to 2 output bits
    propagates (m−3) carries to the left-hand column
● No horizontal carry propagation
● Uses FAs = (3:2)-compressors or (4:2)-cells in a linear array or tree structure
  o Example: 4-operand adder with (4:2)-adders

  $A = 7 \cdot (m - 2)$
  $T_{linear} = 4 \cdot (m - 2)$
  $T_{tree} = 6 \cdot (\log_2 m - 1)$

● Implementation of (4:2)-adders:
  o 2 full adders: A = 14, T = 8
  o Optimized structure using a tree of XOR gates: A = 16, T = 6

  Figure: (4:2)-adder with inputs a0, a1, a2, a3, c_in and outputs C, S, c_out, built from two cascaded FAs (left) or from XOR gates and multiplexers (right)

● Advantages of (4:2) versus (3:2) in (m:2) compressors
  o 4:2 instead of 3:2 (sic!)
  o reduced depth
  o regular layout
● Example: (8:2)-compressor



2.4.3 Multi-operand addition using adder trees

● Using n-bit m-operand redundant adders
● Tree structure
● Each adder consists of n-bit (m:2)-compressors
● Fastest multi-operand adders:
  o adder tree + log2(n)-CPA

  $A = A_{(m:2)} \cdot n + A_{CPA} = O(m \cdot n + n \cdot \log n)$
  $T = T_{(m:2)} + T_{CPA} = O(\log m + \log n)$

● Wallace Tree (1964)
  o Redundant adder = CSA (3:2)
● Trees are faster than arrays with the same number of gates
● But: trees require irregular wiring → increased area

  Figure: Wallace tree built from (3:2) compressors and (4:2)-tree built from (4:2) compressors

2.5 Sequential Adders

2.5.1 LSB-first serial adder

● Bitwise adding of 2 n-bit numbers, starting from the LSB
● Pros:
  o Small
  o Serial communication
  o Cascadable (LSB-In → LSB-Out)
● Cons:
  o Needs temporary storage → flipflop
  o Latency: n cycles

  $A = A_{FA} + A_{FF}$
  $T = T_{FA} + T_{FF}$
  $L = n \cdot T$

2.5.2 MSD-first serial adder (digit online arithmetic)

● Bitwise adding of 2 n-bit numbers, starting from the MSD
● Seems impossible, but can be derived from parallel (4:2)-adders in CS (SD as well)
● $a_i, b_i, a_{i-1}, b_{i-1}, a_{i-2}, b_{i-2}$ must be known to compute $s_i$
● Thus, this "Digit Online Addition" has an online delay of $\delta = 2$
● Comparison to LSB-first adder:
  o Needs conversion to 2's complement
  o More wiring (2 wires/digit)
  o Slower: $\delta = 2$
● Why digit online technique?
  o Add, Sub, Mult are "natural" LSB-first-In/Out operations
  o But: Division and more complex functions are MSD-first-In/Out
    all n input digits have to be known
    LSB-first-In → wait for n cycles → MSB-first-Out
  o Digit online is better suited for mixed and concatenated operations of Add, Div, Log, Sub…
    all operations can be performed MSD-first, but not using LSB-first
    MSD-first-In → wait $\sum \delta$ cycles → MSD-first-Out
    Lower overall latency (even than for parallel operations!) is possible due to overlapping input and output digits
    But: throughput is typically lower than with parallel operations
● Online delays $\delta$ of basic operations



2.5.3 Accumulator

● adds m n-bit operands in parallel
● a)
$A = A_{CPA} + A_{Reg}$
$T = T_{CPA} + T_{Reg}$
$L = m \cdot T$
● b)
$A = A_{CSA} + A_{Reg} + A_{CPA}$
$T = T_{CSA} + T_{Reg}$
$L = m \cdot T$
→ b) much faster when using a pipelined CPA

2.6 Adder-based Operations

● Increment / Decrement
● Counter (feedback increment)
● Comparators (=, <, >, ⋯)
  o TC, SD, CS: $T \sim \log n$
● Detect leading zeroes
  o TC, SD, CS: $T \sim \log n$
● Determine flag bits in processors




Part 3: Multiplication


● 3.1 Fundamentals
  o Unsigned Multiplication, 2’s Complement Multiplication

● 3.2 Unsigned Braun-Array Multiplier

● 3.3 Signed Pezaris-Array Multiplier

● 3.4 Booth Multiplier

● 3.5 Booth-Wallace Multiplier

● 3.6 Evaluation

Outline


3.1 Fundamentals

3.1.1 Unsigned Multiplication

● Like paper-and-pencil multiplication
● Multiplication of 2 n-bit operands A and B yields a 2n-bit product
● Multiply algorithm (shift-and-add):
  1) Generate n partial products $P_i$
  2) Sum up all partial products $P_i$

$P = A \cdot B = \sum_{i=0}^{n-1} a_i 2^i \cdot \sum_{j=0}^{n-1} b_j 2^j = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j \cdot 2^{i+j}$
$P_i = a_i \cdot B$, $P = \sum_{i=0}^{n-1} P_i 2^i$
(see 1.6.2.2 Recursive, associative function)

Note: $P$ = product, $P_i$ = partial product, $p_i$ = bit $i$ of product

a) Recursive (shift-and-add) using one accumulator
b) Serial (shift-and-add) using linear array of CSAs
  o All $P_i$ are generated in parallel


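3.1.1 Unsigned Multiplication (cont’d)

[Figure a): recursive multiplier; in cycle i = 0, ..., n-1 the partial product $a_i \cdot B$ (shifted left by i bits) is added by a CPA to the 2n-bit register Reg (clocked by CLK), which finally holds P]
Metric: $A = O(n \cdot \log n)$, $T = O(\log n)$, $L = n$

[Figure b): linear array of CSAs; the partial products for a0 ... a3 are accumulated in carry-save form (carry and sum), a final CPA with 4n input bits delivers the 2n-bit product]
Metric: $A = O(n^2)$, $T = O(n + \log n)$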

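3.1.1 Unsigned Multiplication (cont’d)

c) Parallel using multi-operand adder (tree structure)

[Figure c): partial products for a0 ... a3 are generated from A and B in parallel and reduced by a CSA tree; a final CPA delivers the 2n-bit product]
Metric: $A = O(n^2)$, $T = O(\log n)$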
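The shift-and-add scheme $P = \sum_i P_i 2^i$ with $P_i = a_i \cdot B$ can be sketched in a few lines. This is a behavioural model of variant a) with one accumulator, not a hardware description; the function name `shift_and_add` is an assumption.

```python
def shift_and_add(a, b, n):
    """Unsigned multiplication of two n-bit operands by shift-and-add."""
    p = 0  # accumulator for the 2n-bit product
    for i in range(n):
        a_i = (a >> i) & 1      # bit i of multiplier A
        partial = a_i * b       # partial product P_i = a_i * B
        p += partial << i       # add P_i shifted left by i bits
    return p
```

Variants b) and c) compute the same sum; they differ only in how the n partial products are added (linear CSA array or CSA tree).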

3.1.2 Two’s Complement Multiplication

● Option 1
  o Complement operands before and result after multiplication
    → Unsigned multiplication algorithm applicable
● Option 2
  o Use dedicated two’s complement multipliers, e.g., Braun, Pezaris, Baugh-Wooley


3.2 Unsigned Braun-Array Multiplier

● E.g., for 4-bit operands

               a0b3 a0b2 a0b1 a0b0
          a1b3 a1b2 a1b1 a1b0
     a2b3 a2b2 a2b1 a2b0
a3b3 a3b2 a3b1 a3b0
p7   p6   p5   p4   p3   p2   p1   p0

Metric: $A = 8n^2 - 11n = O(n^2)$, $T = 6n - 9 = O(n)$

[Figure: Braun cell with inputs ai, bi and output pi; separate cell types for pi (MSBs) and pi (LSBs)]

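3.2 Unsigned Braun-Array Multiplier (cont’d)

● 4-bit Braun-Array multiplier

[Figure: 4-bit Braun array with inputs a0 ... a3 and b0 ... b3; a CSA array of three rows of FAs and one row of HAs (regions marked 1, 2, 3) is followed by a CPA; outputs p0 ... p3 from the array, p4 ... p7 from the CPA]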

3.3 Signed Pezaris-Array Multiplier

● Modified Braun-Array multiplier, here shown for 4-bit operands
● MSB = sign bit → value = -1

                 -a0b3 a0b2  a0b1  a0b0
           -a1b3 a1b2  a1b1  a1b0
     -a2b3 a2b2  a2b1  a2b0
a3b3 -a3b2 -a3b1 -a3b0
p7   p6    p5    p4    p3    p2    p1    p0

● Four cases for partial product Pi
  a) 3 pos. operands → regular FA
     $a + b + c = 2 c_{out} + s$
  b) 2 pos., 1 neg. operands
     o $-a + b + c = 2 c_{out} - s$, with $-1 \le sum \le 2$
     o Weight of sum-bit: -1
     o Weight of cout: +2
  c) 1 pos., 2 neg. operands
     o $a - b - c = -2 c_{out} + s$, with $-2 \le sum \le 1$
     o Weight of sum-bit: +1
     o Weight of cout: -2
  d) 3 neg. operands → logically identical to a) → identical implementation: regular FA

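3.3 Signed Pezaris-Array Multiplier (cont’d)

● b) and c) have same implementation
● Approach: replace FA in regions 1, 2, and 3 with modified FA (input a = •)
● Same structure as Braun multiplier (except modified FA)

$s = a \oplus b \oplus c$
$c_{out} = (a \wedge b) \vee (a \wedge c) \vee (b \wedge c)$ (regular FA)
$c_{out} = (\bar{a} \wedge b) \vee (\bar{a} \wedge c) \vee (b \wedge c)$ (modified FA)

[Figure: 4-bit array as for the Braun multiplier, with inputs a0 ... a3 and b0 ... b3, rows of FAs and HAs forming the CSA, final CPA, outputs p0 ... p7; regions 1, 2 and 3 use the modified FA]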
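The claim that cases b) and c) share one implementation can be checked exhaustively. The sketch below assumes that the modified FA keeps the regular sum $s = a \oplus b \oplus c$ and inverts input a in the carry logic, with a being the negatively weighted input; both the assumption and the function names are illustrative, not taken from the slides.

```python
def modified_fa(a, b, c):
    """Modified full adder: input a is inverted in the carry logic (assumption)."""
    s = a ^ b ^ c
    na = 1 - a
    cout = (na & b) | (na & c) | (b & c)
    return s, cout


def check_cases_b_and_c():
    """Verify -a + b + c = 2*cout - s (case b) and a - b - c = -2*cout + s (case c)."""
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                s, cout = modified_fa(a, b, c)
                if -a + b + c != 2 * cout - s:
                    return False
                if a - b - c != -2 * cout + s:
                    return False
    return True
```

Case c) is the arithmetic negation of case b), which is why the same cell serves both; only the interpretation of the output weights changes.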

3.4 Booth Multiplier

● Observation: multiplication delay $= f(\#\,\text{partial products } P_i) = f(n)$
  o For every 0 in ai one row can be omitted in array!
  o Recoding of ai to maximize number of 0’s ($a_i \in \{0, 1\} \to a_i' \in \{\bar{1}, 0, 1\}$)
● Two possibilities:
  a) ai always constant: CSD-Recoding (1/3 of area on average)
  b) ai variable: modified Booth-Encoding (1/2 of area) → Booth Multiplier
● Note: “horizontal” data compression can be achieved with Dadda-multiplier (Booth = “vertical” compression)

[Figure: Booth multiplier; modified Booth recoding of ai into ai’, parallel calculation of the n/2 partial products Pi from B, CSA array, final CPA]

Metric: $A = O(n^2)$, $T = O(n + \log n)$
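How a recoding yields n/2 partial products can be illustrated with the common radix-4 form of modified Booth recoding. The sketch assumes that form (digits in {-2, ..., 2} derived from overlapping bit triplets) and an even word length n; the slides do not spell out the encoding table, and the function names are hypothetical.

```python
def booth_radix4_digits(a, n):
    """Recode an n-bit two's complement value (n even) into n/2 digits, LSD first."""
    assert n % 2 == 0
    pattern = a & ((1 << n) - 1)   # n-bit two's complement bit pattern of a
    digits = []
    prev = 0                       # implicit bit a_{-1} = 0
    for i in range(0, n, 2):
        low = (pattern >> i) & 1
        high = (pattern >> (i + 1)) & 1
        digits.append(low + prev - 2 * high)  # digit in {-2, -1, 0, 1, 2}
        prev = high
    return digits


def booth_multiply(a, b, n):
    """Signed product from only n/2 partial products d_k * b * 4**k."""
    return sum(d * b * 4 ** k for k, d in enumerate(booth_radix4_digits(a, n)))
```

For instance, `booth_radix4_digits(7, 4)` gives `[-1, 2]`, i.e. 7 = -1 + 2 · 4, so only two partial products are needed for a 4-bit multiplier operand.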

3.5 Booth-Wallace Multiplier

● Take the Booth multiplier and replace the CSA array with a Wallace tree (see 2.4.3)

Metric: $A \approx (5 \ldots 6) \cdot n^2$; $T_{CSA\,tree} = O(\log n)$, $T_{CPA} = O(\log n)$ → $T \approx 2 \cdot \log n$

3.6 Evaluation of multiplier architectures

Architecture | Throughput | Latency | Area | Regularity | Pipelining
Recursive | -- | o | ++ | - (control needed) | --
Braun | + | o | o | ++ | ++
Booth | + | + | o | + | +
Booth-Wallace | + | ++ | - | -- | +



Part 4: Division


● 4.1 Definitions

● 4.2 Fundamentals

● 4.3 Restoring Division

● 4.4 Non-Restoring Division

● 4.5 SRT Division

● 4.6 Multiplicative Division

● 4.7 Evaluation

Outline


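4.1 Definitions

$\frac{A}{B} = Q + \frac{R}{B} \to A = Q \cdot B + R$, with $R < B$; $R = A \bmod B$

$A \in [0, 2^{2n} - 1]$
$B, Q, R \in [0, 2^n - 1]$, $B \ne 0$
$Q < 2^n \to B \in [2^{n-1}, 2^n - 1] \to A < 2^n \cdot B$
(avoid overflow: pre-normalize B and A)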


4.2 Fundamentals

● Like paper-and-pencil division, dividend : divisor = quotient
● Steps:
  a) Compare left-shifted divisor with dividend
  b) Subtract conditionally to get partial remainder
  c) Go to a) with partial remainder as dividend
  → Subtract-and-shift algorithm
  → Sequential, not associative, no parallelism
● Decimal example (A : B with quotient digits qi and partial remainders Ri):

0.75 : 0.875 = 750 : 875 = 0.857...
  7500
- 7000    (8 · 875)
   5000
 - 4375   (5 · 875)
    6250
  - 6125  (7 · 875)
     1250


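4.2 Fundamentals (cont’d)

● Basic algorithm for all subtract-and-shift division algorithms, for $i = n-1, \ldots, 0$:
  a) compare $R_{i+1}$ with $2^i \cdot B$ to select $q_i$
  b) $R_i = R_{i+1} - q_i \cdot 2^i \cdot B$
  c) continue with $R_i$
  Initialization: $R_n = A$
  Remainder after iteration: $R = R_0$
● Division methods differ in selecting qi and in whether redundant adders are used!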

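4.3 Restoring-Division

$q_i = 1$ iff $R_{i+1} - B \cdot 2^i \ge 0$, $q_i = 0$ iff $R_{i+1} - B \cdot 2^i < 0$; $q_i \in \{0, 1\}$

● e.g.:
● index i: $R_{i+1} - B \cdot 2^i < 0 \to q_i = 0$; $R_i = R_{i+1}$
● index i-1: $R_{i+1} - B \cdot 2^{i-1} \ge 0 \to q_{i-1} = 1$; $R_{i-1} = R_{i+1} - B \cdot 2^{i-1}$ (cf. 4.4)

If the remainder is too small for the divisor, the current iteration result ($R_{i+1} - B \cdot 2^i$) is discarded. Instead, a ‘0’ is appended to the quotient (identical to shifting the divisor one position less toward the MSB).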

4.3 Restoring-Division (cont’d)

● Two implementation options in case $q_i = 0$:
  1. $R_i = R_{i+1} - B \cdot 2^i$, then $R_i' = R_i + B \cdot 2^i$
     → requires addition to restore $R_{i+1}$
  2. $R_i = R_{i+1} - B \cdot 2^i$, then $R_i' = R_{i+1}$
     → save $R_{i+1}$ before subtraction
     → “restoring” with additional mux and register
● Option 2 preferable: subtract in each case and restore from register if necessary
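Restoring division with option 2 (subtract in every step, restore from a saved copy if the result is negative) can be sketched as follows. The index convention (start with $R_n = A$, iterate $i = n-1, \ldots, 0$, require $A < 2^n \cdot B$) and the function name are assumptions made for this illustration.

```python
def restoring_divide(a, b, n):
    """Return (q, r) with a = q*b + r and 0 <= r < b, for 0 <= a < 2**n * b."""
    assert b > 0 and 0 <= a < (b << n)
    r = a                          # initialization: R_n = A
    q = 0
    for i in range(n - 1, -1, -1):
        saved = r                  # option 2: save R before subtraction
        r -= b << i                # trial subtraction of B * 2**i
        if r < 0:
            r = saved              # restore from register, q_i = 0
        else:
            q |= 1 << i            # subtraction accepted, q_i = 1
    return q, r
```

`restoring_divide(100, 7, 4)` returns `(14, 2)`; the `saved` variable plays the role of the additional register and mux.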

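4.4 Non-Restoring Division

$q_i' = 1$ iff $R_{i+1} \ge 0$, $q_i' = \bar{1}$ iff $R_{i+1} < 0$; $q_i' \in \{\bar{1}, 1\}$ ($\bar{1} = -1$)

● index i: $R_{i+1} \ge 0 \to q_i' = 1$; $R_i = R_{i+1} - B \cdot 2^i$
● index i-1: $R_{i+1} - B \cdot 2^i < 0 \to q_{i-1}' = \bar{1}$; $R_{i-1} = R_{i+1} - B \cdot 2^i + B \cdot 2^{i-1} = R_{i+1} - B \cdot 2^{i-1}$ (cf. 4.3)
● Note: Evaluate the sign, subtract or add, and correct by addition in the next steps until the partial remainder is positive again. The identical red and green terms and the algebraically equivalent q in 4.3 and 4.4 show the identity of both methods:

$q_i q_{i-1} = 01$ (4.3) corresponds to $q_i' q_{i-1}' = 1\bar{1}$ (4.4)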

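4.4 Non-Restoring Division (cont’d)

● Conversion of $Q' = (q_{n-1}', \ldots, q_0')$ to two’s complement representation:
  $q_i' \in \{\bar{1}, 1\} \to q_i \in \{0, 1\}$, with $q_i = 0$ if $q_i' = \bar{1}$ and $q_i = 1$ if $q_i' = 1$ ($\bar{1} = -1$)
  $Q = (\bar{q}_{n-1}, q_{n-2}, \ldots, q_0, 1)$
  → Q’ is not redundant, no CPA req’d

[Figure: array of CPAs with sign detection (≥0) per row; inputs A and B, outputs Q’ and Ri, final correction of R]

● Implementation:
  o For sign detection a non-redundant adder is mandatory
  o Last remainder needs to be corrected

Metric (CPA = RCA ... CLA):
$A = (n + 1) \cdot A_{CPA} = O(n^2) \ldots O(n^2 \log_2 n)$
$T = (n + 1) \cdot T_{CPA} = O(n^2) \ldots O(n \log_2 n)$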
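Non-restoring division, including the digit conversion and the final remainder correction, can be sketched as below. The sketch assumes the same conventions as for restoring division ($R_n = A$, $i = n-1, \ldots, 0$, $0 \le A < 2^n \cdot B$, $B > 0$); names are illustrative.

```python
def nonrestoring_divide(a, b, n):
    """Return (q, r) with a = q*b + r and 0 <= r < b, for 0 <= a < 2**n * b."""
    assert b > 0 and 0 <= a < (b << n)
    r = a
    p = 0  # bit i of p is 1 for digit q_i' = +1 and 0 for q_i' = -1
    for i in range(n - 1, -1, -1):
        if r >= 0:
            r -= b << i        # q_i' = +1: subtract
            p |= 1 << i
        else:
            r += b << i        # q_i' = -1: add, no restoring step
    # Conversion: Q' = sum q_i' * 2**i with q_i' = 2*p_i - 1 gives 2*p - (2**n - 1),
    # which is the bit pattern (not p_{n-1}, p_{n-2}, ..., p_0, 1) in two's complement.
    q = 2 * p - ((1 << n) - 1)
    if r < 0:                  # correction of the last remainder
        r += b
        q -= 1
    return q, r
```

The loop never restores; a negative partial remainder is compensated by the addition in the following step, and only the last remainder needs an explicit correction.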

● Extension to signed 2’s complement division:

● Example: 2’s complement array divider (B>0, no correction of R)o XOR gates for sign evaluation

o Partial remainder Ri would tend to 0. o Note: Ri is kept in about the same range during iteration. Thus,

rounding errors are reduced.



𝑞 ′1 iff 𝑅 , 𝐵 have same sign

1 iff 𝑅 , 𝐵 have different sign 𝑞 ∈ 1, 1

𝑋𝑂𝑅 0 → different signs → 𝑅 𝑅 ⋯ 1 → identical signs → 𝑅 𝑅 ⋯

𝑅 𝑅 · 2 𝑞 · 2 · 𝐵

bi 𝑎6⊕b3

o a2, a1, a0 are fetched consecutively

● Shifted array of CAS cells (Controlled Adder/Subtractor)o XOR gates included in CAS cells



4.5 SRT-Division

● Sweeney, Robertson, Tocher (~1958)
● Use redundant adders
● Problem: Fast detection of sign in redundant number, without:
  o Evaluation of all digits or
  o Conversion to 2’s complement
● Example:
  o 00011XX → no sign detection from MSD, same problem for CS-numbers
● Solution: Evaluate a few leading digits of partial remainder
  o If 0: number is small enough to assume $q_i = 0$ (without diverging iteration)
  o Else: similar to non-restoring division


● appropriate scaling of B yields

● 3 MSDs of Ri are sufficient for determination of qi’
● Nevertheless, convergence is assured (→ use redundant adder instead of CPA)
● Qi’ needs conversion: SD to 2’s complement by using CPA for qi’

4.5 SRT-Division (cont’d)


2 𝐵 2𝐵 · 2 2 𝑅 2 𝐵 · 2

𝑞101

iff 𝐵 · 2 𝑅

𝐵 · 2 𝑅 𝐵 · 2 𝑅 𝐵 · 2

𝑞101

if 2 𝑅

2 𝑅 2 𝑅 2

4.5 SRT-Division (cont’d)

● Implementation:

[Figure: array of CSAs (add/subtract) operating on A and B with redundant digits qi’; digit selection (≥0) per row; conversion to 2's complement by CPAs delivers Q and R]

Metric: $A = n \cdot A_{CSA} + 2 A_{CPA} = O(n^2)$, $T = n \cdot T_{CSA} + T_{CPA} = O(n + \log n)$

→ Just a little slower than array multiplication
→ State-of-the-art division method

4.6 Multiplicative Division

● So far
  o Add/sub as basic functions
  o Execute n times → $T = O(n)$
  o Linear convergence → +1 valid bit per iteration
● Now
  o Mult as basic function (Goldschmidt 1964, used in IBM 360)
  o Execute $\log n$ times → $T = O(\log n)$
  o Quadratic convergence → doubles valid bits per iteration
● Algorithm

$Q = \frac{A}{B} = \frac{A \cdot R_0 \cdot R_1 \cdots R_k}{B \cdot R_0 \cdot R_1 \cdots R_k}$
Choose $R_i$ so that $B \cdot R_0 \cdots R_k$ converges to 1 → $Q \approx A \cdot R_0 \cdot R_1 \cdots R_k$

Metric: $A = O(n^2)$ → 1 Mult only; $T = O(\log n)$
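A floating-point sketch of the multiplicative scheme follows. The factor choice $R_i = 2 - D_i$ (with $D_i$ the current denominator) is the usual Goldschmidt choice and is an assumption here, since the slide only requires that the denominator converges to 1; the pre-normalization range $0.5 \le B < 1$ and the function name are likewise assumptions.

```python
def goldschmidt_divide(a, b, iterations=5):
    """Approximate a / b for a pre-normalized divisor 0.5 <= b < 1."""
    assert 0.5 <= b < 1.0
    num, den = a, b
    for _ in range(iterations):
        r = 2.0 - den   # assumed factor R_i; den = 1 - e becomes 1 - e**2
        num *= r        # numerator converges to the quotient
        den *= r        # denominator converges to 1
    return num
```

With $D = 1 - e$ the next denominator is $(1 - e)(1 + e) = 1 - e^2$, so the error is squared in every step; this is the quadratic convergence that doubles the number of valid bits per iteration.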

4.7 Evaluation

● Sequential dividers
  o One add/sub unit as hardware
  o Low area
  o Low throughput
● Array dividers
  o n add/sub units as hardware
  o High area, but regular design
  o High throughput
● Multiplicative dividers
  o Reuse of available multiplier
  o Very fast for large n




Part 5: Elementary Functions


● 5.1 Examples and Classification of Algorithms

● 5.2 CORDICo vector rotation, generalization, architectures, redundant numbers

Outline


5.1 Examples and Classification of Algorithms


● Some elementary functions
  o Log
  o $e^x$
  o $x^y$
  o Sin
  o Cos
  o Atan
  o Cosh
  o …



● ROM
  o $A = O(n \cdot 2^n)$
  o $T = ?$ → slow
  o Restricted to functions with one operand and n ≤ 20..24 bit
  o Hard to pipeline

● Polynomialo Taylor Serieso Chebyshev Series

𝐴 𝑂 𝑛 𝑇 ? better convergence less terms Hard to pipeline Excellent for software and big n



● Alternative number systems
  o Logarithmic systems
  o Residue number systems
  o Problem: conversion between systems

[Figure: 2's complement → Conversion → Calculation in alternative number system → Conversion → 2's complement]

● Iteration
  o Newton-Raphson (cf. multiplicative division)
  o Digit-by-digit method (cf. paper-and-pencil division, SRT division) → CORDIC


5.2 CORDIC

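● COordinate Rotation DIgital Computer
  o [Volder 1959, Walther 1971]

5.2.1 Vector Rotation

● Given: $(x_0, y_0)$, $\theta$
● Wanted: $(x, y)$
● Use matrix for rotation in Euclidean space:

$x = x_0 \cdot \cos\theta - y_0 \cdot \sin\theta$
$y = x_0 \cdot \sin\theta + y_0 \cdot \cos\theta$

[Figure: vector $(x_0, y_0)$ rotated by the angle $\theta$ into $(x, y)$]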


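5.2.1 Vector Rotation (cont’d)

● Transformation
$x = \cos\theta \cdot (x_0 - y_0 \cdot \tan\theta) = K \cdot (x_0 - y_0 \cdot \tan\theta)$
$y = \cos\theta \cdot (x_0 \cdot \tan\theta + y_0) = K \cdot (x_0 \cdot \tan\theta + y_0)$
$K = \frac{1}{\sqrt{1 + \tan^2\theta}}$

● Elementary rotation angle
$\theta_i = \arctan 2^{-i} \to \tan\theta_i = 2^{-i}$

● Iteration
$x_{i+1} = K_i \cdot (x_i - 2^{-i} \cdot y_i)$
$y_{i+1} = K_i \cdot (y_i + 2^{-i} \cdot x_i)$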


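5.2.2 Decomposition of Rotation in Elementary Rotation Angles

● $\theta = \sum_i \sigma_i \cdot \theta_i$, depending on direction of rotation $\sigma_i \in \{-1, 1\}$
● Compute $\sigma_i$ using successive Sub/Add (pseudo division)
● Example:

i | $\theta_i = \arctan 2^{-i}$
0 | 45.0°
1 | 26.5°
2 | 14.03°
3 | 7.1°
4 | 3.5°

Iteration i | Angle | Sign | $\sigma_i$
0 | 77° | Positive | $\sigma_0 = 1$
1 | 77° - 45° = 32° | Positive | $\sigma_1 = 1$
2 | 32° - 26.5° = 5.5° | Positive | $\sigma_2 = 1$
3 | 5.5° - 14.03° = -8.53° | Negative | $\sigma_3 = -1$

$\theta = 77° = 45° + 26.5° + 14.03° - 7.1° + \ldots$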
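The pseudo division above can be reproduced with a few lines; the function name `decompose_angle` is hypothetical, and the angles are computed exactly with `math.atan`, so the intermediate values differ slightly from the rounded table entries.

```python
import math


def decompose_angle(theta_deg, n):
    """Return the signs sigma_i for n elementary rotations and the residual angle."""
    z = theta_deg
    sigmas = []
    for i in range(n):
        sigma = 1 if z >= 0 else -1                     # direction of rotation
        sigmas.append(sigma)
        z -= sigma * math.degrees(math.atan(2.0 ** -i))  # subtract or add theta_i
    return sigmas, z
```

`decompose_angle(77, 4)` yields the signs `[1, 1, 1, -1]` from the example; the residual angle shrinks as more iterations are added.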


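5.2.2 Decomposition of Rotation in Elementary Rotation Angles (cont’d)

● Iteration
$z_0 = \theta$, $z_{i+1} = z_i - \sigma_i \cdot \arctan 2^{-i}$
$\sigma_i = 1$ for $z_i \ge 0$, $\sigma_i = -1$ for $z_i < 0$
Goal of iteration: $z_i \to 0$

● What about $K_i$?
$K_i = \frac{1}{\sqrt{1 + \tan^2\theta_i}} = \frac{1}{\sqrt{1 + 2^{-2i}}}$
→ $K_i$ is a constant!

● After n iterations (without considering $K_i$) the vector magnitude is “stretched” by $1/K$, with $K = \prod_i K_i$:
$x' = \frac{1}{K} \cdot x$, $y' = \frac{1}{K} \cdot y$
→ multiply $x'$ and $y'$ by the known scaling factor $K$ to correct the magnitude after the final iteration
→ CSD encoding of $K$ possible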


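5.2.3 Modes of Operation: Rotation and Vectoring

Iteration (scaling factor applied separately):
$x_{i+1} = x_i - \sigma_i \cdot 2^{-i} \cdot y_i$
$y_{i+1} = y_i + \sigma_i \cdot 2^{-i} \cdot x_i$
$z_{i+1} = z_i - \sigma_i \cdot \arctan 2^{-i}$

● “Rotation” mode
  o $z_0 = \theta$, iteration goal $z \to 0$
  o $\sigma_i = 1$ for $z_i \ge 0$, $\sigma_i = -1$ for $z_i < 0$
  o Result: $x = x_0 \cdot \cos\theta - y_0 \cdot \sin\theta$, $y = x_0 \cdot \sin\theta + y_0 \cdot \cos\theta$
● “Vectoring” mode
  o $z_0 = 0$, iteration goal $y \to 0$
  o $\sigma_i = -1$ for $y_i \ge 0$, $\sigma_i = 1$ for $y_i < 0$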
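Both modes can be sketched in one floating-point routine. This is a behavioural model under stated assumptions: circular coordinates only, the scaling factor K accumulated and applied once after the last iteration, and a sign rule in vectoring mode chosen so that y is driven toward 0 for x > 0. The function name `cordic` and its parameters are hypothetical.

```python
import math


def cordic(x, y, z, n=32, mode="rotation"):
    """Circular CORDIC; returns (x, y, z) after n iterations, scaled by K."""
    k = 1.0
    for i in range(n):
        if mode == "rotation":
            sigma = 1 if z >= 0 else -1    # drive z toward 0
        else:
            sigma = -1 if y >= 0 else 1    # vectoring: drive y toward 0
        shift = 2.0 ** -i                  # in hardware: right shift by i bits
        x, y = x - sigma * shift * y, y + sigma * shift * x
        z -= sigma * math.atan(shift)      # elementary angle, from ROM in hardware
        k /= math.sqrt(1.0 + shift * shift)  # accumulate K = prod K_i
    return k * x, k * y, z
```

In rotation mode, `cordic(1.0, 0.0, math.radians(77))` approximates (cos 77°, sin 77°, 0); in vectoring mode, `cordic(3.0, 4.0, 0.0, mode="vectoring")` approximates (5, 0, arctan(4/3)). The sketch only converges for angles within roughly ±99.9°, the sum of all elementary angles.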


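5.2.4 Generalization for other Coordinate Systems [Walther]

$x_{i+1} = x_i - m \cdot \sigma_i \cdot 2^{-i} \cdot y_i$
$y_{i+1} = y_i + \sigma_i \cdot 2^{-i} \cdot x_i$
$z_{i+1} = z_i - \sigma_i \cdot \alpha_{m,i}$

with $m = 1$ → trigonometric (circular), $m = 0$ → linear, $m = -1$ → hyperbolic

● Vector magnitude: $R_{i+1} = R_i \cdot \sqrt{1 + m \cdot \sigma_i^2 \cdot 2^{-2i}}$
● $\alpha_{m,i} = \frac{1}{\sqrt{m}} \arctan(\sqrt{m} \cdot 2^{-i})$
  o $\alpha_{1,i} = \arctan 2^{-i}$
  o $\alpha_{0,i} = 2^{-i}$
  o $\alpha_{-1,i} = \operatorname{artanh} 2^{-i}$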


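5.2.5 Overview of CORDIC Functions

Inputs are $(x, y, z)$ in every case; the entries give the outputs for x, y and z.

Mode m | Rotation ($z_i \to 0$) | Vectoring ($y_i \to 0$)
circular, $m = 1$ | $x \cos z - y \sin z$; $y \cos z + x \sin z$; $0$ | $\sqrt{x^2 + y^2}$; $0$; $z + \arctan\frac{y}{x}$
linear, $m = 0$ | $x$; $y + x \cdot z$; $0$ | $x$; $0$; $z + \frac{y}{x}$
hyperbolic, $m = -1$ | $x \cosh z + y \sinh z$; $y \cosh z + x \sinh z$; $0$ | $\sqrt{x^2 - y^2}$; $0$; $z + \operatorname{artanh}\frac{y}{x}$

● $e^x$, $\log x$, $\ln x$ computable using angle sum identities
● CORDIC provides a nearly universal method for evaluation of elementary functions, yielding one bit accuracy per iteration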


5.2.6 Architectures

5.2.6.1 Recursive

● Small area, needs control logic
● Low throughput, no pipelining
● Variable shift needed → large barrel shifter

Metric: $A \cong 3n$, $T = n \cdot \log n$

5.2.6.2 Pipeline

● Barrel shifter replaced by hard wiring
● ROM → hard wiring
  o (3:1)-MUX for 3 cases of m
● High throughput (times n)
● High area (times n)

Metric: $A \cong 3n^2$, $T = n \cdot \log n$


5.2.6.3 Array

● Low latency● Low throughput


5.2.7 CORDIC and Redundant Number Systems

● Motivation: Avoid carry propagation during addition

● Issue 1: sign detection of redundant numbers
  o approx. of $\sigma_i$ by looking at the $p$ first significant digits of $z_i$ or $y_i$ ($p \ll n$)
  o similar to SRT-Division → $\sigma_i \in \{-1, 0, 1\}$
  o But note: until now $\sigma_i \in \{-1, 1\}$, $m \in \{-1, 0, 1\}$

● Issue 2: variable scaling for $\sigma_i = 0$
$K_i = \frac{1}{\sqrt{1 + m \cdot \sigma_i^2 \cdot 2^{-2i}}}$


5.2.7 CORDIC and Redundant Number Systems (cont’d)

● Solutions to avoid variable scaling when $\sigma_i = 0$
  o Constant scaling
    → Defined direction of rotation when $\sigma_i = 0$; e.g., $\sigma_i = 1$ → small error
    → After a defined number of iterations repeat an iteration, e.g. $p = 3$ → repeat each 5th iteration
    → Convergence guaranteed
    → Standard method
  o Double rotation
    → Instead of one rotation by $\alpha_i$ do two rotations by about $\alpha_i / 2$ (→ $\arctan(2^{-i-1})$):
      $\sigma_i = \pm 1$ → $\pm 2 \cdot \arctan 2^{-i-1}$
      $\sigma_i = 0$ → $\arctan 2^{-i-1} - \arctan 2^{-i-1}$
    → Scaling factor is constant, double rotations → twice area/time
  o True variable scaling factor → multiplier needed
