EE 382 Processor Design Winter 98/99 Michael Flynn 1

AT Arithmetic

• Most effort has gone into creating fast implementations of (especially) FP arithmetic.

• Under the AT (area-time) rule, area is (almost) as important.

• So it’s important to know the latency, bandwidth and area that any particular algorithm requires.

Integer addition

• Adders are the fundamental building block of the processor, defining the cycle time, t.

• Adder types include
  – carry chain, carry select (conditional sum), carry lookahead (Brent-Kung), canonic (prefix), carry skip, Ling

• Most high-speed 32b adders take about the same area (f normalized): 1 A to 1.5 A

Integer addition

• Both area and time scale with n, the adder precision. The delay, t, scales slowly (as log n).

• Area scales about linearly with n, so a 64b adder takes 2-3 A but still fits into t, maybe by definition of a “cycle”.
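
To make the log n delay concrete, here is a minimal Python sketch of a parallel-prefix adder. The Kogge-Stone arrangement and the name `prefix_add` are assumptions chosen for brevity, not something the slides specify; the sketch only shows that the number of carry-combining levels grows as log2 n.

```python
def prefix_add(a: int, b: int, n: int = 32):
    """Add two n-bit integers with a Kogge-Stone style prefix carry network.

    Returns (sum mod 2**n, number of prefix levels used).
    """
    mask = (1 << n) - 1
    g = a & b & mask           # per-bit generate
    p = (a ^ b) & mask         # per-bit propagate
    p0 = p                     # original propagate, needed for the sum bits
    levels = 0
    dist = 1
    while dist < n:
        # Merge each group with the group that ends `dist` bits below it.
        g = g | ((p & (g << dist)) & mask)
        p = p & (p << dist) & mask
        dist <<= 1
        levels += 1
    carries = (g << 1) & mask  # carry into bit i = generate of bits [0, i-1]
    return (p0 ^ carries) & mask, levels
```

With n = 32 the loop runs 5 times and with n = 64 it runs 6 times, so doubling the precision adds one level of logic.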

Carry skip adder

Manchester carry chain

Carry skip logic

Carry select addition

FP addition

• A basic FP adder has 5 steps
  – exponent difference, pre align, significand add, post align, and round.

• Assuming that a full shifter has about the same complexity (delay and area) as an add, 64b FP addition takes 7-10 A and has an execution time of about 5 t.
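
The five steps can be sketched in Python as follows. This is an illustrative model, not the slides' design: the operand encoding `(sign, exp, sig)`, the constants `P` and `GRS`, and the helper `shift_right_sticky` are all assumptions, and special values (infinities, NaNs, subnormals, overflow) are ignored.

```python
P = 53    # significand bits including the leading 1 (assumed, double-like)
GRS = 3   # guard, round and sticky bits kept below the significand

def shift_right_sticky(x: int, d: int) -> int:
    """Shift right by d, OR-ing any bits shifted out into the LSB (sticky)."""
    if d <= 0:
        return x
    lost = x & ((1 << d) - 1)
    return (x >> d) | (1 if lost else 0)

def fp_add(a, b):
    """a, b are (sign, exp, sig) with value (-1)**sign * sig * 2**exp and
    2**(P-1) <= sig < 2**P. Returns the rounded sum in the same form
    (an exact zero is returned as (0, 0, 0))."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    # 1) exponent difference: make `a` the operand with the larger exponent
    if eb > ea:
        (sa, ea, ma), (sb, eb, mb) = (sb, eb, mb), (sa, ea, ma)
    d = ea - eb
    # 2) pre-align: shift the smaller operand right, keeping G, R and sticky
    ma <<= GRS
    mb = shift_right_sticky(mb << GRS, d)
    # 3) significand add (a subtract when the signs differ)
    total = (-ma if sa else ma) + (-mb if sb else mb)
    sign, mag = (1, -total) if total < 0 else (0, total)
    if mag == 0:
        return (0, 0, 0)
    exp = ea - GRS
    # 4) post-align: renormalize to exactly P + GRS bits
    width = mag.bit_length()
    if width > P + GRS:
        mag = shift_right_sticky(mag, width - (P + GRS))
        exp += width - (P + GRS)
    else:
        mag <<= (P + GRS) - width
        exp -= (P + GRS) - width
    # 5) round to nearest even using the G, R and sticky bits
    low, sig = mag & 7, mag >> GRS
    exp += GRS
    if low > 4 or (low == 4 and sig & 1):
        sig += 1
        if sig == 1 << P:      # rounding carried out into a new bit
            sig >>= 1
            exp += 1
    return (sign, exp, sig)
```

Steps 2 and 4 are the two shifts and steps 3 and 5 are the two carry-propagate additions, which is where the "about 5 t" estimate comes from when a shifter is costed like an adder.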

FP addition

Advanced FP adders are faster and use more area:

1) Two-path FADD creates separate paths for operands:

• a path for operands whose exponents are close in value (subtract); this is the only case where a full shift is needed to renormalize the result

• a path for other cases, where the exponent difference is > 2 (this is the only case that uses a full shift to prealign significands)

2) A FADD with integrated rounding. Here the rounding step is eliminated by computing both the sum/difference and the result plus 1; this is done by using 2 adders (or a compound adder) and then MUXing out the final result.
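
A minimal sketch of the integrated-rounding idea, under simplifying assumptions that are mine rather than the slides': only the aligned (smaller) operand carries `g` bits below the result LSB, the operation is an add, no renormalization is needed, and rounding is to nearest even. The name `add_integrated_round` is hypothetical.

```python
def add_integrated_round(a_sig: int, b_aligned: int, g: int) -> int:
    """Rounded sum of a_sig and b_aligned / 2**g without a separate
    rounding add. Requires g >= 1."""
    b_hi = b_aligned >> g                  # bits at or above the result LSB
    b_lo = b_aligned & ((1 << g) - 1)      # bits below the result LSB
    s0 = a_sig + b_hi                      # adder 1: sum
    s1 = a_sig + b_hi + 1                  # adder 2: sum + 1, in parallel
    half = 1 << (g - 1)
    round_up = b_lo > half or (b_lo == half and (s0 & 1) == 1)
    return s1 if round_up else s0          # the final MUX
```

In hardware `s0` and `s1` come out of the two adders (or one compound adder) at the same time, so the rounding decision costs only a MUX delay instead of a further carry-propagate add.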

FP adders

• The two-path FP adder uses an additional significand adder and exponent adder, about 3-4 A. It reduces FADD delay by one t.

• Integrated rounding adds another rounding adder plus a MUX, another 3-4 A, while reducing delay by another t.

FP adders

• Net area-time tradeoff
  – Basic: area 10 A and delay 4-5 t
  – Two path: area 13.5 A and delay 3-4 t
  – Integrated round (with two paths): area 17 A and delay 2-3 t
  – For pipelining, add 1 A per pipe stage and use the upper range on t
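
As a quick check of the tradeoff under the AT rule, the area-delay product for each design can be computed from the figures above. The dictionary layout and function names are illustrative assumptions, and the number of pipe stages is left as a parameter because the slides do not fix it.

```python
# (area in A, (min delay, max delay) in t), taken from the figures above
FADD_DESIGNS = {
    "basic": (10.0, (4, 5)),
    "two path": (13.5, (3, 4)),
    "integrated round": (17.0, (2, 3)),
}

def at_range(name: str):
    """Return (best, worst) area-time product for a design, in A*t."""
    area, (t_min, t_max) = FADD_DESIGNS[name]
    return (area * t_min, area * t_max)

def pipelined(name: str, stages: int):
    """Apply the pipelining rule: +1 A per stage, upper end of the delay range."""
    area, (_, t_max) = FADD_DESIGNS[name]
    return (area + stages, t_max)

for name in FADD_DESIGNS:
    print(name, at_range(name))
```

The three products overlap heavily (roughly 34 to 54 A·t), which suggests the faster adders buy their speed at close to proportional area cost.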

Multipliers

• After add, the most important arithmetic op

• Approaches
  – encode the multiplier bits (Booth 2, Booth 3...)
  – assimilate the partial products
      • one, two or n pass (iterated arrays or trees)
      • arrays (simple, double, higher level)
      • trees (Wallace, binary [4:2], ZD, ...)
  – CPA to produce product

Multipliers

• Integer and FP multipliers usually have about the same execution time (with same precision, n)

• Booth encoding reduces the number of partial products (pp’s) but adds MUXs to generate them.

• Most of the area, and probably delay too, is in the pp reduction tree.
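
The Booth 2 trade described above (half as many pp's, each chosen by a MUX) can be sketched as follows. This assumes a signed two's-complement multiplier with an even bit count; the function names are hypothetical.

```python
def booth2_digits(y: int, n: int) -> list:
    """Radix-4 (Booth 2) digits of an n-bit two's-complement multiplier,
    least significant digit first. n is assumed even."""
    y &= (1 << n) - 1
    digits = []
    prev = 0                                  # implicit bit to the right of bit 0
    for i in range(0, n, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)     # one of -2, -1, 0, 1, 2
        prev = b1
    return digits

def booth2_multiply(x: int, y: int, n: int = 16) -> int:
    """Sum the n/2 partial products; each digit selects 0, +-x or +-2x."""
    return sum(d * x * 4 ** i for i, d in enumerate(booth2_digits(y, n)))
```

For a 16-bit multiplier this yields 8 partial products instead of 16, and every one of them is a shift and/or negation of x, which is what the selector MUXs implement.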

Page 14: EE 382 Processor DesignWinter 98/99Michael Flynn 1 AT Arithmetic Most concern has gone into creating fast implementation of (especially) FP Arith. Under

EE 382 Processor Design Winter 98/99 Michael Flynn 14

16 bit Booth 2 multiply

Page 15: EE 382 Processor DesignWinter 98/99Michael Flynn 1 AT Arithmetic Most concern has gone into creating fast implementation of (especially) FP Arith. Under

EE 382 Processor Design Winter 98/99 Michael Flynn 15

16 bit Booth 2 example

Page 16: EE 382 Processor DesignWinter 98/99Michael Flynn 1 AT Arithmetic Most concern has gone into creating fast implementation of (especially) FP Arith. Under

EE 382 Processor Design Winter 98/99 Michael Flynn 16

16 bit Booth 2 pp selector logic

Page 17: EE 382 Processor DesignWinter 98/99Michael Flynn 1 AT Arithmetic Most concern has gone into creating fast implementation of (especially) FP Arith. Under

EE 382 Processor Design Winter 98/99 Michael Flynn 17

16 bit Booth 3 multiply

Page 18: EE 382 Processor DesignWinter 98/99Michael Flynn 1 AT Arithmetic Most concern has gone into creating fast implementation of (especially) FP Arith. Under

EE 382 Processor Design Winter 98/99 Michael Flynn 18

5 x 5 unsigned multiplication

Page 19: EE 382 Processor DesignWinter 98/99Michael Flynn 1 AT Arithmetic Most concern has gone into creating fast implementation of (especially) FP Arith. Under

EE 382 Processor Design Winter 98/99 Michael Flynn 19

1-bit adder

Page 20: EE 382 Processor DesignWinter 98/99Michael Flynn 1 AT Arithmetic Most concern has gone into creating fast implementation of (especially) FP Arith. Under

EE 382 Processor Design Winter 98/99 Michael Flynn 20

Wallace tree

Page 21: EE 382 Processor DesignWinter 98/99Michael Flynn 1 AT Arithmetic Most concern has gone into creating fast implementation of (especially) FP Arith. Under

EE 382 Processor Design Winter 98/99 Michael Flynn 21

Wallace tree reduction
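
A small model of the reduction, offered as a sketch rather than a reconstruction of the slide's figure: rows of partial products are combined three at a time by carry-save adders (CSAs) until two rows remain, and a final carry-propagate add produces the product. Here a CSA operates on whole words, so the count is in word-wide CSAs, and the function names are assumptions.

```python
def csa(a: int, b: int, c: int):
    """3:2 carry-save adder on whole words: a + b + c == s + cy."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy

def wallace_reduce(rows):
    """Reduce non-negative addends to at most two rows using layers of CSAs.

    Returns (remaining_rows, levels, csa_count)."""
    rows = list(rows)
    levels = 0
    count = 0
    while len(rows) > 2:
        full = len(rows) // 3 * 3
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(csa(rows[i], rows[i + 1], rows[i + 2]))
            count += 1
        nxt.extend(rows[full:])     # leftover rows pass to the next level
        rows = nxt
        levels += 1
    return rows, levels, count

# 5 x 5 unsigned example: one partial product row per multiplier bit
x, y = 27, 19
pps = [(x << i) if (y >> i) & 1 else 0 for i in range(5)]
rows, levels, count = wallace_reduce(pps)
product = sum(rows)                 # the final CPA
```

Each level removes about a third of the rows, so the number of levels grows roughly as log base 3/2 of the number of partial products.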

Multipliers

• A full tree implementation of a 54b (FP type) multiplier with Booth 2 has tree height 28 and uses about 2500 CSAs (or about 50 A in the tree). Maybe a total of 10 A in MUXs plus 50 A in the tree and 3 A in the CPA, about 63 A total. The fastest multiplier is, maybe, 2 t.

• Using a 2 pass tree reduces the hardware considerably; height is 14 using about 700 CSAs or 14 A; total area 5 + 14 + 3 = 22 A; 3-4 t.

Multipliers

• To pipeline the Multiplier we need a full tree implementation; probably 3-4 t.

• Perhaps Booth 3, followed by a full tree (h = 17) and a CPA stage.

• Probably area = 50 - 60A

Divide

• Infrequent op, but long latency can affect IPC achieved.

• Algorithms:
  – SRT 2 or 3 bit (32-36 t), maybe 6-10 A
  – NR or binomial expansion (10-14 t); needs at least 6 A for table and control plus use of MPY
  – Bipartite tables for small n (less than 24b)

Divide

SRT creates the quotient 2 or 3 bits per iteration
  – uses a divisor / partial remainder lookup table for the trial quotient, then subtracts
  – result (partial rem.) is in redundant form so no restoration is needed; also the result is left as a sum and carry pair (no CPA needed)
  – fast iteration is possible, sometimes 2x per t
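
The recurrence behind SRT can be shown with a deliberately simplified sketch. It differs from the scheme on the slide in three ways, all assumptions made for brevity: it retires one quotient digit per iteration (radix 2) rather than 2 or 3 bits, it picks the digit by comparison instead of a lookup table, and it keeps the partial remainder as an exact value rather than a sum and carry pair. What it does show is the redundant digit set {-1, 0, 1}, which is why no restoring step is needed.

```python
from fractions import Fraction

def srt2_divide(x: Fraction, d: Fraction, n: int):
    """Radix-2 SRT division sketch.

    Requires 1/2 <= d < 1 and |x| <= d. Returns (digits, q, w) such that
    x == q * d + w / 2**n with |w| <= d.
    """
    assert Fraction(1, 2) <= d < 1 and abs(x) <= d
    half = Fraction(1, 2)
    w = x                        # partial remainder
    q = Fraction(0)
    digits = []
    for j in range(1, n + 1):
        w2 = 2 * w
        if w2 >= half:
            digit = 1
        elif w2 < -half:
            digit = -1
        else:
            digit = 0            # no subtraction needed this step
        w = w2 - digit * d
        digits.append(digit)
        q += Fraction(digit, 2 ** j)
    return digits, q, w
```

Because the selection only compares against ±1/2, a real implementation can decide from a few leading bits of a carry-save remainder, which is what makes the iteration fast.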

Divide

Multiply-based methods use either Newton-Raphson or the binomial series
  – if $f(x) = b - 1/x$, the root is at $x = 1/b$; then the NR iteration is $x_{i+1} = x_i (2 - b x_i)$
  – convergence is quadratic: the precision of the result doubles each iteration
  – so start with a table lookup of $1/b$ to 8b; then 3 iterations give a 64b result, and $a \times (1/b)$ is the quotient
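
A minimal sketch of Newton-Raphson reciprocal division, using the standard iteration x ← x(2 − b·x). The seed "table" is modeled by rounding 1/b to 8 fractional bits, which is an assumption standing in for a real lookup table, and Python floats cap the achievable precision at 53 bits, so the 64b case is not reproduced exactly.

```python
def nr_reciprocal(b: float, table_bits: int = 8, iterations: int = 3) -> float:
    """Approximate 1/b for 1 <= b < 2 with Newton-Raphson."""
    scale = 1 << table_bits
    x = round(scale / b) / scale       # stand-in for the seed table lookup
    for _ in range(iterations):
        x = x * (2.0 - b * x)          # relative error is squared each pass
    return x

def nr_divide(a: float, b: float) -> float:
    """Quotient a / b as a * (1/b), for 1 <= b < 2."""
    return a * nr_reciprocal(b)
```

Each pass costs two multiplies, which is why this method depends on sharing the multiplier and lands at around 10-14 t.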

Divide

• Divide is not usually pipelined, except for small n implementations.

• Frequently combined with square root in the same implementation.

Sub word concurrency

• Provides 8, 16, 32b concurrent ops within “existing” integer or FP hardware

• In 64b integer unit can do 8x8, or 4x16, or 2x32 ops concurrently

• Since FP units are designed to be faster, maybe use them: 8x4, or 2x16, or 2x24.
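
The 8x8 case can be emulated in software to show what the hardware has to do, namely stop carries at sub-word boundaries. This is a sketch with hypothetical helper names; real hardware breaks the carry chain directly rather than masking.

```python
H = 0x8080808080808080   # top bit of every 8-bit lane
L = 0x7F7F7F7F7F7F7F7F   # low 7 bits of every lane

def add_8x8(a: int, b: int) -> int:
    """Eight independent wraparound 8-bit adds inside one 64-bit word."""
    low = (a & L) + (b & L)          # 7-bit adds: carries stay inside a lane
    return low ^ ((a ^ b) & H)       # add the top bits without a carry out

def pack(lanes):
    """Pack eight 8-bit values into a 64-bit word, lane 0 in the low byte."""
    return sum((v & 0xFF) << (8 * i) for i, v in enumerate(lanes))

def unpack(word):
    """Split a 64-bit word into eight 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]
```

For example, `unpack(add_8x8(pack([250] * 8), pack([10] * 8)))` gives eight lanes of 4, since each lane wraps around modulo 256 without disturbing its neighbor.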

Sub word concurrency

• Usually only for add and multiply

• Implementations are straightforward for add; more complicated for multiply
  – requires reorganizing partitions of the pp tree
  – affects multiply area and delay marginally (maybe 10% delay and 20% area)

• The ISA must define “saturating” arithmetic.
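
Saturating arithmetic clamps a result to the largest or smallest representable value instead of wrapping around. A short sketch for 8-bit sub-words follows; the function names and the choice of 8-bit lanes are illustrative assumptions, since the actual definition is up to the ISA.

```python
def sat_add_u8(a: int, b: int) -> int:
    """Unsigned 8-bit saturating add: clamps at 255."""
    return min(a + b, 255)

def sat_add_s8(a: int, b: int) -> int:
    """Signed 8-bit saturating add: clamps to the range [-128, 127]."""
    return max(-128, min(127, a + b))

def packed_sat_add_u8(xs, ys):
    """Apply the unsigned saturating add lane by lane."""
    return [sat_add_u8(x, y) for x, y in zip(xs, ys)]
```

With wraparound, 250 + 10 in an unsigned 8-bit lane gives 4; with saturation it gives 255, which is usually the preferred behavior for pixel and audio data.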