Distributed Arithmetic Dr Sumam David S. Dept. of E&C, NITK Surathkal Courtesy for slides – Xilinx Professor’s Workshop Resources


Objective

Distributed arithmetic: What? Where? How?

What is DA?

Multiplication using look-up tables (LUTs)

Used to implement multipliers in LUT-rich FPGAs

Two's Complement Multiplication

One bit at a time:

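SDA 1-Tap FIR Filter

[Figure: N-bit-wide sample data X0 passes through a parallel-to-serial converter and drives address line A0 of the partial product ROM. The ROM output feeds a scaling accumulator (+/-, Z^-1). The legend marks sign extension.]

LUT contains two locations: 0000...0 and C0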
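The 1-tap structure can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions, not the Xilinx implementation: the name `sda_one_tap` and its parameters are hypothetical, the sample is fed least significant bit first, and the LUT output is shifted left by the bit weight, where a hardware scaling accumulator would more likely shift the running sum instead.

```python
def sda_one_tap(c0, x0, bits):
    """Multiply c0 by the two's complement sample x0, one sample bit per clock."""
    lut = [0, c0]                    # the two LUT locations: 0000...0 and C0
    acc = 0
    for b in range(bits):
        bit = (x0 >> b) & 1          # serial bit from the parallel-to-serial converter
        partial = lut[bit] << b      # LUT output scaled by the bit weight 2**b
        if b == bits - 1:
            acc -= partial           # the sign bit has weight -2**(bits-1)
        else:
            acc += partial
    return acc


product = sda_one_tap(-7, 7, bits=4)   # C0 = 1001, X0 = 0111
```

With 4-bit operands the loop runs four times, one per sample bit, and only the last pass subtracts.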

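Bit weights: -2^3 2^2 2^1 2^0

C0 = 1001 (-7), X0 = 0111 (7)
Partial products, starting from the least significant bit of X0: 1001, 1001, 1001, 0000
C0 × X0 = 1100 1111 (-49)

C1 = 0110 (6), X1 = 0101 (5)
Partial products, starting from the least significant bit of X1: 0110, 0000, 0110, 0000
C1 × X1 = 0001 1110 (30)

Sums of the equal-weight partial products:
Weight 2^0: 1001 + 0110 = 1111, contributing -1
Weight 2^1: 1001 + 0000 = 1001, contributing -14
Weight 2^2: 1001 + 0110 = 1111, contributing -4
Weight -2^3: 0000 + 0000 = 0000, contributing 0
Total: 1110 1101 (-19)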

(Serial-Data / Tap-Parallel Multiply)

Distributed Arithmetic for a 2-Tap Filter

Partial products of equal weight are added together before being summed with the partial products of the next-higher weight

Create look-up table of summed partial products
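The regrouping of partial products by weight can be written out explicitly. The notation here is an assumption introduced for illustration: K taps with coefficients C_k, B-bit two's complement samples x_k, and x_{k,b} for bit b of sample x_k.

```latex
y = \sum_{k=0}^{K-1} C_k x_k ,
\qquad
x_k = -2^{B-1} x_{k,B-1} + \sum_{b=0}^{B-2} 2^{b} x_{k,b}

y = -2^{B-1} \sum_{k=0}^{K-1} C_k x_{k,B-1}
    + \sum_{b=0}^{B-2} 2^{b} \sum_{k=0}^{K-1} C_k x_{k,b}
```

Each inner sum over k depends only on the K bits x_{0,b}, ..., x_{K-1,b}, so it can take only 2^K values. Those values are what the look-up table stores.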


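LUT contains all possible sums of the partial products:

A1 A0   LUT contents
0  0    0000...0
0  1    C0
1  0    C1
1  1    C0 + C1

[Figure: serial sample streams X0 and X1 drive address lines A0 and A1 of the partial product ROM. The ROM output feeds a scaling accumulator (+/-, Z^-1).]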
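The 2-tap case can be sketched in Python as follows. The helper names `build_lut` and `da_dot` are hypothetical, and the model shifts the LUT output left by the bit weight rather than reproducing the hardware scaling accumulator. It uses the 4-bit values C0 = -7, C1 = 6, X0 = 7, X1 = 5.

```python
def build_lut(coeffs):
    """Entry addr holds the sum of coeffs[k] over every set bit k of addr."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_dot(coeffs, samples, bits):
    """Sum of coeffs[k] * samples[k], computed one sample bit position per clock."""
    lut = build_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Bit b of sample k drives address line A_k of the LUT.
        addr = sum(((x >> b) & 1) << k for k, x in enumerate(samples))
        term = lut[addr] << b
        acc += -term if b == bits - 1 else term   # subtract on the sign bit
    return acc


lut = build_lut([-7, 6])                  # addresses 00, 01, 10, 11
result = da_dot([-7, 6], [7, 5], bits=4)
```

The table has 2^2 = 4 entries, and the accumulation takes one pass per sample bit regardless of the number of taps.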

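[Figure: serial inputs X0 to X3 on address lines A0 to A3. The diagram shows two-location LUTs holding 0000...0 and C0, C1, C2 or C3, combined by adders (+), together with a partial product ROM and a scaling accumulator (+/-, Z^-1).]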

SDA 8-Tap FIR Filter

[Figure: N-bit-wide sample data. Serial inputs X0 to X3 drive address lines A0 to A3 of one partial product ROM, and X4 to X7 drive A0 to A3 of a second. A pre-adder sums the two ROM outputs ahead of the scaling accumulator (+/-, Z^-1).]

Each 4-input LUT contains all possible sums of the partial products
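A rough Python model of the 8-tap arrangement, with the taps split into two groups of four, is given below. The function names, coefficients and sample values are all hypothetical and chosen only for illustration. The point of the sketch is that two 4-input LUTs plus a pre-adder give the same result as one 8-input LUT while storing far fewer words.

```python
def build_lut(coeffs):
    """Entry addr holds the sum of coeffs[k] over every set bit k of addr."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def sda_fir_output(coeffs, samples, bits, group=4):
    """One FIR output, using one LUT per `group` taps and a pre-adder."""
    luts = [build_lut(coeffs[i:i + group]) for i in range(0, len(coeffs), group)]
    acc = 0
    for b in range(bits):
        total = 0
        for g, lut in enumerate(luts):
            chunk = samples[g * group:(g + 1) * group]
            addr = sum(((x >> b) & 1) << k for k, x in enumerate(chunk))
            total += lut[addr]                   # pre-adder sums the LUT outputs
        term = total << b
        acc += -term if b == bits - 1 else term  # subtract on the sign bit
    return acc


coeffs = [1, -2, 3, -4, 5, -6, 7, -8]      # hypothetical coefficients
samples = [7, -8, 3, 0, -1, 5, -3, 2]      # hypothetical 4-bit samples
y = sda_fir_output(coeffs, samples, bits=4)

split_words = 2 * (1 << 4)    # two 4-input LUTs: 32 words
single_words = 1 << 8         # one 8-input LUT: 256 words
```

Passing `group=8` builds the single 256-word table instead and returns the same value.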

Xilinx DA FIR Performance

fclk = 200 MHz for both processor and FPGA

B = data sample precision for FPGA

[Figure: two plots against Filter Length (Taps), 0 to 250. One shows Performance (MMACs/s), 0 to 6000, for Dual MAC and for the Serial FPGA FIR with DA FIR B=8, B=12 and B=16. The other shows Sample Rate (MSPS), 0 to 60, for Single MAC and for the Serial FPGA FIR with DA FIR B=8, B=12 and B=16.]

Trade Clock Cycles for Logic Area

The sample is serialized and processed 1 bit per clock cycle. 8 clock cycles are thus required to process the whole sample.

The sample is serialized and processed 2 bits per clock cycle. 4 clock cycles are thus required to process the whole sample.

The sample is serialized and processed 4 bits per clock cycle. 2 clock cycles are thus required to process the whole sample.

The sample is processed in parallel, 8 bits per clock cycle. 1 clock cycle is thus required to process the whole sample.

[Figure: processing of sample bits b0 to b7, ranging from Serial-DA through multiple bits per clock cycle to Parallel-DA.]

Serial-DA: Hardware Over-sampling = 8, 20 Ms/s

Parallel-DA: Hardware Over-sampling = 1, 160 Ms/s
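The trade between clock cycles and logic can be illustrated with a small Python model. It is a sketch under assumptions: the names `build_lut` and `da_dot_multibit` and the data values are hypothetical, and the inner loop stands in for the LUT copies that hardware would evaluate side by side in a single clock.

```python
def build_lut(coeffs):
    """Entry addr holds the sum of coeffs[k] over every set bit k of addr."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_dot_multibit(coeffs, samples, bits, bits_per_clock):
    """Return (result, clock cycles) when bits_per_clock bits are handled per clock."""
    if bits % bits_per_clock:
        raise ValueError("bits must be a multiple of bits_per_clock")
    lut = build_lut(coeffs)
    acc, clocks = 0, 0
    for base in range(0, bits, bits_per_clock):        # one pass = one clock cycle
        for b in range(base, base + bits_per_clock):   # parallel LUT copies in hardware
            addr = sum(((x >> b) & 1) << k for k, x in enumerate(samples))
            term = lut[addr] << b
            acc += -term if b == bits - 1 else term    # subtract on the sign bit
        clocks += 1
    return acc, clocks


coeffs, samples = [-7, 6], [100, -128]     # hypothetical 8-bit data
results = {n: da_dot_multibit(coeffs, samples, 8, n) for n in (1, 2, 4, 8)}
```

For 8-bit samples the cycle count falls from 8 to 4, 2 and 1 as `bits_per_clock` rises, while the number of LUT copies needed per clock rises by the same factor.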



Conclusion

Efficiency of computation

Slow, as it is bit-serial

Memory requirements

References

The role of Distributed Arithmetic in FPGA based signal processing, www.xilinx.com
