Distributed Arithmetic
Dr Sumam David S.
Dept. of E&C, NITK Surathkal
Courtesy for slides – Xilinx Professor’s Workshop Resources
Objective
Distributed arithmetic: What? Where? How?
What is DA?
Multiplication using LUTs
Used to implement multipliers in LUT-rich FPGAs
Twos Complement Multiplication
One bit at a time:
SDA 1-Tap FIR Filter
[Figure: N-bit-wide sample data X0 enters a parallel-to-serial converter; each serial bit drives address A0 of the partial product ROM, whose output feeds a scaling accumulator (+/-, Z^-1).]
LUT contains two locations: 0000...0 and C0
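Bit weights: -2^3 2^2 2^1 2^0 (partial products are sign-extended)

C0 = 1 0 0 1 (-7)
X0 = 0 1 1 1 ( 7)
Partial products of C0, one per bit of X0, LSB first: 1 0 0 1, 1 0 0 1, 1 0 0 1, 0 0 0 0
C0 x X0 = 1 1 0 0 1 1 1 1 (-49)

C1 = 0 1 1 0 ( 6)
X1 = 0 1 0 1 ( 5)
Partial products of C1, one per bit of X1, LSB first: 0 1 1 0, 0 0 0 0, 0 1 1 0, 0 0 0 0
C1 x X1 = 0 0 0 1 1 1 1 0 ( 30)

Sums of equal-weight partial products, LSB first: 1 1 1 1, 1 0 0 1, 1 1 1 1, 0 0 0 0
Weighted sums: (-1) + (-14) + (-4) + (0) = (-19) = 1 1 1 0 1 1 0 1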
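The one-bit-at-a-time multiplication in the worked example can be sketched in Python as follows. This is an illustrative model, not the Xilinx implementation: the names `serial_multiply` and `as_bits` are hypothetical, and the partial product is shifted left where a hardware scaling accumulator would typically shift its own contents instead (the arithmetic is the same).

```python
def serial_multiply(coeff, sample, bits):
    """Multiply coeff by a two's complement sample of width `bits`, one bit per step."""
    lut = [0, coeff]                          # the two LUT locations: 0000...0 and C0
    word = sample & ((1 << bits) - 1)         # raw two's complement bit pattern
    acc = 0
    for i in range(bits):                     # LSB first
        partial = lut[(word >> i) & 1] << i   # partial product at weight 2^i
        if i == bits - 1:
            acc -= partial                    # the sign bit carries weight -2^(bits-1)
        else:
            acc += partial
    return acc


def as_bits(value, width):
    """Two's complement bit string of `value`, `width` bits wide."""
    return format(value & ((1 << width) - 1), "0{}b".format(width))


print(serial_multiply(-7, 7, 4), as_bits(serial_multiply(-7, 7, 4), 8))  # -49 11001111
print(serial_multiply(6, 5, 4), as_bits(serial_multiply(6, 5, 4), 8))    # 30 00011110
```

The subtraction on the last step is what the +/- control of the scaling accumulator provides.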
(Serial-Data / Tap-Parallel Multiply)
Distributed Arithmetic for a 2-Tap Filter
Partial products of equal weight are added together before being summed to next higher partial product weight
Create look-up table of summed partial products
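A minimal sketch of these two steps, using the coefficients and samples from the worked example (C0 = -7, C1 = 6, X0 = 7, X1 = 5). The function names `build_da_lut` and `da_dot` are hypothetical, and the address convention (bit k of the address comes from sample k) is an assumption made for illustration.

```python
def build_da_lut(coeffs):
    """Entry at address a is the sum of coeffs[k] over every set bit k of a."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_dot(coeffs, samples, bits):
    """Return (sum of coeffs[k] * samples[k], weighted sums per bit position)."""
    lut = build_da_lut(coeffs)
    mask = (1 << bits) - 1
    weighted = []
    for i in range(bits):                          # one bit position per step, LSB first
        addr = 0
        for k, x in enumerate(samples):
            addr |= (((x & mask) >> i) & 1) << k   # bit i of sample k becomes address bit k
        term = lut[addr] << i                      # equal-weight partial products, already summed
        weighted.append(-term if i == bits - 1 else term)  # sign bit is subtracted
    return sum(weighted), weighted


print(build_da_lut([-7, 6]))         # [0, -7, 6, -1]
print(da_dot([-7, 6], [7, 5], 4))    # (-19, [-1, -14, -4, 0])
```

The weighted sums match the (-1), (-14), (-4), (0) terms of the example, and no multiplier is used: only a table look-up, a shift, and an add or subtract per bit.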
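SDA 2-Tap FIR Filter
[Figure: N-bit-wide sample data X0 and X1 are serialized; their bits drive addresses A0 and A1 of the partial product ROM, whose output feeds a scaling accumulator (+/-, Z^-1).]
LUT contains all possible sums of the partial products:
A1 A0 = 00: 0000...0
A1 A0 = 01: C0
A1 A0 = 10: C1
A1 A0 = 11: C0 + C1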
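SDA 4-Tap FIR Filter
[Figure: N-bit-wide sample data X0 to X3 are serialized; their bits drive addresses A0 to A3 of the partial product ROM, which combines the per-tap entries (0000...0 or C0, 0000...0 or C1, 0000...0 or C2, 0000...0 or C3) through adders; the ROM output feeds a scaling accumulator (+/-, Z^-1).]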
SDA 8-Tap FIR Filter
[Figure: N-bit-wide sample data X0 to X7 are serialized. Bits of X0 to X3 address one partial product ROM (A0 to A3) and bits of X4 to X7 address a second partial product ROM (A0 to A3); a pre-adder sums the two ROM outputs and feeds a scaling accumulator (+/-, Z^-1).]
Each 4-input LUT contains all possible sums of the partial products
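Splitting the eight taps across two 4-input LUTs and a pre-adder can be sketched as below. The names `partitioned_da_dot` and `lut_words` are hypothetical, and the word counts ignore word width; the point is only that two 16-entry tables replace a single 256-entry table.

```python
def build_lut(coeffs):
    """Every possible sum of the given coefficients, indexed by address bits."""
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(1 << len(coeffs))]


def lut_words(taps, group):
    """Total LUT entries when `taps` taps are split into groups of `group` inputs."""
    groups = -(-taps // group)            # ceiling division
    return groups * (1 << group)


def partitioned_da_dot(coeffs, samples, bits, group=4):
    """Serial DA dot product using one small LUT per group of taps plus a pre-adder."""
    luts = [build_lut(coeffs[g:g + group]) for g in range(0, len(coeffs), group)]
    mask = (1 << bits) - 1
    acc = 0
    for i in range(bits):
        pre_add = 0
        for g, lut in enumerate(luts):
            addr = 0
            for k, x in enumerate(samples[g * group:(g + 1) * group]):
                addr |= (((x & mask) >> i) & 1) << k
            pre_add += lut[addr]          # pre-adder combines the ROM outputs
        term = pre_add << i
        acc += -term if i == bits - 1 else term
    return acc


coeffs = [3, -1, 4, 1, -5, 9, 2, -6]
samples = [12, -100, 7, 127, -128, 0, 55, -1]   # 8-bit two's complement range
assert partitioned_da_dot(coeffs, samples, 8) == sum(c * x for c, x in zip(coeffs, samples))
print(lut_words(8, 4), lut_words(8, 8))          # 32 256
```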
Xilinx DA FIR Performance
fclk = 200 MHz for both processor and FPGA
B = data sample precision for FPGA
[Figure, Serial FPGA FIR: Performance (MMACs/s) versus Filter Length (Taps, 0 to 250) for Dual MAC, DA FIR B=8, DA FIR B=12, and DA FIR B=16.]
[Figure, Serial FPGA FIR: Sample Rate (MSPS) versus Filter Length (Taps, 0 to 250) for Single MAC, DA FIR B=8, DA FIR B=12, and DA FIR B=16.]
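A rough, idealized model of the quantities on these axes is sketched below. It rests on two assumptions that are not taken from the plotted data: a MAC-based processor completes one multiply-accumulate per MAC unit per clock, and a serial DA filter needs exactly B clocks per sample regardless of filter length. The function names are hypothetical, and the model is not meant to reproduce the curves exactly.

```python
def mac_sample_rate_msps(fclk_mhz, taps, macs=1):
    """Assumed MAC processor: each output sample costs `taps` multiply-accumulates."""
    return fclk_mhz * macs / taps


def da_sample_rate_msps(fclk_mhz, bits):
    """Assumed serial DA filter: one sample bit per clock, independent of tap count."""
    return fclk_mhz / bits


def mmacs_per_s(sample_rate_msps, taps):
    """Equivalent multiply-accumulate throughput for a filter of `taps` taps."""
    return sample_rate_msps * taps


fclk = 200.0  # MHz, as stated for both processor and FPGA
for taps in (10, 50, 250):
    print(taps,
          mac_sample_rate_msps(fclk, taps),
          da_sample_rate_msps(fclk, 8),
          mmacs_per_s(da_sample_rate_msps(fclk, 8), taps))
```

Under these assumptions the MAC processor's sample rate falls as the filter grows, while the DA filter's sample rate depends only on B, so its equivalent MAC throughput rises with filter length.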
Trade Clock Cycles for Logic Area
From Serial-DA (20 Ms/s) to Parallel-DA (160 Ms/s): multiple bits per clock cycle
Hardware Over-sampling = 8: The sample is serialized and processed 1 bit per clock cycle. 8 clock cycles are thus required to process the whole sample.
Hardware Over-sampling = 4: The sample is serialized and processed 2 bits per clock cycle. 4 clock cycles are thus required to process the whole sample.
Hardware Over-sampling = 2: The sample is serialized and processed 4 bits per clock cycle.
Hardware Over-sampling = 1: The sample is processed in parallel, 8 bits per clock cycle.
[Figure: 8-bit sample b0 to b7, split into groups of 1, 2, 4, or 8 bits per clock cycle.]
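The trade-off can be sketched as follows. Processing several bits per clock is modelled as several LUT look-ups in the same cycle, which stands in for the replicated logic. The 160 MHz clock is an assumption inferred from the 20 Ms/s and 160 Ms/s figures for 8-bit samples, and all function names are hypothetical.

```python
def build_lut(coeffs):
    """Every possible sum of the coefficients, indexed by address bits."""
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(1 << len(coeffs))]


def clocks_per_sample(bits, bits_per_clock):
    """Hardware over-sampling factor: clock cycles needed per input sample."""
    return -(-bits // bits_per_clock)     # ceiling division


def sample_rate(fclk, bits, bits_per_clock):
    """Sample rate in the same unit as fclk (MHz in, Ms/s out)."""
    return fclk / clocks_per_sample(bits, bits_per_clock)


def da_dot_multibit(coeffs, samples, bits, bits_per_clock):
    """DA dot product handling `bits_per_clock` bit positions per clock cycle."""
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    acc, clocks = 0, 0
    for base in range(0, bits, bits_per_clock):       # one iteration = one clock
        clocks += 1
        for i in range(base, min(base + bits_per_clock, bits)):   # parallel LUT copies
            addr = 0
            for k, x in enumerate(samples):
                addr |= (((x & mask) >> i) & 1) << k
            term = lut[addr] << i
            acc += -term if i == bits - 1 else term
    return acc, clocks


coeffs, samples = [3, -5, 7, 2], [100, -128, -1, 55]
for bpc in (1, 2, 4, 8):
    print(bpc, da_dot_multibit(coeffs, samples, 8, bpc), sample_rate(160.0, 8, bpc))
```

The result is the same in every configuration; only the number of clock cycles, and with it the amount of parallel look-up logic, changes.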
Conclusion
Efficiency of computation
Slow, as it is bit-serial
Memory requirements
References
The role of Distributed Arithmetic in FPGA based signal processing, www.xilinx.com