An Efficient Softcore Multiplier Architecture for Xilinx FPGAs
IEEE Symposium on Computer Arithmetic
22 June 2015
Martin Kumm, Shahid Abbas and Peter Zipf
University of Kassel, Germany
CONTENTS
1. State-of-the-art
2. Proposed multiplier
3. Results
WHY FPGA SOFTCORE MULTIPLIERS?
The need for efficient multipliers forced FPGA vendors to embed hard multiplier blocks
FPGA softcore multipliers are still required:
Small word sizes (poor mapping to embedded multipliers)
Large word sizes ("fill gaps")
Replace embedded mults on small/low-cost FPGAs
Research on efficient multipliers has been ongoing for more than 50 years
Efficient multipliers in terms of gates may not be efficient on FPGAs
FPGA optimized structures are relatively rare
WHY ARE THEY DIFFERENT?
[Figure: Xilinx slice, 6/7 series]
PREVIOUS WORK
[Figure: four LUTs feeding the slice carry logic, with one full adder highlighted]
A Baugh-Wooley-like multiplier was proposed in [Parandeh-Afshar 2011]
Two partial products are generated and added using the carry chain
A compression tree is still necessary to add the already reduced partial products (PPs)
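The idea of adding two partial products per row and then summing the reduced rows can be sketched in Python. This is only an illustrative model, not the circuit from [Parandeh-Afshar 2011]: it assumes unsigned operands for simplicity (Baugh-Wooley targets signed numbers), and the names `pairwise_rows` and `multiply` are hypothetical.

```python
def pairwise_rows(a: int, b: int, m_bits: int) -> list[int]:
    """Model one reduced row per pair of partial products.

    Assumes unsigned a and b, with b < 2**m_bits.
    """
    rows = []
    for i in range(0, m_bits, 2):
        pp_lo = a * ((b >> i) & 1)        # partial product for bit i
        pp_hi = a * ((b >> (i + 1)) & 1)  # partial product for bit i+1
        # One addition per pair, standing in for the carry chain of a row.
        rows.append((pp_lo + (pp_hi << 1)) << i)
    return rows


def multiply(a: int, b: int, m_bits: int) -> int:
    # sum() stands in for the compression tree over the reduced rows.
    return sum(pairwise_rows(a, b, m_bits))
```

In this model a 4-bit `b` yields two reduced rows instead of four partial products, which is why a compression tree is still needed afterwards.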
PREVIOUS WORK
Another idea was discussed in [Brunie 2013]:
Decompose multiplication into small multipliers that fit into single LUTs, e.g., 3x3, 2x3, 1x4
Use a compression tree to add partial results
$$
\begin{aligned}
p ={} & M_1 + 2^{3} M_2 + 2^{6} M_3 + \dots \\
      & \dots + 2^{3} M_4 + 2^{6} M_5 + 2^{9} M_6 + \dots \\
      & \dots + 2^{6} M_7 + 2^{9} M_8 + 2^{12} M_9
\end{aligned}
$$
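The decomposition above can be checked with a short Python sketch. It assumes unsigned operands and a uniform 3x3 tiling, and `tiled_multiply` is a hypothetical name; the plain `+=` only stands in for the compression tree.

```python
def tiled_multiply(a: int, b: int, n_bits: int, tile: int = 3) -> int:
    """Multiply by summing shifted tile x tile sub-products (unsigned)."""
    mask = (1 << tile) - 1
    p = 0
    for i in range(0, n_bits, tile):      # tile offset within a
        for j in range(0, n_bits, tile):  # tile offset within b
            # One small multiplier, small enough to fit LUTs.
            small = ((a >> i) & mask) * ((b >> j) & mask)
            p += small << (i + j)         # weight 2^(i+j)
    return p
```

For 9-bit operands the nine tiles get the weights 2^0, 2^3, 2^6, 2^3, 2^6, 2^9, 2^6, 2^9 and 2^12, matching the sum for M1 to M9.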
BOOTH RECODING
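$$a \cdot b = \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} a \cdot BE_m\, 2^m$$

b_{m+1}  b_m  b_{m-1}  |  BE_m  |  z_m  c_m  s_m
   0      0      0     |    0   |   1    0    0
   0      0      1     |    1   |   0    0    0
   0      1      0     |    1   |   0    0    0
   0      1      1     |    2   |   0    0    1
   1      0      0     |   -2   |   0    1    1
   1      0      1     |   -1   |   0    1    0
   1      1      0     |   -1   |   0    1    0
   1      1      1     |    0   |   1    0    0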
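A minimal Python sketch of the recoding table may help to check it. It assumes an unsigned `b` with b_{-1} = 0 and zero bits above b_{M-1}, and reads the control signals as zero (z), complement (c) and shift (s), which is an interpretation of the column names. The function names are hypothetical.

```python
def booth_digits(b: int, m_bits: int) -> list[tuple[int, int, int, int]]:
    """Return (BE_m, z_m, c_m, s_m) for m = 0, 2, ..., up to m_bits."""
    b2 = b << 1  # prepend b_{-1} = 0, so bit k of b2 is b_{k-1}
    digits = []
    for m in range(0, m_bits + 1, 2):
        b_prev = (b2 >> m) & 1        # b_{m-1}
        b_cur = (b2 >> (m + 1)) & 1   # b_m
        b_next = (b2 >> (m + 2)) & 1  # b_{m+1}
        be = -2 * b_next + b_cur + b_prev
        z = int(be == 0)        # digit is zero
        c = int(be < 0)         # digit is negative
        s = int(abs(be) == 2)   # digit has magnitude 2
        digits.append((be, z, c, s))
    return digits


def booth_multiply(a: int, b: int, m_bits: int) -> int:
    # Sum of a * BE_m * 2^m over even m, as in the formula above.
    return sum(a * be * (1 << (2 * k))
               for k, (be, _, _, _) in enumerate(booth_digits(b, m_bits)))
```

For example, `booth_digits(13, 4)` gives the digits 1, -1, 1 for m = 0, 2, 4, and 1 - 4 + 16 = 13.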
BOOTH MULTIPLIER
[Figure: Booth multiplier partial-product array for operand b, rows ordered from LSB to MSB, with sign bits c0, c2, c4 and c6; a second build replaces the replicated sign bits with constant 1s]
PROPOSED ARCHITECTURE
[Figure: proposed slice mapping; LUTs with multiplexers feed the carry logic, with one full adder highlighted]
RESULTS
The number of slices can be precisely predicted:
$$\#\text{slices}(M,N) = \underbrace{\lceil N/4 + 1 \rceil}_{\text{slices per row}} \cdot \underbrace{\lfloor M/2 + 1 \rfloor}_{\text{no. of rows}}$$
Design was implemented as generic VHDL
A pipelined multiplier can be obtained by using the (otherwise unused) slice FFs without much additional cost
Reference circuits (Parandeh-Afshar & LUT-based) were designed with the FloPoCo library [de Dinechin 2012]
Xilinx Coregen was used as a commercial reference
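The slice-count prediction on this slide can be evaluated directly. The sketch below uses integer arithmetic for the ceiling and floor; the function name `predicted_slices` is hypothetical.

```python
def predicted_slices(m_bits: int, n_bits: int) -> int:
    """Evaluate #slices(M, N) = ceil(N/4 + 1) * floor(M/2 + 1)."""
    slices_per_row = -(-n_bits // 4) + 1  # ceil(N/4) + 1
    rows = m_bits // 2 + 1                # floor(M/2) + 1
    return slices_per_row * rows


if __name__ == "__main__":
    for size in (8, 16, 32, 64):
        print(size, predicted_slices(size, size))
```

For an 8x8 multiplier the formula gives 3 slices per row and 5 rows, i.e. 15 slices.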
RESULTS VIRTEX 6 COMBINATORIAL, SLICES
[Plot: #Slices (0 to 2,000) versus input word size N (8 to 64). Curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
RESULTS VIRTEX 6 COMBINATORIAL, SLICE RED.
[Plot: Slice reduction (%) (0 to 80) versus input word size N (8 to 64). Curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed)]
RESULTS VIRTEX 6 COMBINATORIAL, FREQ.
[Plot: Frequency [MHz] (0 to 700) versus input word size N (8 to 64). Curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
RESULTS VIRTEX 6 PIPELINED, SLICES
[Plot: #Slices (0 to 2,000) versus input word size N (8 to 64). Curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
RESULTS VIRTEX 6 PIPELINED, SLICE RED.
[Plot: Slice reduction (%) (-10 to 80) versus input word size N (8 to 64). Curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed)]
RESULTS VIRTEX 6 PIPELINED, FREQ.
[Plot: Frequency [MHz] (0 to 700) versus input word size N (8 to 64). Curves: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]
UNFORTUNATELY NOT POSSIBLE ON ALTERA FPGAS
[Figure: Altera ALM]
MAYBE POSSIBLE NEXT?
CONCLUSION
Compared to the best known design, up to
50% of the slices can be saved for the combinatorial multiplier
30% of the slices can be saved for the pipelined multiplier
Portable to FPGAs providing a 5-input LUT at one full adder input
"Free addition" supports multiply-accumulate (MAC) operation
LITERATURE
[Parandeh-Afshar 2011]: Parandeh-Afshar & Ienne, "Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs", FPL 2011
[Brunie 2013]: Brunie, de Dinechin, Istoan, Sergent, Illyes & Popa, "Arithmetic Core Generation Using Bit Heaps", FPL 2013
[de Dinechin 2012]: de Dinechin & Pasca, "Designing Custom Arithmetic Data Paths with FloPoCo", IEEE Design & Test of Computers, 2012
THANK YOU!
BOOTH RECODING
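$$
\begin{aligned}
b &= b_{M-1} 2^{M-1} + \dots + b_2 2^2 + b_1 2^1 + b_0 \\
  &= b_{M-1} 2^{M-1} + \dots + b_2 2^2 + 2 b_1 2^1 + \underbrace{-b_1 2^1 + b_0}_{BE_0 = -2 b_1 + b_0} \\
  &= b_{M-1} 2^{M-1} + \dots + 2 b_3 2^3 + \underbrace{-b_3 2^3 + b_2 2^2 + 2 b_1 2^1}_{BE_2 2^2 = (-2 b_3 + b_2 + b_1) 2^2} + BE_0 \\
  &= \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} BE_m 2^m \quad \text{with } BE_m = -2 b_{m+1} + b_m + b_{m-1}
\end{aligned}
$$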
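As a worked example of the recoding, take b = 13 = 1101 in binary with M = 4. This assumes b is unsigned, so b_{-1} and all bits above b_3 are zero.

$$
\begin{aligned}
BE_0 &= -2 b_1 + b_0 + b_{-1} = -2 \cdot 0 + 1 + 0 = 1 \\
BE_2 &= -2 b_3 + b_2 + b_1 = -2 \cdot 1 + 1 + 0 = -1 \\
BE_4 &= -2 b_5 + b_4 + b_3 = -2 \cdot 0 + 0 + 1 = 1 \\
\sum_{m \in \{0,2,4\}} BE_m 2^m &= 1 \cdot 2^0 - 1 \cdot 2^2 + 1 \cdot 2^4 = 1 - 4 + 16 = 13
\end{aligned}
$$

The digit BE_4 shows why the sum runs up to m = M: for an unsigned operand the top digit absorbs the doubled b_3 term.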
WHY ARE THEY DIFFERENT?
[Figure: Altera ALM]
[Figure: slice storage elements (FF/LAT with D, CE, CK, SR and Q ports; INIT0/INIT1 and SRLO/SRHI options) and LUT inputs D6:1]