Source: martin-kumm.de/slides/2015_06_22_ARITH.pdf

An Efficient Softcore Multiplier Architecture for Xilinx FPGAs

IEEE Symposium on Computer Arithmetic

22 June 2015

Martin Kumm, Shahid Abbas and Peter Zipf
University of Kassel, Germany


CONTENTS

1. State-of-the-art

2. Proposed multiplier

3. Results

WHY FPGA SOFTCORE MULTIPLIERS?

The need for efficient multipliers forced FPGA vendors to embed hard multiplier blocks

FPGA softcore multipliers are still required:

Small word sizes (worse mapping for embedded mults)

Large word sizes ("fill gaps")

Replace embedded mults on small/low-cost FPGAs

WHY ARE THEY DIFFERENT?

Research on efficient multipliers has been ongoing for more than 50 years

Multipliers that are efficient in terms of gates may not be efficient on FPGAs

FPGA-optimized structures are relatively rare


[Figure: Xilinx slice, 6/7 series]

PREVIOUS WORK

[Figure: four LUTs feeding the slice carry logic]

A Baugh-Wooley-like multiplier was proposed in [Parandeh-Afshar 2011]

Two partial products are generated and added using the carry chain

A compression tree is still necessary to add the already reduced partial products (PPs)
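The pairing idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it handles unsigned operands (the Baugh-Wooley scheme of the cited work targets two's complement), the function name `pairwise_pp_multiply` is hypothetical, and a plain `sum` stands in for the compression tree.

```python
def pairwise_pp_multiply(a: int, b: int, m_bits: int) -> int:
    """Unsigned sketch: add partial products in pairs, then sum the reduced rows.

    Assumes 0 <= b < 2**m_bits.
    """
    reduced_rows = []
    for i in range(0, m_bits, 2):
        pp_lo = a * ((b >> i) & 1)        # partial product for bit i of b
        pp_hi = a * ((b >> (i + 1)) & 1)  # partial product for bit i+1 of b
        # One addition per pair of partial products (the carry-chain step).
        reduced_rows.append((pp_lo + (pp_hi << 1)) << i)
    # Stand-in for the compression tree that adds the reduced rows.
    return sum(reduced_rows)


print(pairwise_pp_multiply(13, 11, 4))  # 143
```

For an 8-bit `b` this leaves four reduced rows instead of eight partial products, which is why a compression tree is still needed afterwards.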


PREVIOUS WORK

Another idea was discussed in [Brunie 2013]:

Decompose the multiplication into small multipliers that fit into single LUTs, e.g., 3x3, 2x3, 1x4

Use a compression tree to add partial results

$$
\begin{aligned}
p = {} & M_1 + 2^{3} M_2 + 2^{6} M_3 + \dots \\
       & \dots + 2^{3} M_4 + 2^{6} M_5 + 2^{9} M_6 + \dots \\
       & \dots + 2^{6} M_7 + 2^{9} M_8 + 2^{12} M_9
\end{aligned}
$$
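A minimal Python sketch of this decomposition, assuming unsigned operands and square 3x3 tiles; the name `tiled_multiply` and its parameters are hypothetical, and the running sum stands in for the compression tree.

```python
def tiled_multiply(a: int, b: int, n_bits: int, tile: int = 3) -> int:
    """Multiply two unsigned n_bits-wide numbers from tile x tile products."""
    mask = (1 << tile) - 1
    total = 0
    for i in range(0, n_bits, tile):        # chunk offset within a
        for j in range(0, n_bits, tile):    # chunk offset within b
            a_chunk = (a >> i) & mask
            b_chunk = (b >> j) & mask
            small_product = a_chunk * b_chunk   # one small multiplier M_k
            total += small_product << (i + j)   # weight 2^(i+j)
    return total


print(tiled_multiply(300, 411, 9))  # 123300
```

For 9-bit operands the nine tiles receive the weights 2^0, 2^3, 2^6, 2^3, 2^6, 2^9, 2^6, 2^9 and 2^12, matching the terms M1 to M9 above.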

BOOTH RECODING

$$
a \cdot b = \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} a \cdot BE_m \, 2^{m}
$$

b_{m+1}  b_m  b_{m-1}   BE_m   z_m  c_m  s_m
   0      0      0        0     1    0    0
   0      0      1        1     0    0    0
   0      1      0        1     0    0    0
   0      1      1        2     0    0    1
   1      0      0       -2     0    1    1
   1      0      1       -1     0    1    0
   1      1      0       -1     0    1    0
   1      1      1        0     1    0    0
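The recoding table can be reproduced with a short Python sketch. The assumptions here are not stated on the slide: b is treated as unsigned with b_{-1} = 0 and zero bits above position M-1, and z, c and s are read as "zero", "complement" and "shift" flags, which is consistent with every row of the table. The function names are hypothetical.

```python
def booth_digits(b: int, m_bits: int):
    """Yield (m, BE_m, z_m, c_m, s_m) for even m = 0, 2, ... up to m_bits.

    Assumes 0 <= b < 2**m_bits.
    """
    for m in range(0, m_bits + 1, 2):
        b_hi = (b >> (m + 1)) & 1
        b_mid = (b >> m) & 1
        b_lo = (b >> (m - 1)) & 1 if m > 0 else 0   # b_{-1} = 0
        be = -2 * b_hi + b_mid + b_lo               # digit in {-2, ..., 2}
        z = int(be == 0)        # row contributes nothing
        c = int(be < 0)         # row is negated
        s = int(abs(be) == 2)   # row uses 2a instead of a
        yield m, be, z, c, s


def booth_multiply(a: int, b: int, m_bits: int) -> int:
    """Sum the recoded rows a * BE_m * 2^m."""
    return sum(a * be * (1 << m) for m, be, _, _, _ in booth_digits(b, m_bits))


print([be for _, be, _, _, _ in booth_digits(0b0110, 4)])  # [-2, 2, 0]
print(booth_multiply(13, 6, 4))                            # 78
```

With M = 8 the generator yields five digits, so the number of rows is roughly halved compared with one partial product per bit of b.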

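BOOTH MULTIPLIER

[Figure: partial product array of the Booth multiplier from LSB to MSB; each row is sign-extended by repeated copies of its c bit (c0, c2, c4, c6) and the rows are added to b]

[Figure: the same array with compacted sign extension, where constant 1s replace most of the repeated c bits]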

PROPOSED ARCHITECTURE

[Figure: proposed mapping onto a slice: LUTs feeding the carry logic, with one LUT and its carry-logic stage marked as a full adder]

RESULTS

The number of slices can be precisely predicted:

$$
\#\text{slices}(M,N) = \underbrace{\lceil N/4 + 1 \rceil}_{\text{slices per row}} \cdot \underbrace{\lfloor M/2 + 1 \rfloor}_{\text{no. of rows}}
$$

The design was implemented as generic VHDL

A pipelined multiplier can be obtained by using the (otherwise unused) slice FFs without much additional cost

Reference circuits (Parandeh-Afshar & LUT-based) were designed with the FloPoCo library [de Dinechin 2012]

Xilinx Coregen was used as a commercial reference
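The slice-count formula is easy to evaluate directly. The following sketch uses integer arithmetic only; the function name `predicted_slices` is hypothetical, and the printed values are evaluations of the formula, not measured results.

```python
def predicted_slices(m_bits: int, n_bits: int) -> int:
    """Evaluate #slices(M, N) = ceil(N/4 + 1) * floor(M/2 + 1)."""
    slices_per_row = -(-n_bits // 4) + 1   # ceil(N/4) + 1
    rows = m_bits // 2 + 1                 # floor(M/2) + 1
    return slices_per_row * rows


for size in (8, 16, 32, 64):
    print(size, predicted_slices(size, size))
# 8 15
# 16 45
# 32 153
# 64 561
```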

RESULTS VIRTEX 6 COMBINATORIAL, SLICES

[Figure: number of slices (0 to 2,000) versus input word size N (8 to 64) for: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

RESULTS VIRTEX 6 COMBINATORIAL, SLICE REDUCTION

[Figure: slice reduction in % (0 to 80) versus input word size N (8 to 64), shown for: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Coregen (area), Coregen (speed)]

RESULTS VIRTEX 6 COMBINATORIAL, FREQUENCY

[Figure: frequency in MHz (0 to 700) versus input word size N (8 to 64) for: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

RESULTS VIRTEX 6 PIPELINED, SLICES

[Figure: number of slices (0 to 2,000) versus input word size N (8 to 64) for: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Coregen (area), Coregen (speed), proposed]

RESULTS VIRTEX 6 PIPELINED, SLICE REDUCTION

[Figure: slice reduction in % (-10 to 80) versus input word size N (8 to 64), shown for: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Coregen (area), Coregen (speed)]

RESULTS VIRTEX 6 PIPELINED, FREQUENCY

[Figure: frequency in MHz (0 to 700) versus input word size N (8 to 64) for: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

UNFORTUNATELY NOT POSSIBLE ON ALTERA FPGAS

[Figure: Altera ALM]

MAYBE POSSIBLE NEXT?

CONCLUSION

Compared to the best known design:

Up to 50% of the slices can be saved for the combinatorial multiplier

Up to 30% of the slices can be saved for the pipelined multiplier

Portable to FPGAs providing a 5-input LUT at one full adder input

"Free addition" supports multiply-accumulate (MAC) operation


LITERATURE

[Parandeh-Afshar 2011]: Parandeh-Afshar & Ienne, Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011

[Brunie 2013]: Brunie, de Dinechin, Istoan, Sergent, Illyes & Popa, Arithmetic Core Generation Using Bit Heaps, FPL 2013

[de Dinechin 2012]: de Dinechin & Pasca, Designing Custom Arithmetic Data Paths with FloPoCo, IEEE Design & Test of Computers, 2012

THANK YOU!


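BOOTH RECODING

$$
\begin{aligned}
b &= b_{M-1} 2^{M-1} + \dots + b_2 2^{2} + b_1 2^{1} + b_0 \\
  &= b_{M-1} 2^{M-1} + \dots + b_2 2^{2} + 2 b_1 2^{1} + \underbrace{-b_1 2^{1} + b_0}_{BE_0 = -2 b_1 + b_0} \\
  &= b_{M-1} 2^{M-1} + \dots + 2 b_3 2^{3} + \underbrace{-b_3 2^{3} + b_2 2^{2} + 2 b_1 2^{1}}_{BE_2 2^{2} = (-2 b_3 + b_2 + b_1) 2^{2}} + BE_0 \\
  &= \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} BE_m \, 2^{m}
     \quad \text{with} \quad BE_m = -2 b_{m+1} + b_m + b_{m-1}
\end{aligned}
$$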
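As a worked check of this derivation, take the illustrative value b = 11 = 1011 in binary with M = 4 (chosen here, not taken from the slides), so b_0 = 1, b_1 = 1, b_2 = 0, b_3 = 1, and assume b_{-1} = 0 and b_4 = b_5 = 0:

$$
\begin{aligned}
BE_0 &= -2 b_1 + b_0 + b_{-1} = -2 + 1 + 0 = -1 \\
BE_2 &= -2 b_3 + b_2 + b_1 = -2 + 0 + 1 = -1 \\
BE_4 &= -2 b_5 + b_4 + b_3 = 0 + 0 + 1 = 1 \\
\sum_{\substack{m=0 \\ m\ \text{even}}}^{4} BE_m \, 2^{m} &= (-1) \cdot 2^{0} + (-1) \cdot 2^{2} + 1 \cdot 2^{4} = -1 - 4 + 16 = 11
\end{aligned}
$$

The sum reproduces b with three digits instead of four bits; the digit at m = M is needed only because b is unsigned and its top bit must be compensated.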


[Figure: Altera ALM]

[Figure: slice storage elements (FF/LAT) with D, CE, CK and SR inputs, Q output, and INIT0/INIT1/SRLO/SRHI attributes]
