An Efficient Softcore Multiplier Architecture for Xilinx FPGAs IEEE Symposium on Computer Arithmetic 22 June 2015 Martin Kumm, Shahid Abbas and Peter Zipf University of Kassel, Germany

Slides: martin-kumm.de/slides/2015_06_22_ARITH.pdf

  • An Efficient Softcore Multiplier Architecture for Xilinx FPGAs

    IEEE Symposium on Computer Arithmetic

    22 June 2015

    Martin Kumm, Shahid Abbas and Peter Zipf

    University of Kassel, Germany

  • CONTENTS

    1. State-of-the-art

    2. Proposed multiplier

    3. Results

  • WHY FPGA SOFTCORE MULTIPLIERS?

    The need for efficient multipliers led FPGA vendors to embed hard multiplier blocks

    FPGA softcore multipliers are still required for:

    Small word sizes (which map poorly onto embedded multipliers)

    Large word sizes (to "fill gaps")

    Replacing embedded multipliers on small/low-cost FPGAs

  • WHY THEY ARE DIFFERENT?

    Research on efficient multipliers has been ongoing for more than 50 years

    Multipliers that are efficient in terms of gates may not be efficient on FPGAs

    FPGA-optimized structures are relatively rare

  • WHY THEY ARE DIFFERENT?

    [Figure: Xilinx slice, 6/7 series]

  • PREVIOUS WORK

    [Figure: four LUTs connected to the slice carry logic; one full adder is highlighted]

    A Baugh-Wooley-like multiplier was proposed in [Parandeh-Afshar 2011]

    Two partial products are generated and added using the carry chain

    A compression tree is still necessary to add the already reduced partial products
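The idea of generating two partial products and adding them on the carry chain can be sketched behaviourally. This is only an illustration for unsigned operands, not the signed Baugh-Wooley handling of [Parandeh-Afshar 2011]. The function name `add_two_rows_on_carry_chain` is hypothetical, and the split into a LUT-computed propagate signal and a carry multiplexer is an assumption based on the usual behaviour of a Xilinx carry chain.

```python
def add_two_rows_on_carry_chain(a: int, b_lo: int, b_hi: int, n: int) -> int:
    """Return a*b_lo + 2*a*b_hi for an unsigned n-bit a and single bits b_lo, b_hi.

    Assumed mapping: one LUT plus one carry stage per result bit.
    """
    carry = 0
    result = 0
    for i in range(n + 1):
        a_i = (a >> i) & 1 if i < n else 0
        a_im1 = (a >> (i - 1)) & 1 if i >= 1 else 0
        x = a_i & b_lo        # bit i of the first partial product
        y = a_im1 & b_hi      # bit i of the second partial product (shifted by one)
        s = x ^ y             # LUT output, used as the propagate signal
        result |= (s ^ carry) << i          # sum bit from the carry-chain XOR
        carry = carry if s else x           # carry multiplexer: propagate or generate
    result |= carry << (n + 1)              # final carry-out is the top result bit
    return result
```

Each loop iteration needs only four inputs (two neighbouring bits of `a` and two bits of `b`), which is why such a stage fits into a single LUT followed by the carry logic.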

  • PREVIOUS WORK

    Another idea was discussed in [Brunie 2013]:

    Decompose the multiplication into small multipliers that fit into single LUTs, e.g., 3x3, 2x3, 1x4

    Use a compression tree to add the partial results

    $$
    \begin{aligned}
    p = {} & M_1 + 2^3 M_2 + 2^6 M_3 + \ldots \\
           & \ldots + 2^3 M_4 + 2^6 M_5 + 2^9 M_6 + \ldots \\
           & \ldots + 2^6 M_7 + 2^9 M_8 + 2^{12} M_9
    \end{aligned}
    $$
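The decomposition into small multipliers can be illustrated with a short behavioural sketch. It assumes unsigned operands and replaces the compression tree with a plain software sum; the name `tiled_multiply` and its parameters are hypothetical and are not taken from [Brunie 2013].

```python
def tiled_multiply(a: int, b: int, n_bits: int, tile_a: int = 3, tile_b: int = 3) -> int:
    """Multiply two unsigned n_bits-wide integers using tile_a x tile_b sub-multipliers."""
    mask_a = (1 << tile_a) - 1
    mask_b = (1 << tile_b) - 1
    p = 0
    for j in range(0, n_bits, tile_b):            # walk over chunks of b
        b_chunk = (b >> j) & mask_b
        for i in range(0, n_bits, tile_a):        # walk over chunks of a
            a_chunk = (a >> i) & mask_a
            small_product = a_chunk * b_chunk     # one small multiplier M_k
            p += small_product << (i + j)         # weight 2^(i+j), summed in place of the compression tree
    return p
```

For 9-bit operands and 3x3 tiles, the weights `i + j` take the values 0, 3, 6, 3, 6, 9, 6, 9, 12, which are the exponents in the sum over M_1 to M_9.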

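  • BOOTH RECODING

    $$
    a \cdot b = \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} a \cdot BE_m 2^m
    $$

    b_{m+1}  b_m  b_{m-1} | BE_m | z_m  c_m  s_m
       0      0      0    |   0  |  1    0    0
       0      0      1    |   1  |  0    0    0
       0      1      0    |   1  |  0    0    0
       0      1      1    |   2  |  0    0    1
       1      0      0    |  -2  |  0    1    1
       1      0      1    |  -1  |  0    1    0
       1      1      0    |  -1  |  0    1    0
       1      1      1    |   0  |  1    0    0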
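The recoding table can be checked with a small behavioural model. It assumes an unsigned operand b with b_{-1} = 0 and zero bits above the word size. The readings of z_m (digit is zero), c_m (digit is negative) and s_m (digit has magnitude 2) are inferred from the table rather than stated on the slide, and all function names are hypothetical.

```python
def booth_digits(b: int, m_bits: int) -> list:
    """Radix-4 Booth digits BE_m for m = 0, 2, ... of an unsigned m_bits-wide b."""
    def bit(k: int) -> int:
        # b_{-1} = 0, and bits at or above m_bits are 0 for an unsigned operand
        return (b >> k) & 1 if 0 <= k < m_bits else 0
    return [-2 * bit(m + 1) + bit(m) + bit(m - 1) for m in range(0, m_bits + 1, 2)]


def booth_controls(be: int) -> tuple:
    """Control bits (z, c, s) as read off the recoding table."""
    z = int(be == 0)          # digit is zero
    c = int(be < 0)           # digit is negative
    s = int(abs(be) == 2)     # digit has magnitude 2
    return z, c, s


def booth_multiply(a: int, b: int, m_bits: int) -> int:
    """Sum of a * BE_m * 2^m over all even m."""
    return sum(a * d * (1 << (2 * k)) for k, d in enumerate(booth_digits(b, m_bits)))
```

Each digit selects one of the multiples 0, ±a or ±2a, so only about half as many partial-product rows are needed as in a plain AND-array multiplier.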

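  • BOOTH MULTIPLIER

    [Figure: partial-product array of the Booth multiplier for operand b, from LSB to MSB, with the sign extension of each row formed by repeated c0, c2, c4 and c6 bits]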

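  • BOOTH MULTIPLIER

    [Figure: partial-product array of the Booth multiplier for operand b, from LSB to MSB, with shortened sign extension in which constant 1s replace most of the repeated c bits]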

  • PROPOSED ARCHITECTURE

    [Figure: one slice of the proposed multiplier, with four LUTs feeding the carry logic; one full adder is highlighted]

  • RESULTS

    The number of slices can be precisely predicted:

    $$
    \#\text{slices}(M,N) = \underbrace{\lceil N/4 + 1 \rceil}_{\text{slices per row}} \cdot \underbrace{\lfloor M/2 + 1 \rfloor}_{\text{no. of rows}}
    $$

    The design was implemented in generic VHDL

    A pipelined multiplier can be obtained by using the (otherwise unused) slice FFs without much additional cost

    Reference circuits (Parandeh-Afshar & LUT-based) were designed with the FloPoCo library [de Dinechin 2012]

    Xilinx Coregen was used as a commercial reference
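The slice-count formula is easy to evaluate for concrete word sizes. The helper below is a minimal sketch that uses integer arithmetic for the ceiling and floor; the name `predicted_slices` is hypothetical.

```python
def predicted_slices(m: int, n: int) -> int:
    """Evaluate #slices(M, N) = ceil(N/4 + 1) * floor(M/2 + 1)."""
    slices_per_row = -(-n // 4) + 1   # ceil(N/4) + 1, equal to ceil(N/4 + 1)
    rows = m // 2 + 1                 # floor(M/2) + 1, equal to floor(M/2 + 1)
    return slices_per_row * rows


for size in (8, 16, 32, 64):
    print(size, predicted_slices(size, size))
```

For square multipliers this gives 15 slices at 8 bits, 45 at 16 bits, 153 at 32 bits and 561 at 64 bits.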

  • RESULTS VIRTEX 6 COMBINATORIAL, SLICES

    [Figure: #slices (0 to 2,000) vs. input word size N (8 to 64) for the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area), Coregen (speed) and the proposed multiplier]

  • RESULTS VIRTEX 6 COMBINATORIAL, SLICE RED.

    [Figure: slice reduction in % (0 to 80) vs. input word size N (8 to 64) for the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area) and Coregen (speed)]

  • RESULTS VIRTEX 6 COMBINATORIAL, FREQ.

    [Figure: frequency in MHz (0 to 700) vs. input word size N (8 to 64) for the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area), Coregen (speed) and the proposed multiplier]

  • RESULTS VIRTEX 6 PIPELINED, SLICES

    [Figure: #slices (0 to 2,000) vs. input word size N (8 to 64) for the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area), Coregen (speed) and the proposed multiplier]

  • RESULTS VIRTEX 6 PIPELINED, SLICE RED.

    [Figure: slice reduction in % (-10 to 80) vs. input word size N (8 to 64) for the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area) and Coregen (speed)]

  • RESULTS VIRTEX 6 PIPELINED, FREQ.

    [Figure: frequency in MHz (0 to 700) vs. input word size N (8 to 64) for the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area), Coregen (speed) and the proposed multiplier]

  • UNFORTUNATELY NOT POSSIBLE ON ALTERA FPGAS

    [Figure: Altera ALM]

  • MAYBE POSSIBLE NEXT?


  • CONCLUSION

    Compared to the best known design, up to

    50% of the slices can be saved for the combinatorial multiplier

    30% of the slices can be saved for the pipelined multiplier

    Portable to FPGAs that provide a 5-input LUT at one full adder input

    "Free addition" supports the multiply-accumulate (MAC) operation

  • LITERATURE

    [Parandeh-Afshar 2011]: Parandeh-Afshar & Ienne, "Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs", FPL 2011

    [Brunie 2013]: Brunie, de Dinechin, Istoan, Sergent, Illyes & Popa, "Arithmetic Core Generation Using Bit Heaps", FPL 2013

    [de Dinechin 2012]: de Dinechin & Pasca, "Designing Custom Arithmetic Data Paths with FloPoCo", IEEE Design & Test of Computers, 2012

    THANK YOU!

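  • BOOTH RECODING

    $$
    \begin{aligned}
    b &= b_{M-1} 2^{M-1} + \ldots + b_2 2^2 + b_1 2^1 + b_0 \\
      &= b_{M-1} 2^{M-1} + \ldots + b_2 2^2 + 2 b_1 2^1 + \underbrace{-b_1 2^1 + b_0}_{BE_0 = -2 b_1 + b_0} \\
      &= b_{M-1} 2^{M-1} + \ldots + 2 b_3 2^3 + \underbrace{-b_3 2^3 + b_2 2^2 + 2 b_1 2^1}_{BE_2 2^2 = (-2 b_3 + b_2 + b_1) 2^2} + BE_0 \\
      &= \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} BE_m 2^m \quad \text{with } BE_m = -2 b_{m+1} + b_m + b_{m-1}
    \end{aligned}
    $$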
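A small worked example may help to follow the recoding. It assumes an unsigned operand with M = 4, so that b_{-1} = b_4 = b_5 = 0; the value b = 13 is chosen only for illustration.

$$
\begin{aligned}
b &= 13 = (1101)_2, \quad b_3 = 1,\ b_2 = 1,\ b_1 = 0,\ b_0 = 1 \\
BE_0 &= -2 b_1 + b_0 + b_{-1} = 0 + 1 + 0 = 1 \\
BE_2 &= -2 b_3 + b_2 + b_1 = -2 + 1 + 0 = -1 \\
BE_4 &= -2 b_5 + b_4 + b_3 = 0 + 0 + 1 = 1 \\
\sum_{\substack{m=0 \\ m\ \text{even}}}^{4} BE_m 2^m &= 1 \cdot 2^0 - 1 \cdot 2^2 + 1 \cdot 2^4 = 1 - 4 + 16 = 13
\end{aligned}
$$

The term for m = M is needed here because the most significant bit b_3 is set, which makes BE_2 negative.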

  • WHY THEY ARE DIFFERENT?

    [Figure: Altera ALM]

  • WHY THEY ARE DIFFERENT?

    [Figure: Xilinx slice detail showing the LUT inputs D6:1 and FF/LAT storage elements with D, CE, CK, SR and Q ports and the INIT0/INIT1 and SRHI/SRLO options]