A Flexible DSP Block to Enhance FPGA Arithmetic Performance
Hadi Parandeh-Afshar, Alessandro Cevrero, Panagiotis Athanasopoulos, Philip Brisk, Yusuf Leblebici, Paolo Ienne
École Polytechnique Fédérale de Lausanne (EPFL); University of California, Riverside (UCR)
{first_name.last_name@epfl.ch} [email protected]


Page 1: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Hadi Parandeh-Afshar (LAP, EPFL)

Alessandro Cevrero (LSM, LAP, EPFL)

Panagiotis Athanasopoulos (LSM, LAP, EPFL)

Philip Brisk (UCR)

Yusuf Leblebici (LSM, EPFL)

Paolo Ienne (LAP, EPFL)

École Polytechnique Fédérale de Lausanne (EPFL)

University of California, Riverside (UCR)

{[email protected]} [email protected]

Page 2: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Motivation and contribution

New DSP block for high-performance FPGAs

Increased flexibility

Enhanced FPGA arithmetic performance

[Figure: block diagram with two PPGs (labeled "Bypassable PPG") and blocks labeled "Programmable Compressor Tree"]

Page 3: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Motivation and contribution

Arithmetic transformations

Data-flow transformations automatically expose compressor trees [Verma et al., TCAD 08]

Fused multiply-add operations cannot use current DSP blocks in a single cycle

DSP blocks cannot accelerate multi-operand addition

[Figure: two data-flow graphs, (a) and (b), with inputs E1, E2, M1, M2, S1, S2, operators labeled sign, xor, neg, not, and and, and an output labeled out; edge widths not recoverable]

Page 4: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Outline

Related work

Limitations

DSP Block Architecture

Experimental methodology

Results

Conclusions

Page 5: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

FPGA commentary

Logic cells with dedicated addition circuitry and fast carry chains

Compressor tree synthesis on 6-LUT FPGAs [Parandeh-Afshar et al., ASPDAC 08, DATE 08, FPL 09]

IP cores [Xilinx, Altera]

FP cores [Beauchamp et al., TVLSI 08]

DSP blocks [Altera Stratix III-IV]

[Figure: four pairs of 9-bit inputs and a summation (Σ) block]

Page 7: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Field Programmable Compressor Tree (FPCT)

[Cevrero et al., FPGA 08, TRETS 09]

User-configurable multi-operand adder

Compressor tree + bypassable CPA

128 = 8×16 input bits

48 = 8×6 output bits

[Figure: a CSlice with 16 inputs and 6 outputs, with carry-in and carry-out links labeled 15]
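The idea of a compressor tree followed by a carry-propagate adder (CPA) can be illustrated in software. The following is a minimal sketch, not the FPCT's actual structure: it assumes plain 3:2 compressors (full-adder layers) and a fixed word width, and the function names are hypothetical.

```python
def compress_3_to_2(x, y, z):
    """One layer of full adders: three words in, a sum word and a carry word out."""
    s = x ^ y ^ z                             # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1    # carries, moved to the next column
    return s, c


def compressor_tree_sum(operands, width):
    """Reduce operands with 3:2 compressors, then do one carry-propagate addition."""
    mask = (1 << width) - 1
    ops = [v & mask for v in operands]
    while len(ops) > 2:
        nxt = []
        # Compress every complete group of three operands into two.
        for i in range(0, len(ops) - 2, 3):
            s, c = compress_3_to_2(ops[i], ops[i + 1], ops[i + 2])
            nxt += [s & mask, c & mask]
        # Operands that did not fill a group pass through to the next level.
        nxt += ops[len(ops) - len(ops) % 3:]
        ops = nxt
    # Final carry-propagate addition of the (at most two) remaining words.
    return sum(ops) & mask
```

No carry is propagated across columns until the final addition, which is why multi-operand sums benefit from this structure compared with a chain of ordinary adders.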

Page 8: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

FPCT limitations

PPG in soft logic

[Figure: 9x9-bit signed multiplier [Baugh-Wooley]: a soft-logic 9x9-bit PPG (81 LUTs) connected by 82 wires to the FPCT (an input labeled "1" is also shown), producing an 18-bit output]

Page 9: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

FPCT limitations

PPG in soft logic

Low input utilization for multipliers

64% input utilization

[Figure: 9x9-bit signed multiplier [Baugh-Wooley]: a soft-logic 9x9-bit PPG (81 LUTs) connected by 82 wires to the FPCT (an input labeled "1" is also shown), producing an 18-bit output; columns C0 to C6 annotated 2 2 2 2 3 3 3]
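To make the partial-product generation concrete, here is a minimal sketch of the textbook Baugh-Wooley scheme for a 9x9-bit signed multiplier. It is an illustration under assumptions, not the authors' PPG: the function names are hypothetical, and the textbook form adds two constant ones, whereas the slide does not spell out how its 82 wires are counted.

```python
def baugh_wooley_bits(a, b, n=9):
    """Return the n*n partial-product bits and the constant bits as (weight, bit) pairs."""
    abit = [(a >> i) & 1 for i in range(n)]   # two's-complement bits of a
    bbit = [(b >> j) & 1 for j in range(n)]   # two's-complement bits of b
    bits = []
    for i in range(n):
        for j in range(n):
            p = abit[i] & bbit[j]
            # Terms that involve exactly one sign bit are complemented.
            if (i == n - 1) != (j == n - 1):
                p ^= 1
            bits.append((i + j, p))
    # Textbook correction constants (an assumption about the exact variant).
    constants = [(n, 1), (2 * n - 1, 1)]
    return bits, constants


def multiply_signed(a, b, n=9):
    """Sum all weighted bits modulo 2**(2n) and reinterpret as a signed value."""
    bits, constants = baugh_wooley_bits(a, b, n)
    total = sum(bit << w for w, bit in bits + constants) % (1 << (2 * n))
    if total >= 1 << (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

The 81 partial-product bits correspond to the 81 LUTs on the slide. If the 82 wires are counted against the 128 FPCT input bits, 82/128 is about 64%, which would match the stated utilization, although the slide does not show that derivation.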

Page 10: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

DSP block architecture

[Figure: FPCT (8 CSlices) with 128 input bits and 48 output bits; additional signals labeled 4 and 11]

Page 11: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

DSP block architecture

Two 9x9 signed PPGs

One modified to support larger multipliers

Hard compression circuits ‘A’ and ‘B’

Efficient synthesis of large multipliers

[Figure: DSP block diagram with two ½-FPCTs (4 CSlices each), compression circuits A and B, and partial-product generators PPG and PPG*; bus widths not recoverable]

Page 12: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

DSP block architecture

Two 9x9 signed PPGs

One modified to support larger multipliers

Hard compression circuits ‘A’ and ‘B’

Efficient synthesis of large multipliers

[Figure: DSP block diagram with two ½-FPCTs (4 CSlices each), compression circuits A and B, and partial-product generators PPG and PPG*; bus widths not recoverable]

[Figure inset: Fixed Logic (A) and Fixed Logic (B) over columns C1 to C4, annotated 5 2 2 2 3 3]

Page 13: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

DSP block architecture

Two 9x9 signed PPGs

One modified to support larger multipliers

Hard compression circuits ‘A’ and ‘B’

Efficient synthesis of large multipliers

Only 8% larger than the traditional FPCT in 90nm CMOS (ARTISAN cell library with TSMC process)

[Figure: DSP block diagram with two ½-FPCTs (4 CSlices each), compression circuits A and B, and partial-product generators PPG and PPG*; bus widths not recoverable]
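One common way to build a larger multiplier from 9x9 pieces is to split each operand into 9-bit halves and sum the shifted sub-products with a multi-operand adder. The sketch below shows that general decomposition for an 18x18 signed multiply. It is an assumption about the kind of synthesis meant here, not the block's documented mapping, and the function names are hypothetical.

```python
def split9(x):
    """Split a signed value into a signed high part and an unsigned 9-bit low part."""
    return x >> 9, x & 0x1FF   # x == (x >> 9) * 512 + (x & 0x1FF)


def mul18x18(a, b):
    """18x18 signed multiply as four 9x9 sub-products plus multi-operand addition."""
    a_hi, a_lo = split9(a)
    b_hi, b_lo = split9(b)
    partial = [
        a_lo * b_lo,            # unsigned x unsigned, weight 2**0
        (a_hi * b_lo) << 9,     # signed x unsigned, weight 2**9
        (a_lo * b_hi) << 9,     # unsigned x signed, weight 2**9
        (a_hi * b_hi) << 18,    # signed x signed, weight 2**18
    ]
    # In hardware this sum is the job of the compressor tree and final adder.
    return sum(partial)
```

Note that three of the four sub-products mix signed and unsigned halves, so a purely signed 9x9 PPG would not suffice for this decomposition; how the modified PPG handles this is not stated on the slide.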

Page 14: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Experimental methodology

Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06]

Define a preplaced soft IP core: F*

Same area and I/O as our DSP block

[Figure: FPGA fabric with input pins, output pins, and three embedded blocks labeled IP]

Page 15: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Experimental methodology

Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06]

Define a preplaced soft IP core: F*

Same area and I/O as our DSP block

Replace our DSP block with F*

Map benchmark on Stratix II

Extract F* delay

Estimate proposed DSP block delay

ASIC design flow (90nm CMOS)

[Figure: FPGA fabric with input pins, output pins, and three embedded blocks labeled F*]

Page 16: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Experimental methodology

Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06]

Define a preplaced soft IP core: F*

Same area and I/O as our DSP block

Replace our DSP block with F*

Map benchmark on Stratix II

Extract F* delay

Estimate proposed DSP block delay

ASIC design flow (90nm CMOS)

For each proposed DSP block in the circuit

Subtract delay of F*

Add proposed DSP block delay

[Figure: FPGA fabric with input pins, output pins, and three embedded blocks labeled New-DSP]
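The delay substitution described above reduces to simple arithmetic per timing path. The following is a minimal sketch under stated assumptions: the function names and the sample numbers in the usage note are hypothetical, and taking the maximum over several candidate paths is an added precaution rather than something the slides describe.

```python
def adjusted_path_delay(path_delay_ns, num_blocks, fstar_delay_ns, dsp_delay_ns):
    """Swap the F* delay for the estimated DSP block delay on one timing path."""
    # For each DSP block on the path: subtract the F* delay, add the block delay.
    return path_delay_ns - num_blocks * fstar_delay_ns + num_blocks * dsp_delay_ns


def critical_path_delay(paths, fstar_delay_ns, dsp_delay_ns):
    """paths: iterable of (measured delay in ns, number of F* blocks on that path)."""
    return max(
        adjusted_path_delay(delay, count, fstar_delay_ns, dsp_delay_ns)
        for delay, count in paths
    )
```

For example, with made-up values `critical_path_delay([(10.0, 1), (9.0, 0)], 4.0, 2.0)`, the first path drops to 8.0 ns after substitution, so the second path, which contains no DSP block, becomes the critical one at 9.0 ns.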

Page 17: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Results

[Figure: chart of critical path delay (ns, axis 0 to 12) for benchmarks m9x9, m10x10, m12x12, m18x18, m20x20. Series: Ternary; GPC [Parandeh-Afshar et al., ASPDAC 08]; Stratix II DSP Block; FPCT w/ Soft PPG; Proposed DSP Block. Data values not recoverable]

Page 18: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Results

[Figure: chart of normalized area (relative to Stratix II DSP block area, axis 0 to 9) for benchmarks m9x9, m10x10, m12x12, m18x18, m20x20. Series: Stratix II DSP Block; FPCT w/ Soft PPG; Proposed DSP Block. Data values not recoverable]

Page 19: A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Conclusion

New DSP block proposed

Accelerates multiplication and multi-operand addition

More flexibility

Competitive with Stratix II DSP block

Intended to replace the compressor tree in existing DSP blocks

Only 8% area overhead with respect to the original FPCT