Enhancing FPGA Performance for Arithmetic Circuits

Philip Brisk 1, Ajay K. Verma 1, Paolo Ienne 1, Hadi Parandeh-Afshar 1,2

1 2 University of Tehran, Department of Electrical and Computer Engineering



Page 2: Outline

State of the Art: FPGAs

Proposed Solution

Field Programmable Counter Array (FPCA)

New Lattice for Accelerating Arithmetic Computations

Integrate on Same Die as FPGA

Experimental Results

Conclusion

Page 3: FPGA vs. ASIC

[Table: ASIC vs. FPGA compared on Performance, Area Utilization, Power Consumption, Flexibility, Time-to-Market]

Page 4: FPGA Commentary

Poor Performance for Arithmetic Operations Compared to ASIC

IP Cores

Limited Flexibility; 18-bit Adder/Multiplier

Full Adder Implemented in CLB Structure

Fast Carry-Chain (Xilinx and Altera)

Reduces Routing Delay

Cannot Use Compressor Trees to Add k>2 Values

Wallace/Dadda/3-Greedy

Page 5: Proposed Solution

1. Transform a DFG to Expose Multi-Input Addition Ops

• [Verma and Ienne, ICCAD ’04]

2. Map Addition Ops onto New Lattice (FPCA)

• Proposed Here

3. Map Everything Else onto Traditional FPGA

• Standard Approach

4. Integrate FPGA+FPCA Onto Same Die

• Ongoing Research at EPFL

Page 6: Verma-Ienne Transformation [ICCAD ’04]

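[Figure: ADPCM data-flow graph computing vpdiff from step and delta, before and after the transformation. Before: shifted copies of step (>> 3, >> 2, >> 1) are selected by SEL nodes on tests of delta (& 4, & 2, & 1 compared = 0) and accumulated through a chain of + nodes. After: the selected operands (step >> 3, step >> 2, step >> 1, step >> 0) feed a single Compressor Tree followed by one + that produces vpdiff.]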
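The effect of the transformation can be illustrated with a small sketch. It assumes the node labels in the figure (step, delta, vpdiff, >>, &, SEL, +) correspond to the familiar ADPCM difference computation; the function names are hypothetical and the code is not taken from the ICCAD ’04 paper. The first function chains conditional additions; the second exposes the same value as one four-operand sum, which is the shape a compressor tree can absorb.

```python
def vpdiff_sequential(step: int, delta: int) -> int:
    """Chained conditional additions (assumed form of the original DFG)."""
    vpdiff = step >> 3
    if delta & 4:
        vpdiff += step
    if delta & 2:
        vpdiff += step >> 1
    if delta & 1:
        vpdiff += step >> 2
    return vpdiff


def vpdiff_multi_input(step: int, delta: int) -> int:
    """Same value as a single multi-input addition.

    Each SEL node picks either a shifted copy of step or 0, so the
    additions no longer depend on one another.
    """
    operands = [
        step >> 3,                               # always added
        step if (delta & 4) != 0 else 0,         # SEL on bit 2 of delta
        (step >> 1) if (delta & 2) != 0 else 0,  # SEL on bit 1 of delta
        (step >> 2) if (delta & 1) != 0 else 0,  # SEL on bit 0 of delta
    ]
    # sum() stands in for a compressor tree followed by one final adder.
    return sum(operands)
```

Both functions agree for every input; what changes is that the second form has no addition on the path between the selects and the final sum.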

Page 7: Proposed Hybrid Lattice

[Figure: hybrid lattice with FPGA and FPCA blocks and a Final Adder (+), implemented as Programmable IP or on the FPGA]

FPCA: Field Programmable Counter Array

• Novel Lattice for Accelerating Large Sums

Page 8: Counters

m:n counter: m input bits, n output bits

n = ⌈log2(m+1)⌉

Count # of Input Bits Set to 1

Output # as a Binary Value

Counters You Know

2:2 – Half Adder

3:2 – Full Adder (Carry-Save Adder)

The correct building block for computing sums of k>2 numbers

Better than LUTs!
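A minimal behavioral sketch of an m:n counter, written only to make the definition concrete. The names `counter_width` and `counter` are illustrative. The output width is taken as the number of bits needed to represent the largest possible count m, which is ⌈log2(m+1)⌉ and equals log2(m+1) exactly when m+1 is a power of two.

```python
def counter_width(m: int) -> int:
    """Output bits n of an m:n counter: enough to represent the value m."""
    if m < 1:
        raise ValueError("a counter needs at least one input bit")
    return m.bit_length()


def counter(bits: list[int]) -> list[int]:
    """Count the input bits set to 1 and return the count in binary.

    The result is listed least-significant bit first.
    """
    if any(b not in (0, 1) for b in bits):
        raise ValueError("inputs must be single bits")
    n = counter_width(len(bits))
    count = sum(bits)
    return [(count >> i) & 1 for i in range(n)]
```

For example, `counter([1, 1, 1])` returns `[1, 1]`, the sum and carry outputs of a full adder with all three inputs high, and `counter_width` gives 4 for an 8-input counter and 5 for a 16-input counter, matching the 8:4 and 16:5 configurations named later in the slides.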

Page 9: Field Programmable Counter Array (FPCA)

Same Structure as an FPGA

Replace CLBs with Counters

Integrate onto Same Die as FPGA

[Figure: two lattices side by side, FPGA (CLB) and FPCA (Counter)]

Page 10: Experimental Methodology

Xilinx Virtex-4, Altera Stratix-II, With/Without FPCA

90nm CMOS Technology

For Multi-Input Addition Ops

FPGA – Adder Tree

• Binary Adders in Virtex-4

• Ternary Adders in Stratix-II

FPCA – Build Compressor Trees From Counters

• Use Modified Wallace Algorithm

• Place-and-Route Using VPR

• Use FPGA for Final Addition
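To show what building a compressor tree from counters involves, here is a sketch under stated assumptions. It uses a plain Wallace-style column reduction, not the modified Wallace algorithm of the experiments, whose details the slides do not give. The function name `compress_columns` and its parameters are hypothetical. Operands are non-negative integers of `width` bits, and counters take up to `m` inputs from one column.

```python
from collections import defaultdict


def compress_columns(operands: list[int], width: int, m: int = 3):
    """Reduce k operands to two rows using counters with up to m inputs.

    Returns (row_a, row_b, levels); row_a + row_b equals sum(operands),
    and levels is the number of counter layers on the longest path.
    """
    if m < 3:
        raise ValueError("counters need at least 3 inputs to compress")
    # columns[i] holds every bit of weight 2**i, zeros included, so the
    # tree structure does not depend on the data.
    columns = defaultdict(list)
    for value in operands:
        for i in range(width):
            columns[i].append((value >> i) & 1)

    levels = 0
    while any(len(col) > 2 for col in columns.values()):
        reduced = defaultdict(list)
        for i, col in columns.items():
            for start in range(0, len(col), m):
                group = col[start:start + m]
                if len(group) < 3:
                    reduced[i].extend(group)  # too few bits: pass through
                    continue
                # A counter with unused inputs tied to 0: the count of a
                # g-bit group needs g.bit_length() output bits.
                count = sum(group)
                for j in range(len(group).bit_length()):
                    reduced[i + j].append((count >> j) & 1)
        columns = reduced
        levels += 1

    # At most two bits remain per column; pack them into two rows for
    # the final carry-propagate addition.
    rows = [0, 0]
    for i, col in columns.items():
        for r, bit in enumerate(col):
            rows[r] |= bit << i
    return rows[0], rows[1], levels
```

Calling `compress_columns(values, 8, m=3)` and `compress_columns(values, 8, m=8)` on the same operands and comparing the returned `levels` gives a rough feel for why wider counters shorten the tree. The final `row_a + row_b` corresponds to the addition that the methodology assigns to the FPGA.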

Page 11: Experimental Results

[Chart: H.264/AVC Motion Estimation - Delay. Y-axis: Delay (ns), 0 to 10. Bars: Virtex-4 alone and with FPCA(8:4), FPCA(12:4), FPCA(16:5), FPCA(20:5); Stratix-II alone and with FPCA(8:4), FPCA(12:4), FPCA(16:5), FPCA(20:5). Legend: Logic Delay (ns), Sum Delay (ns).]

Page 12: Experimental Results

[Chart: H.264/AVC Motion Estimation - Area Utilization. Y-axis ticks: 0 to 80. Bars: Virtex-4, Virtex-4+FPCA, Stratix-II, Stratix-II+FPCA. Legend: Counters, CLBs.]

Page 13: Experimental Results

[Chart: FIR Filter - Delay. Y-axis: Delay (ns), 0 to 6. X-axis: 6-Tap, 10-Tap, 20-Tap, each split into Virtex-4 and Stratix-II. Legend: Virtex-4, w/FPCA (8:4), w/FPCA (12:4), Stratix-II, w/FPCA (8:4), w/FPCA (12:4).]

Page 14: Experimental Results

[Chart: FIR Filter - Pipeline Stages. Y-axis ticks: 0 to 12. X-axis: 6-Tap, 10-Tap, 20-Tap, each split into Virtex-4 and Stratix-II. Legend: Virtex-4, w/FPCA (8:4), w/FPCA (12:4), Stratix-II, w/FPCA (8:4), w/FPCA (12:4).]

FPCA – Register Placed on Every Counter Output

Page 15: Experimental Results

[Chart: 6-Tap FIR Filter - Area Utilization. Y-axis ticks: 0 to 1000. Bars: Virtex-4, Virtex-4 FPCA(8:4), Virtex-4 FPCA(12:4), Stratix-II, Stratix-II FPCA(8:4), Stratix-II FPCA(12:4). Legend: Counters, CLBs.]

Page 16: Experimental Results

[Chart: 10-Tap FIR Filter - Area Utilization. Y-axis ticks: 0 to 1600. Bars: Virtex-4, Virtex-4 FPCA (8:4), Virtex-4 FPCA (12:4), Stratix-II, Stratix-II FPCA (8:4), Stratix-II FPCA (12:4). Legend: Counters, CLBs.]

Page 17: Experimental Results

[Chart: 20-Tap FIR Filter - Area Utilization. Y-axis ticks: 0 to 3500. Bars: Virtex-4, Virtex-4 FPCA (8:4), Virtex-4 FPCA (12:4), Stratix-II, Stratix-II FPCA (8:4), Stratix-II FPCA (12:4). Legend: Counters, CLBs.]

Page 18: Conclusion

FPGA Performance for Arithmetic Circuits is Lacking

Hybrid FPGA/FPCA Accelerates Arithmetic Circuits

Significant Improvement in Area Utilization

Counters are the Correct Building Blocks for Multi-Input Additions

Marginal Improvements in Delay

FPGA – Fast Carry-Chain (No Routing Delay)

FPCA – All Wires Have Routing Delays

Naïve/No Retiming in These Experiments