Enhancing FPGA Performance for Arithmetic Circuits
Philip Brisk1, Ajay K. Verma1, Paolo Ienne1, Hadi Parandeh-Afshar1,2
2 University of Tehran, Department of Electrical and Computer Engineering
Outline
State of the Art: FPGAs
Proposed Solution
Field Programmable Counter Array (FPCA)
New Lattice for Accelerating Arithmetic Computations
Integrate on Same Die as FPGA
Experimental Results
Conclusion
FPGA vs. ASIC
                     ASIC   FPGA
Performance           √
Area Utilization      √
Power Consumption     √
Flexibility                  √
Time-to-Market               √
FPGA Commentary
Poor Performance for Arithmetic Operations Compared to ASIC
IP Cores
Limited Flexibility; 18-bit Adder/Multiplier
Full Adder Implemented in CLB Structure
Fast Carry-Chain (Xilinx and Altera)
Reduces Routing Delay
Cannot Use Compressor Trees to Add k>2 Values
Wallace/Dadda/3-Greedy
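To make the compressor-tree idea concrete, here is a minimal Python sketch of carry-save addition, the operation that Wallace and Dadda trees are built from. The function names `carry_save_add` and `add_many` are illustrative and do not come from the slides. The loop is sequential only to show the arithmetic; a hardware compressor tree would arrange the same 3:2 stages in parallel layers.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """Compress three non-negative integers into a (sum, carry) pair.

    Each bit position behaves like an independent full adder,
    so no carry ripples across the word.
    """
    partial_sum = a ^ b ^ c                      # sum bit of each full adder
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carry bit, one position higher
    return partial_sum, carry


def add_many(values: list[int]) -> int:
    """Add k values: compress down to two operands, then do one ordinary add."""
    operands = list(values)
    while len(operands) > 2:
        s, c = carry_save_add(operands.pop(), operands.pop(), operands.pop())
        operands += [s, c]                       # three operands became two
    return sum(operands)                         # the single carry-propagate addition
```

Only the last step needs a carry-propagate adder. Every earlier step is a layer of independent one-bit cells, which is why the reduction is cheap in an ASIC and awkward on a fabric organized around carry chains.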
Proposed Solution
1. Transform a DFG to Expose Multi-Input Addition Ops
• [Verma and Ienne, ICCAD ’04]
2. Map Addition Ops onto New Lattice (FPCA)
• Proposed Here
3. Map Everything Else onto Traditional FPGA
• Standard Approach
4. Integrate FPGA+FPCA Onto Same Die
• Ongoing Research at EPFL
Verma-Ienne Transformation [ICCAD ’04]
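[Figure: ADPCM dataflow graph computing vpdiff from step and delta, before and after the transformation. Before: shifts (>>), masks (&), comparisons with 0 (=), and selects (SEL) feed a chain of additions (+). After: the selected terms feed a single multi-input sum (∑, compressor tree) followed by one final addition (+).]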
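The effect of the transformation can be sketched in Python. The figure is too damaged to read the exact operands, so the shift amounts and masks below are an assumption taken from the widely used IMA ADPCM reference decoder, and both function names are illustrative. The first function is the chain of conditional additions; the second computes the same value as one four-operand sum, the form a compressor tree can implement.

```python
def vpdiff_sequential(step: int, delta: int) -> int:
    """Chain of conditional additions (assumed IMA ADPCM reference form)."""
    vpdiff = step >> 3
    if delta & 4:
        vpdiff += step
    if delta & 2:
        vpdiff += step >> 1
    if delta & 1:
        vpdiff += step >> 2
    return vpdiff


def vpdiff_multi_input(step: int, delta: int) -> int:
    """Same value as one multi-input sum; each select yields an addend or 0."""
    addends = [
        step >> 3,
        step if delta & 4 else 0,
        (step >> 1) if delta & 2 else 0,
        (step >> 2) if delta & 1 else 0,
    ]
    return sum(addends)  # the multi-input addition exposed by the rewrite
```

In the second form the selects no longer sit between the additions, so all four addends are available at once.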
Proposed Hybrid Lattice
[Figure: hybrid lattice with an FPGA region and an FPCA region; the multi-input sum (∑) is followed by a final addition (+).]
Final Adder (Programmable IP or FPGA)
FPCA: Field Programmable Counter Array
• Novel Lattice for Accelerating Large Sums
Counters
m:n counter: m input bits, n output bits
n = ⌈log2(m+1)⌉
Count #of Input Bits Set to 1
Output # as a Binary Value
Counters You Know
2:2 – Half Adder
3:2 – Full Adder (Carry-Save Adder)
The correct building block for computing sums of k>2 numbers
Better than LUTs!
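A small Python model of an m:n counter may help fix the definition. The names `counter_output_width` and `counter` are illustrative, and the output width is rounded up to a whole number of bits, which matches the 8:4, 12:4, 16:5, and 20:5 counters used in the experiments.

```python
def counter_output_width(m: int) -> int:
    """Output bits n needed to count m input bits.

    For m >= 0, m.bit_length() equals ceil(log2(m + 1)) and avoids floats.
    """
    return m.bit_length()


def counter(bits: list[int]) -> list[int]:
    """Model an m:n counter: count the 1-bits and return the count, LSB first."""
    n = counter_output_width(len(bits))
    count = sum(bits)
    return [(count >> i) & 1 for i in range(n)]
```

With two inputs this is a half adder and with three inputs a full adder: the first output bit is the sum and the second is the carry.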
Field Programmable Counter Array (FPCA)
Same Structure as an FPGA
Replace CLBs with Counters
Integrate onto Same Die as FPGA
[Figure: two lattices side by side, an FPGA built from CLBs and an FPCA built from counters.]
Experimental Methodology
Xilinx Virtex-4, Altera Stratix-II, With/Without FPCA
90nm CMOS Technology
For Multi-Input Addition Ops
FPGA – Adder Tree
• Binary Adders in Virtex-4
• Ternary Adders in Stratix-II
FPCA – Build Compressor Trees From Counters
• Use Modified Wallace Algorithm
• Place-and-Route Using VPR
• Use FPGA for Final Addition
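The slides name a modified Wallace algorithm without spelling it out. The Python sketch below is therefore not the authors' algorithm but a plain greedy column compression under stated assumptions: operands are non-negative and fit in `width` bits, every counter accepts at most `m` inputs with `m >= 3`, and all function names are illustrative. It shows how layers of counters reduce a bit matrix until two rows remain for the final adder.

```python
def to_columns(values, width):
    """Arrange operand bits by weight: columns[i] holds every bit of weight 2**i."""
    return [[(v >> i) & 1 for v in values] for i in range(width)]


def compress_layer(columns, m):
    """Apply one layer of counters (at most m inputs each) to every tall column."""
    next_columns = [[] for _ in range(len(columns) + m.bit_length())]
    for i, column in enumerate(columns):
        if len(column) <= 2:
            next_columns[i].extend(column)       # already short; pass through
            continue
        for start in range(0, len(column), m):
            group = column[start:start + m]
            count = sum(group)                   # what one counter computes
            for j in range(len(group).bit_length()):
                # Output bit j of the counter has weight 2**(i + j).
                next_columns[i + j].append((count >> j) & 1)
    while next_columns and not next_columns[-1]:
        next_columns.pop()                       # drop unused high columns
    return next_columns


def compressor_tree_sum(values, width, m=3):
    """Return (total, layers): the sum and the number of counter layers used."""
    assert m >= 3, "counters with fewer than 3 inputs cannot shrink a column"
    columns = to_columns(values, width)
    layers = 0
    while any(len(col) > 2 for col in columns):
        columns = compress_layer(columns, m)
        layers += 1
    # At most two bits per column remain, i.e. two rows. Summing them
    # stands in for the final carry-propagate adder on the FPGA.
    total = sum(bit << i for i, col in enumerate(columns) for bit in col)
    return total, layers
```

For example, `compressor_tree_sum([13, 7, 9, 3, 15, 1, 6, 11], width=4, m=8)` returns the sum together with the number of counter layers that were needed. Place-and-route effects, which the slides handle with VPR, are outside the scope of this model.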
Experimental Results
H.264/AVC Motion Estimation - Delay
[Chart: delay in ns (axis 0 to 10), with legend entries Logic Delay (ns) and Sum Delay (ns), for Virtex-4 and Stratix-II, each alone and with FPCA(8:4), FPCA(12:4), FPCA(16:5), and FPCA(20:5).]
Experimental Results
H.264/AVC Motion Estimation - Area Utilization
[Chart: area utilization (axis 0 to 80), split into Counters and CLBs, for Virtex-4, Virtex-4+FPCA, Stratix-II, and Stratix-II+FPCA.]
Experimental Results
FIR Filter - Delay
[Chart: delay in ns (axis 0 to 6) for 6-tap, 10-tap, and 20-tap filters, grouped by Virtex-4 and Stratix-II, each alone, w/FPCA (8:4), and w/FPCA (12:4).]
Experimental Results
FIR Filter - Pipeline Stages
[Chart: number of pipeline stages (axis 0 to 12) for 6-tap, 10-tap, and 20-tap filters, grouped by Virtex-4 and Stratix-II, each alone, w/FPCA (8:4), and w/FPCA (12:4).]
FPCA – Register Placed on Every Counter Output
Experimental Results
6-Tap FIR Filter - Area Utilization
[Chart: area utilization (axis 0 to 1000), split into Counters and CLBs, for Virtex-4, Virtex-4 FPCA(8:4), Virtex-4 FPCA(12:4), Stratix-II, Stratix-II FPCA(8:4), and Stratix-II FPCA(12:4).]
Experimental Results
10-Tap FIR Filter - Area Utilization
[Chart: area utilization (axis 0 to 1600), split into Counters and CLBs, for Virtex-4, Virtex-4 FPCA (8:4), Virtex-4 FPCA (12:4), Stratix-II, Stratix-II FPCA (8:4), and Stratix-II FPCA (12:4).]
Experimental Results
20-Tap FIR Filter - Area Utilization
[Chart: area utilization (axis 0 to 3500), split into Counters and CLBs, for Virtex-4, Virtex-4 FPCA (8:4), Virtex-4 FPCA (12:4), Stratix-II, Stratix-II FPCA (8:4), and Stratix-II FPCA (12:4).]
Conclusion
FPGA Performance for Arithmetic Circuits is Lacking
Hybrid FPGA/FPCA Accelerates Arithmetic Circuits
Significant Improvement in Area Utilization
Counters are the Correct Building Blocks for Multi-Input Additions
Marginal Improvements in Delay
FPGA – Fast Carry-Chain (No Routing Delay)
FPCA – All Wires Have Routing Delays
Naïve/No Retiming in These Experiments