Upload
adrian-gilbert
View
214
Download
0
Tags:
Embed Size (px)
Citation preview
SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Pedro C. Diniz, Mary W. Hall, Joonseok Park, Byoungro So and Heidi Ziegler
University of Southern California / Information Sciences Institute4676 Admiralty Way, Suite 1001Marina del Rey, California 90292
DEFACTO: Combining Parallelizing Compiler Technology with Hardware
Behavioral Synthesis*
* The DEFACTO project was funded by the Information technology Office (ITO) of the Defense Advanced research project Agency (DARPA) under contract #F30602-98-2-0113.
2SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Outline
Background & Motivation Part 1: Application Mapping Example Part 2: Design Space Exploration Part 3: Challenges for Future FPGAs Related Work Conclusion
3SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
DEFACTO Objective & Goals
Objectives:• Automatically Map High-Level Applications
to Field-Programmable Hardware (FPGAs)• Explore Multiple Design Choices
Goal:• Make Reconfigurable Technology
Accessible to the Average Programmer
4SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
What Are FPGAs: Key Concepts
• Configurable Hardware• Reprogrammable (ms latency)
Architecture• Configurable Logic Blocks
“Universal” logic Some input/outputs latched
• Passive network between CLBs• Memories, processor cores
5SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Why Use FPGAs?
Advantages over Application-Specific Integrated Circuits (ASICs)
• Faster Time to Market• “Post silicon” Modification Possible• Reconfigurable, Possibly Even at Run-time
Advantages Over General-Purpose Processors• Application-Specific Customization (e.g., parallelism, small
data-widths, arithmetic, bandwidth)
Disadvantages• Slow (typical automatic design @25MHz)• Low Density of Transistors
6SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
How to Program FPGAs?
Hardware-Oriented Languages• VHDL or Verilog• Very Low-Level Programming
Commercial Tools (e.g., MonetTM)• Choose Implementation Based on User Constraints• Time and Space Trade-Off• Provide Estimations for Implementation
Problem: Too Slow for Large Complex Designs• Place-and-Route Can Take up to 8 Hours for Large Designs• Unclear What to Do When Things Go Wrong
7SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Behavioral Synthesis Example
variable A is std_logic_vector(0..7)…X <= (A * B) - (C * D) + F
6 Registers2 Multipliers
2 Adders/Subtractors1 (long) clock cycle
9 Registers2 Multipliers
2 Adders/Subtractors2 (shorter) clock cycles
13 Registers1 Multiplier
2 Adders/Subtractors3 (shorter) clock cycles
8SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Synthesizing FPGA Designs: Status
Technology Advances have led to Increasingly Large Parts• FPGAs now have Millions of “gates”
Current Practice is to Handcode Designs for FPGAs in Structural VHDL • Tedious and Error Prone• Requires Weeks to Months Even for Fairly
Simple Designs Higher-level Approach Needed!
9SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
DEFACTO: Key Ideas Parallelizing Compiler Technology
• Complements Behavioral Synthesis
• Adjusts Parallelism and Data Reuse
• Optimizes External Memory Accesses
Design Space Exploration• Evaluates and Compares Designs before Committing
to Hardware
• Improves Design Time Efficiency
• a form of Feedback-directed Optimization
10SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Opportunities: Parallelism & Storage
Behavioral Synthesis Parallelizing Compiler
Optimizations: Optimizations: Scalar Variables only Scalars & Multi-Dimensional Arrays inside Loop Body inside Loop Body & Across
Iterations
Supports User-Controlled Analysis Guides Automatic LoopLoop Unrolling Transformations
Manages Registers and Evaluates Tradeoffs of Different inter-operator Communication Memories, On- and Off-chip
Considers one FPGA System-level View
Performs Allocation, Binding & No Knowledge of Hardware Scheduling of Hardware Implementation
SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Part 1: Mapping Complete Designs from C to FPGAs
Sobel Edge Detection Example
12SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Example - Sobel Edge Detection
char img[IMAGE_SIZE][IMAGE_SIZE], edge [IMAGE_SIZE][IMAGE_SIZE];
int uh1, uh2, threshold;
for (i=0; i < IMAGE_SIZE - 4; i++) {
for (j=0; j < IMAGE_SIZE - 4; j++) {
uh1= (((-img[i][j]) + (- (2* img[i+1][j])) + (-img[i+2][j])) +
((img[i][j-2]) + (2* img[i+1][j-2]) + (img[i+2][j-2])));
uh2= (((-img[i][j]) + (img[i+2][j])) + (- (2* img[i][j-1])) +
(2* img[i+2][j-1]) + ((- img[i][j-2]) + (img[i][j-2])));
if ((abs(uh1) + abs(uh2)) < threshold)
edge[i][j] =”0xFF”;
else
edge[i][j] =”0x00;
}
}edgeimg
-1 -2 -1 0 0 0 1 2 1
1 0 -1 2 0 -2 1 0 -1
threshold
13SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Sobel - A Naïve Implementation
Large Number of Adders and Multipliers (shifts in this case) Too Many Memory Accesses !
• 8 Reads and 1 Write per Iteration of the Loop• Observation
Across 2 Iterations 4 out of 8 Values Can Be Reused
0x00
0xFF
img[i][j] img[i][j+1] img[i][j+2]
img[i+2][j] img[i+2][j+1] img[i+2][j+2]
img[i+1][j+2]img[i+1][j]
edge[i][j]
14SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Data Reuse Analysis - Sobelimg[i][j+1]
img[i+2][j+1]
img[i][j+2]
img[i+2][j+2]
img[i+1][j+2]
img[i][j]
img[i+2][j]
img[i+1][j]
d = (1,0) d = (2,0) d = (1,0)
img[i][j+1]
img[i+2][j+1]
img[i][j+2]
img[i+2][j+2]
img[i+1][j+2]
img[i][j]
img[i+2][j]
img[i+1][j]
d = (0,1)
d = (0,2)
d = (0,1)
15SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Data Reuse using Tapped-Delay Lines Reduce the Number of Memory Accesses Exploit Array Layout and Distribution
• Packing• Stripping
Examples:
0x00
0xFF
img[i][j]
img[i+1][j]
img[i+2][j]edge[i][j] edge[i][j]
img[i][j] img[i][j+1] img[i][j+2]
0x00
0xFF
Accesses = 0.25 + 0.25 + 0.25 + 0.25 = 1.0 Accesses = 1.0 + 1.0 + 1.0 + 1.0 = 4.0
16SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Overall Design Approach
MEM
MEM
MEM
MEM
ApplicationData-path
ApplicationData-path
Application Data-paths• Extract Body of Loops• Uses Behavioral Synthesis
Memory Interfaces• Uses Data Access Patterns to Generate Channel Specs• VHDL Library Templates
17SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
WildStarTM: A Complex Memory Hierarchy
SharedSharedMemory0Memory0SharedShared
Memory0Memory0
SharedSharedMemory2Memory2SharedShared
Memory2Memory2SharedShared
Memory3Memory3SharedShared
Memory3Memory3
SharedSharedMemory1Memory1SharedShared
Memory1Memory1
FPGA 1FPGA 1FPGA 1FPGA 1 FPGA 0FPGA 0FPGA 0FPGA 0 FPGA 2FPGA 2FPGA 2FPGA 2
SRAM1SRAM1SRAM1SRAM1
SRAM0SRAM0SRAM0SRAM0
SRAM3SRAM3SRAM3SRAM3
SRAM2SRAM2SRAM2SRAM2
PCIPCIControllerController
PCIPCIControllerController
To Off-BoardMemory
32bits
64bits
18SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Project Status Complex Infrastructure• Different Programming
Languages (C vs. VHDL)• Different EDA Tools• Different Vendors• Experimental Target• In-House Tools
Combines Compiler Techniques and Behavioral Synthesis
• Different Execution Models• Reconcile Representation
It Works!• Fully Automated for Single
FPGA designs• Modest Manual Intervention for
Multi-FPGA designs (simulation OK)
Compiler AnalysisCompiler Analysis
SUIF2VHDLSUIF2VHDL
Behavioral SynthesisBehavioral Synthesis& Estimation (Monet)& Estimation (Monet)
Logic SynthesisLogic Synthesis(Synplicity)(Synplicity)
Place & RoutePlace & Route(Xilinx Foundations)(Xilinx Foundations)
Code TransformationsCode Transformationsand Annotationsand Annotations
Annapolis WildStar Board
Algorithm Description
Computation & Data
Partitioning
Design Space Exploration
Memory Access
Protocols
19SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Sobel on the Annapolis WildStar BoardInput Image Output Image
Manual vs. AutomatedMetrics Manual Automated
Space (slices) 2238 2279 (2% increase)
Cycles 326K 518K
Clock Rate (MHz) 42 40
Execution Time (one frame) 7.7 ms (100%) 12.95 ms (159%)
Design Time about 1 week 42 minutes
SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Part 2: Design Space Exploration
Using Behavioral Synthesis Estimates
21SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Design Space Exploration (Current Practice)
Logic Synthesis / Place&Route
Design Specification (Low-level VHDL)
DesignModification
Validation / Evaluation
2 Weeks for a Working Design 2 Months for an Optimized Design
Correct?Good design?
22SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Design Space Exploration (Our Approach)
Algorithm (C/Fortran)
Compiler Optimizations (SUIF)• Unroll and Jam• Scalar Replacement• Custom Data Layout
SUIF2VHDL Translation
Behavioral Synthesis Estimation
Unroll Factor Selection
Logic Synthesis / Place&Route
Overall, Less than 2 hours 5 Minutes for Optimized Design Selection
23SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Problem Statement
Execution Time
Exploit parallelism,Reuse dataon chip
Space Requirements
More copies of operators, More on-chip registers
Constraint: Size of design less than FPGA capacity Goal: Minimal execution time Selection Criteria: For given performance, minimal space
• Frees up more space for other computations• Better clock rate achieved• Desirable to use on-chip space efficiently
24SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Balance Definition: Data Fetch Rate Consumption Rate
• Consumption Rate[bits/cycle] = data bits consumed per computation time
Limited by the Data Dependences of the Computation
• Data fetch Rate[bits/cycle] = data bits required per computation time
Limited by the FPGA’s Effective Memory Bandwidth
If balance > 1, Compute Bound If balance < 1, Memory Bound Balance suggests whether more resources should
be devoted to enhance computation or storage.
25SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Do I=1,N, by 2 A(I) = A(I-2) + B(I) A(I+1) = A(I-1) + B(I+1)
Loop Unrolling
Exposes fine-grain parallelism by replicating the loop body.
Do I=1, N A(I) = A(I-2) + B(I)
A(I)
B(I)A(I)
A(I-2)A(I)
A(I+1)A(I-1)
B(I+1)
B(I)
As Unrolling Factor Increases, both Data Fetch and Consumption Rate Increase.
2 2
2
26SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Monotonicity Properties
unroll factor
Data Fetch Rate (bits/cycle)
Saturationpoint
Data Consumption Rate (bits/cycle)
Saturation point
Balance(= Fetch/Consumption)
unroll factor
unroll factor
Saturation point: unroll factor that saturates memory bandwidth for a given architecture
27SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Balance & Optimal Unroll Factor
Unroll factor
Rate (bits/cycle)1
3
5
2
4
Saturation point max
Data fetch rate
Data consumption rate
Optimal solution
Balance Guides the Design Space Exploration.
28SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Experiments Multimedia Kernels
• FIR (Finite Impulse Response)• Matrix Multiply• Sobel (Edge Detection)• Pattern Matching• Jacobi (Five Point Stencil)
Methodology• Compiler Translates C to SUIF and Behavioral VHDL• Synthesis Tool Estimates Space and Computational Latency• Compiler Computes Balance and Execution Time Accounting for
Memory Latency Memory Latency
• Pipelined: 1 cycle for read and write• Non-pipelined: 7 cycles for read and 3 cycles for write
29SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
FIR
+ +x x
* *
Outer Loop Unroll Factor 1
Outer Loop Unroll Factor 2
Outer Loop Unroll Factor 4
Outer Loop Unroll Factor 8
Outer Loop Unroll Factor 16
Outer Loop Unroll Factor 32
Outer Loop Unroll Factor 64
Speedup: 17.26
Selected Design
30SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Matrix Multiply
+ +x x
Outer Loop Unroll Factor 1
Outer Loop Unroll Factor 2
Outer Loop Unroll Factor 4
Outer Loop Unroll Factor 8
Outer Loop Unroll Factor 16
Outer Loop Unroll Factor 32
Selected Design
Speedup: 13.36
31SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Efficiency of Design Space Exploration
Program Search Space Searched points
FIR 2048 (24) 3
Matrix Multiply 2048 (24) 4
Jacobi 512 (27) 5
Pattern 768 (20) 4
Sobel 2048 (24) 2
On average, only 0.3% (15%) of Space Searched.
32SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
FIR: Estimation vs. Accurate Data
Larger Designs Lead to Degradation in Clock Rates Compiler Can Use a Statistical Approach to Derive Confidence
Intervals for Space Our case: Compiler Makes Correct Decision using Imperfect Data
SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Part 3: Challenges for Future FPGAs
Heterogeneous Functional and Storage Resources
Data/Computation Partitioning and Scheduling Revisited
34SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Field-Programmable-Core-Arrays Large Number of Transistors
• Multiple Application Specific Cores• Customization of Interconnect• Other Specialized Logic
Challenges:• Data Partitioning:
Custom Storage Structures Allocation, Binding and Scheduling Replication and Reorganization
• Computation Partition Scheduling between Cores Coarse-Grain Pipelining
Revisiting Issues with Parallelizing Compiler Technology
IP CoreS-RAMD-RAM
IP CoreDSP
IP CoreARM
35SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Related Work Compilers for Special-purpose Configurable
Architectures• PipeRench (CMU), RaPiD (UW), RAW (MIT)
High-level Languages Oriented towards Hardware
• Handel-C, Cameron(CSU), PICO(HP), Napa-C (LANL)
Integrated Compiler and Logic Synthesis• Babb (MIT), Nimble (Synopsys)
Compiling from MatLab to FPGAs• Match compiler (Northwestern)
36SCIENCESSCIENCES
USCUSCINFORMATIONINFORMATION
INSTITUTEINSTITUTE
Conclusion Combines Behavioral Synthesis and Parallelizing
Compiler Technologies Fast & Automated Design Space Exploration
• Trades Space with Functional Units via Loop Unrolling• Uses Balance and Monotonicity Properties• Searches only 0.3% of the Entire Design Space
Near-optimal Performance and Smallest Space Future FPGAs
• Coarser-grained, Custom Functional and Storage Structures• Multiprocessor on a Chip• Data and Computation Partitioning and Coarse Grain Scheduling