Center for Embedded Computer Systems, University of California, Irvine
http://www.cecs.uci.edu/~spark
Coordinated Coarse-Grain and Fine-Grain Optimizations for High-Level Synthesis
Topic Defense
Sumit Gupta
High Level Synthesis
Transform behavioral descriptions to RTL/gate level
From C to CDFG to Architecture

    x = a + b;
    c = a < b;
    if (c)
        d = e - f;
    else
        g = h + i;
    j = d * g;
    l = e + x;

[Figure: the C fragment above, its control-data flow graph (an If node on c with d = e - f on the true branch and g = h + i on the false branch, followed by j = d * g and l = e + x), and the target architecture with memory, ALU, control and data path]

High-level Synthesis
- Well-researched area since the early 1980s, so what's new?
- Level of design entry has moved up from schematic entry to coding in hardware description languages (VHDL, Verilog, C)
- No comprehensive synthesis framework
  - Few and scattered optimizations: mostly algebraic and at operation level of granularity
  - Results presented for scheduling
  - Effects on logic synthesis not understood
  - Small, synthetic benchmarks: primarily data-intensive DSP algorithms
- Quality of synthesis results severely affected by complex control flow
  - Nested ifs and loops not handled or handled poorly
- Poor understanding of the interaction between source-level and fine-grain "compiler" transformations

Focus of this Work
Target applications:
- Behavioral descriptions with complex and nested conditionals and loops; for example:
  - mixed data and control-intensive multimedia and image processing applications
  - control-intensive microprocessor blocks: resource rich, few highly packed cycles
Objectives:
- Improve quality of HLS results by concurrency enhancement
- Improve controllability of the HLS solutions

Characteristics of Target Applications
- Moderately control-intensive behaviors
  - Operations that execute under conditions
  - Entire behaviors within nested loops
- Programming styles significantly affect quality of results:
  - Placement of operations and control flow
  - Choice of control flow: nesting of ifs and loops
- A need for high-level and compiler transformations
  - To overcome the variance due to programming style
  - Increase resource utilization in the presence of conditionals
  - Exploit mutual exclusivity of operations to enhance resource sharing
Maximally parallelize operations under given resource constraints

Recent Related Work
- Code motions in the presence of conditionals
  - Condition Vector List Scheduling [Wakabayashi 89]
  - Symbolic Scheduling [Radivojevic 96]
  - WaveSched Scheduler [Lakshminarayana 98]
  - Basic Block Control Graph Scheduling [Santos 99]
- Limitations
  - Arbitrary nesting of conditionals and loops not handled or handled poorly
  - Ad hoc optimizations
  - Not part of a complete synthesis system
  - Limited analysis of logic and control costs

Parallelizing Compiler Background
- Scheduling for increasing instruction-level parallelism
  - Percolation Scheduling
    - Can produce optimal schedule given enough resources
  - Trailblazing
    - Hierarchical code motion technique
  - Trace Scheduling, Superblock and Hyperblock Scheduling
- Loop transformations
  - Loop invariant code motion
  - Loop pipelining
  - Induction variable analysis
  - Loop fusion, interchange, distribution
- Partial evaluation
  - CSE, copy propagation, constant folding

In the Context of High-Level Synthesis
- Cost models are different
  - Operation and resource models
  - Non-sequential designs
- Transformations have implications on hardware
  - Non-trivial control costs
  - Operation duplication leads to flexible scheduling; however, it can lead to higher control costs
- Mutual exclusivity of operations
  - Resource sharing

Coarse and Fine-Grain Code Optimizations
- Beyond basic block code motions
  - Speculation
  - Reverse speculation
  - Early condition execution
  - Conditional speculation
- Dynamic common sub-expression elimination
- Loop unrolling
  - Loop index variable elimination
- Chaining operations across conditionals
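
As a rough illustration of three of the code motions listed above, the sketch below writes one small behavior four ways in C. The function names and the chosen operations are hypothetical and are not taken from SPARK; the definitions assumed here are that speculation executes a conditional operation unconditionally, reverse speculation moves an operation from before a conditional into the branch that needs it, and conditional speculation duplicates an operation from after a conditional into its branches.

```c
#include <assert.h>

/* Baseline behavior (hypothetical example, not from the slides). */
int baseline(int a, int b, int e, int f)
{
    int x = a + b;          /* used only on the false branch */
    int c = a < b;
    int d;
    if (c)
        d = e - f;          /* control dependent on c */
    else
        d = x;
    return d + 1;           /* sits after the conditional */
}

/* Speculation: e - f runs unconditionally into a temporary before c is
   known; only the choice of which value reaches d stays conditional. */
int with_speculation(int a, int b, int e, int f)
{
    int t = e - f;
    int x = a + b;
    int c = a < b;
    int d = c ? t : x;
    return d + 1;
}

/* Reverse speculation: a + b moves down into the only branch that uses it. */
int with_reverse_speculation(int a, int b, int e, int f)
{
    int c = a < b;
    int d;
    if (c)
        d = e - f;
    else
        d = a + b;
    return d + 1;
}

/* Conditional speculation: the "+ 1" that followed the conditional is
   duplicated up into both branches. */
int with_conditional_speculation(int a, int b, int e, int f)
{
    int x = a + b;
    int c = a < b;
    int r;
    if (c)
        r = (e - f) + 1;
    else
        r = x + 1;
    return r;
}

/* Self-check: all variants agree with the baseline on a small input grid. */
void check_code_motions(void)
{
    for (int a = -2; a <= 2; a++) {
        for (int b = -2; b <= 2; b++) {
            int ref = baseline(a, b, 10, 3);
            assert(with_speculation(a, b, 10, 3) == ref);
            assert(with_reverse_speculation(a, b, 10, 3) == ref);
            assert(with_conditional_speculation(a, b, 10, 3) == ref);
        }
    }
}
```

Calling `check_code_motions()` exercises both branches. In hardware terms, each variant trades when a functional unit is busy against how much steering and control logic is needed, which is the cost the later slides measure.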

Concurrency Enhancement by Code Motions
[Figure: hierarchical task graph (HTG) representation of a control-data flow graph with an If node and true/false branches; operations labelled a, b and c are moved by speculation, reverse speculation, conditional speculation, and motion across hierarchical blocks, with resource utilization shown alongside]
Leads to higher resource utilization and shorter schedule lengths

Scheduling Heuristic
[Figure: basic blocks BB 0 to BB 7 with candidate additions a, b, c and d; the code motions needed to schedule them are speculation, motion across HTG nodes, and conditional speculation]
- Get available operations: a, b, c, d
- Determine code motions required
- Assign cost to each operation
- Schedule operation with lowest cost
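
The four steps above can be sketched as a small selection loop in C. This is only a minimal sketch under stated assumptions: the `Op` record, the `motion_penalty` parameter, and the cost model (a base cost plus a fixed penalty per required code motion) are hypothetical stand-ins, since the slides do not say how SPARK computes the cost.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one available operation. */
typedef struct {
    char name;       /* label, e.g. 'a' to 'd' as on the slide */
    int  base_cost;  /* assumed priority-derived cost, lower is better */
    int  motions;    /* code motions required to schedule it now */
    int  scheduled;  /* nonzero once the operation has been placed */
} Op;

/* Assumed cost model: base cost plus a fixed penalty per code motion. */
static int op_cost(const Op *op, int motion_penalty)
{
    return op->base_cost + op->motions * motion_penalty;
}

/* Index of the unscheduled operation with the lowest cost, or -1 if none
   remain. Ties go to the earlier entry. */
int pick_lowest_cost(const Op *ops, size_t n, int motion_penalty)
{
    int best = -1;
    int best_cost = 0;
    for (size_t i = 0; i < n; i++) {
        if (ops[i].scheduled)
            continue;
        int cost = op_cost(&ops[i], motion_penalty);
        if (best < 0 || cost < best_cost) {
            best = (int)i;
            best_cost = cost;
        }
    }
    return best;
}

/* Repeatedly schedules the cheapest operation and records the order.
   `order` must have room for n characters. Returns the number placed. */
size_t schedule_order(Op *ops, size_t n, int motion_penalty, char *order)
{
    assert(order != NULL);
    size_t placed = 0;
    int next;
    while ((next = pick_lowest_cost(ops, n, motion_penalty)) >= 0) {
        ops[next].scheduled = 1;
        order[placed++] = ops[next].name;
    }
    return placed;
}
```

A real scheduler would also recompute the available set and the required code motions after every placement and would respect resource constraints per step; the sketch keeps the set fixed to show only the cost-driven choice.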

Dynamic Common Sub-expression Elimination
[Figure: basic blocks BB 0 to BB 7. Before: a = b + c and d = b + c sit in different basic blocks. After speculation: dcse = b + c is computed once, and the original operations become a = dcse and d = dcse]
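
The before/after pair on this slide can be mimicked in C. The sketch assumes that the two `b + c` operations sit under separate conditions, so that neither one is guaranteed to execute before the other and ordinary CSE cannot merge them; the conditions `p` and `q`, the `Result` struct and the function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical container for the two results named on the slide. */
typedef struct {
    int a;
    int d;
} Result;

/* Before: b + c is computed in two separately guarded places. */
Result before_dcse(int b, int c, int p, int q)
{
    Result r = {0, 0};
    if (p)
        r.a = b + c;
    if (q)
        r.d = b + c;
    return r;
}

/* After: the addition is speculated above both conditionals into dcse,
   and the original operations become plain copies. */
Result after_dcse(int b, int c, int p, int q)
{
    Result r = {0, 0};
    int dcse = b + c;
    if (p)
        r.a = dcse;
    if (q)
        r.d = dcse;
    return r;
}

/* Self-check: both versions agree for every combination of conditions. */
void check_dcse(void)
{
    for (int p = 0; p <= 1; p++) {
        for (int q = 0; q <= 1; q++) {
            Result x = before_dcse(2, 3, p, q);
            Result y = after_dcse(2, 3, p, q);
            assert(x.a == y.a && x.d == y.d);
        }
    }
}
```

The second version needs one adder activation instead of up to two, at the price of computing `b + c` even when neither condition holds.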

Interconnect Minimization by Resource Binding
- Minimize the complexity of steering logic
  - Multiplexors and demultiplexors
- Introduce additional interconnect constraints/costs during resource binding
- Operation and variable binding have been formulated as network flow problems
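
Operation Binding
[Figure: two additions, c = a + b and f = e + b, mapped to a single ALU whose inputs carry a or e on one port and b on the other, and whose output feeds c or f]
Bind operations with the same inputs or outputs to the same functional unit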
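
One way to see why this rule helps is to count the multiplexer inputs a binding implies. The sketch below is an assumption-laden illustration, not the network flow formulation mentioned earlier: `AddOp` and `steering_cost` are hypothetical names, only the two input ports are counted, and operand commutativity is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-input operation: out = left + right. */
typedef struct {
    char left;
    char right;
    char out;
} AddOp;

/* Number of distinct source variables seen on one input port across all
   operations bound to the same functional unit. */
static size_t distinct_sources(const AddOp *ops, size_t n, int use_right)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        char s = use_right ? ops[i].right : ops[i].left;
        int seen = 0;
        for (size_t j = 0; j < i && !seen; j++) {
            char t = use_right ? ops[j].right : ops[j].left;
            seen = (s == t);
        }
        if (!seen)
            count++;
    }
    return count;
}

/* Total multiplexer inputs needed at the two input ports. A port fed by a
   single source needs no multiplexer and contributes 0. */
size_t steering_cost(const AddOp *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    size_t l = distinct_sources(ops, n, 0);
    size_t r = distinct_sources(ops, n, 1);
    return (l > 1 ? l : 0) + (r > 1 ? r : 0);
}
```

For the slide's pair, c = a + b and f = e + b, the shared operand b leaves one port without a multiplexer, so the cost is 2. Binding two additions with no operand in common to the same unit would cost 4 under this measure.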

Variable Binding
[Figure: the same ALU, with inputs a or e and b, and outputs c or f]
Bind variables that are inputs or outputs to the same functional unit to the same registers

Implementation: SPARK High Level Synthesis Framework

Experimental Setup
- Benchmarks derived from several industrial designs
  - MPEG-1 Prediction Block
  - ADPCM Encoder
  - Several image processing passes from GIMP software
- Synthesized using Spark
  - Number of states in FSM
  - Cycles on longest path in design
- RTL VHDL from Spark synthesized using Synopsys
  - Critical path length (ns) => dictates clock period
  - Unit area (in terms of synthesis library used)

HLS Results for Code Motions
[Figure: two bar charts for the ADPCM Encode, MPEG calc_forw and MPEG pred2 benchmarks. One plots the number of states in the FSM controller (normalized, 0 to 1); the other plots cycles on the longest path through the design (normalized, 0 to 1). Bars compare the allowed code motions:
  - Within basic blocks
  - Within BBs, across hierarchical blocks
  - Within BBs, across hierarchical blocks, speculation
  - Within BBs, across hierarchical blocks, speculation, early condition execution
  - Within BBs, across hierarchical blocks, speculation, early condition execution, conditional speculation]
Overall performance gains of up to 50% in controller size and longest path cycles

Logic Synthesis Results for Code Motions
[Figure: two bar charts of normalized values (0 to 1.2), one for the MPEG Pred2 function and one for the ADPCM Encoder function, both synthesized using the LSI-10K synthesis library. Each plots critical path (c ns), longest path (l cycles), delay (c*l ns) and unit area. Bars compare the allowed code motions:
  - Within basic blocks
  - Within BBs, across hierarchical blocks, speculation
  - Within BBs, across hierarchical blocks, speculation, early condition execution
  - Within BBs, across hierarchical blocks, speculation, early condition execution, conditional speculation]
Enabling all code motions leads to
- Reduced circuit delays: up to 50%
- Increased area/interconnect costs
  - Reduced by interconnect-aware resource binding

Results after Interconnect Minimization
[Figure: two bar charts of normalized values (0 to 1), one for the MPEG Pred2 function and one for the ADPCM Encoder function, both synthesized using the LSI-10K library. Each plots critical path, total delay and unit area for naïve resource binding versus interconnect minimizing resource binding]
- Reductions in area of between 15 and 32%
- Fairly constant critical path lengths and circuit delay

Synthesis Results with Dynamic CSE
[Figure: bar chart for the MPEG Pred2 function. Normalized values (0 to 1.2) for number of states, longest path (cycles), number of registers, critical path (c ns) and unit area. Bars compare: no CSE, with CSE, with dynamic CSE, with CSE and dynamic CSE]

DCSE Synthesis Results: Pred0
[Figure: bar chart for the MPEG Pred0 function. Normalized values (0 to 1.2) for number of states, longest path (cycles), number of registers, critical path (c ns) and unit area. Bars compare: no CSE, with CSE, with dynamic CSE, with CSE and dynamic CSE]
- Delays reduce by up to 40%
- Area reduces by up to 35%
- Register usage reduces!

Summary of Work Done
- Speculative code motions
- Code motion techniques
  - Trailblazing
- Compiler passes
  - Copy and constant propagation
  - Dead code elimination
  - Common sub-expression elimination
  - Dynamic renaming
- Loop unrolling
  - Loop index variable elimination
- Chaining across conditional blocks
- Priority-based list scheduling heuristic
  - Allows control of code motions employed
  - Dynamic application of CSE and copy propagation
- Interconnect minimizing resource binding
- FSM generation
  - Non-trivial in the presence of chaining across conditionals and multi-cycle operations
- VHDL generation

Future Directions
- Interactive GUI: ability to
  - Specify scheduling decisions
  - Timing constraints
- Ability to model complex resources
  - Pipelined resources
- Loop pipelining heuristic
- Loop transformations
  - Loop fusion
  - Loop invariant code motion
- Analysis of effects of code motions on power
- More transformations targeting microprocessor functional blocks

Thank You

Publications
- Dynamic Common Sub-Expression Elimination during Scheduling in High-Level Synthesis. S. Gupta, M. Reshadi, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. To appear in the International Symposium on System Synthesis, October 2002
- Coordinated Transformations for High-Level Synthesis of High Performance Microprocessor Blocks. S. Gupta, T. Kam, M. Kishinevsky, S. Rotem, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. Design Automation Conference, June 2002
- Conditional Speculation and its Effects on Performance and Area for High-Level Synthesis. S. Gupta, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. ISSS 2001
- Speculation Techniques for High Level Synthesis of Control Intensive Designs. S. Gupta, N. Savoiu, S. Kim, N.D. Dutt, R.K. Gupta, A. Nicolau. DAC 2001
- Analysis of High-level Address Code Transformations for Programmable Processors. S. Gupta, M. Miranda, F. Catthoor, R.K. Gupta. DATE 2000
- Book chapter: ASIC Design. S. Gupta, R.K. Gupta. Chapter 64, The VLSI Handbook, edited by Wai-Kai Chen
- Under submission to journal: Using Global Code Motions to Improve the Quality of Results for High-Level Synthesis. S. Gupta, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. Submitted to TCAD

Additional Slides

SPARK Core Strengths
- Focus on
  - Transformations that increase amount of parallelism available in the source description
  - Tightly integrate with parallelizing compiler transformations
  - Provide an HLS "toolbox" for the micro-architect
- Develop transformations that
  - Limit effects of control flow
    - Generalized code motions
  - Reduce data dependencies
    - Renaming, loop unrolling, loop index variable elimination
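
To make the last item concrete, here is a minimal C sketch of full loop unrolling followed by loop index variable elimination. It assumes a loop with a constant trip count (`LEN`, a hypothetical name), so that after unrolling every use of the index becomes a constant and the index, its increment and its comparison can all be removed.

```c
#include <assert.h>

enum { LEN = 4 };  /* assumed constant trip count */

/* Rolled form: every iteration depends on the index update and the
   loop-bound comparison as well as on the running sum. */
int sum_rolled(const int v[LEN])
{
    int s = 0;
    for (int i = 0; i < LEN; i++)
        s += v[i];
    return s;
}

/* Fully unrolled form after index variable elimination: the index is gone,
   each access uses a constant subscript, and no comparison remains. */
int sum_unrolled(const int v[LEN])
{
    int s = 0;
    s += v[0];
    s += v[1];
    s += v[2];
    s += v[3];
    return s;
}

/* Self-check: both forms agree on a sample input. */
void check_unrolling(void)
{
    int v[LEN] = {1, 2, 3, 4};
    assert(sum_rolled(v) == sum_unrolled(v));
}
```

With the index arithmetic removed, the remaining operations are only the additions on the data, which gives a scheduler more freedom to pack them into fewer cycles when resources allow.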

SPARK Framework
- Customizable extensible scheduler
  - Range of transformations in modular toolbox
    - Percolation, trailblazing, loop pipelining (RDLP)
  - Selected under heuristics and/or user control
    - Code motion, loop transformations
- Input in C and output to synthesizable RTL VHDL
  - Flow from architecture design to synthesis
- Quality of results measured in terms of
  - Scheduling results: cycles in longest path
  - Controller size: number of states in FSM
  - Logic synthesis results: critical path length, unit area

Summary of Work Done
- Developed a set of code transformations targeted towards HLS
- Implemented in a complete high-level synthesis framework
- Implemented supporting compiler passes
- Produce synthesizable VHDL output from input C
- Analyzed effects of transformations on final logic synthesis results
- Applied to moderately complex industrial benchmarks

Ongoing Work
- Loop transformations
  - Loop invariant code motion
  - Loop pipelining heuristics
  - Loop fusion
- High-level power analysis of transformations
  - Can power consumption be reduced despite increased resource utilization?

Scheduler Heuristic
[Figure: basic blocks BB 0 to BB 7 with additions a, b, c and d, annotated with the candidate code motions: speculation, motion across HTG nodes, conditional speculation (with a shown as a1 and a2), and reverse speculation]