Center for Embedded Computer Systems University of California, Irvine spark Coordinated Coarse-Grain and Fine-Grain Optimizations

Center for Embedded Computer Systems, University of California, Irvine

http://www.cecs.uci.edu/~spark

Coordinated Coarse-Grain and Fine-Grain Optimizations for High-Level Synthesis

Topic Defense

Sumit Gupta

High Level Synthesis

[Figure: a behavioral description, its control-data flow graph (an If Node on c with T and F branches), and the target architecture made up of Memory, ALU, Control, and Data path.]

    x = a + b;
    c = a < b;
    if (c) then d = e - f;
    else g = h + i;
    j = d x g;
    l = e + x;

Transform behavioral descriptions to RTL/gate level

From C to CDFG to Architecture
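As a rough illustration of the CDFG step, the example above can be written down as basic blocks connected by control edges. This is a minimal Python sketch, not Spark's internal representation: the block names `BB0` to `BB3`, the `BasicBlock` class, and the reading of `d x g` as a multiplication are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class BasicBlock:
    name: str
    ops: list                                   # (dest, lhs, operator, rhs)
    succs: dict = field(default_factory=dict)   # edge label -> successor name

def build_example_cdfg():
    """Hypothetical CDFG for the slide's example: one if-node, two branches."""
    return {
        "BB0": BasicBlock("BB0", [("x", "a", "+", "b"), ("c", "a", "<", "b")],
                          {"T": "BB1", "F": "BB2"}),
        "BB1": BasicBlock("BB1", [("d", "e", "-", "f")], {"next": "BB3"}),
        "BB2": BasicBlock("BB2", [("g", "h", "+", "i")], {"next": "BB3"}),
        # 'd x g' on the slide is assumed to mean d * g
        "BB3": BasicBlock("BB3", [("j", "d", "*", "g"), ("l", "e", "+", "x")]),
    }

def cross_block_deps(cdfg):
    """List (producer block, consumer block, variable) data dependencies."""
    defined_in = {dest: name for name, bb in cdfg.items() for dest, *_ in bb.ops}
    deps = []
    for name, bb in cdfg.items():
        for _, lhs, _, rhs in bb.ops:
            for var in (lhs, rhs):
                src = defined_in.get(var)
                if src is not None and src != name:
                    deps.append((src, name, var))
    return deps

deps = cross_block_deps(build_example_cdfg())
```

The control edges capture the conditional, while `cross_block_deps` recovers the data flow that a scheduler has to respect when it moves operations between blocks.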

High-level Synthesis

- Well-researched area: from early 1980's – so what's new?
- Level of design entry has moved up from schematic entry to coding in hardware description languages (VHDL, Verilog, C)
- No comprehensive synthesis framework
  - Few and scattered optimizations: mostly algebraic and at operation level of granularity
  - Results presented for scheduling
    - Effects on logic synthesis not understood
  - Small, synthetic benchmarks: primarily data-intensive DSP algorithms
- Quality of synthesis results severely affected by complex control flow
  - Nested ifs and loops not handled or handled poorly
- Poor understanding of the interaction between source-level and fine-grain "compiler" transformations

Focus of this Work

- Target Applications:
  - Behavioral descriptions with complex and nested conditionals and loops; for example:
    - mixed data and control-intensive multimedia and image processing applications
    - control-intensive microprocessor blocks: resource rich, few highly packed cycles
- Objectives:
  - Improve quality of HLS results by concurrency enhancement
  - Improve controllability of the HLS solutions

Characteristics of Target Applications

- Moderately control-intensive behaviors
  - Operations that execute under conditions
  - Entire behaviors within nested loops
- Programming styles significantly affect quality of results:
  - Placement of operations and control flow
  - Choice of control flow: nesting of ifs and loops
- A need for high-level and compiler transformations
  - To overcome the variance due to programming style
  - Increase resource utilization in the presence of conditionals
  - Exploit mutual exclusivity of operations to enhance resource sharing

Maximally Parallelize Operations under given Resource Constraints
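One way to picture the mutual-exclusivity point: operations guarded by opposite outcomes of the same condition never execute together, so they can share a functional unit in the same cycle. The Python sketch below assumes each operation carries a dictionary of the condition values it executes under. That representation and the function name are illustrative, not taken from Spark.

```python
def mutually_exclusive(guard_a, guard_b):
    """True if the two guards disagree on at least one shared condition.

    A guard maps a condition name to the value it must have for the
    operation to execute, e.g. {"c": True} for the 'then' branch of if (c).
    """
    return any(name in guard_b and guard_b[name] != value
               for name, value in guard_a.items())

# From the earlier example: d = e - f runs when c is true, g = h + i when c is false.
guard_d = {"c": True}
guard_g = {"c": False}
guard_l = {}                      # l = e + x runs unconditionally

can_share = mutually_exclusive(guard_d, guard_g)
cannot_share = mutually_exclusive(guard_d, guard_l)
```

Here `can_share` is true, so the subtraction and the addition could be bound to one ALU in the same control step, while the unconditional operation cannot be paired this way.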

Recent Related Work

- Code motions in the presence of conditionals
  - Condition Vector List Scheduling [Wakabayashi 89]
  - Symbolic Scheduling [Radivojevic 96]
  - WaveSched Scheduler [Lakshminarayana 98]
  - Basic Block Control Graph Scheduling [Santos 99]
- Limitations
  - Arbitrary nesting of conditionals and loops not handled or handled poorly
  - Ad hoc optimizations
    - Not part of a complete synthesis system
  - Limited analysis of logic and control costs

Parallelizing Compiler Background

- Scheduling for increasing instruction-level parallelism
  - Percolation Scheduling
    - Can produce optimal schedule given enough resources
  - Trailblazing
    - Hierarchical Code Motion Technique
  - Trace Scheduling, Superblock and Hyperblock Scheduling
- Loop Transformations
  - Loop Invariant Code Motion
  - Loop Pipelining
  - Induction Variable Analysis
  - Loop fusion, interchange, distribution
- Partial evaluation
  - CSE, Copy Propagation, Constant Folding

In the Context of High-Level Synthesis

- Cost models are different
  - Operation and Resource Models
  - Non-sequential designs
- Transformations have implications on hardware
  - Non-trivial control costs
  - Operation duplication leads to flexible scheduling; however, it can lead to higher control costs
- Mutual exclusivity of operations
  - Resource Sharing

Coarse and Fine-Grain Code Optimizations

- Beyond Basic Block Code Motions
  - Speculation
  - Reverse Speculation
  - Early Condition Execution
  - Conditional Speculation
- Dynamic Common Sub-expression Elimination
- Loop Unrolling
  - Loop Index Variable Elimination
- Chaining Operations across Conditionals
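To make the loop unrolling and loop index variable elimination items concrete, here is a small Python sketch under the assumption that the loop bounds are compile-time constants. The `unroll` helper and its `{i}` template convention are invented for this example. It shows the general idea, not Spark's implementation.

```python
def unroll(body_template, start, stop):
    """Fully unroll a loop with constant bounds.

    body_template is the loop body with '{i}' wherever the index is used.
    Each copy gets the index replaced by a constant, so the index variable,
    its increment, and the loop-exit comparison all disappear.
    """
    return [body_template.format(i=k) for k in range(start, stop)]

def run(statements, a):
    """Execute the generated statements to check they still compute the sum."""
    env = {"acc": 0, "a": a}
    for stmt in statements:
        exec(stmt, {}, env)
    return env["acc"]

# for (i = 0; i < 4; i++) acc = acc + a[i];
unrolled = unroll("acc = acc + a[{i}]", 0, 4)
total = run(unrolled, [1, 2, 3, 4])
```

After unrolling, the only dependencies left between the copies are the true data dependencies on `acc`, which is what gives a scheduler more freedom.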

Concurrency Enhancement by Code Motions

[Figure: Hierarchical Task Graph representation of a control-data flow graph with an If Node and T/F branches. Operations a, b, c are moved by Speculation, Reverse Speculation, Conditional Speculation, and code motion Across Hierarchical Blocks; a side panel shows the resulting Resource Utilization.]

Leads to higher resource utilization and shorter schedule lengths
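The code motions named on this slide can be illustrated on a tiny behavior. The Python sketch below is a software analogy under stated assumptions: the functions and variable names are invented, and in hardware the point of each motion is that the moved operation can use an otherwise idle resource in an earlier step. Speculation hoists an operation above the condition it depends on, and conditional speculation duplicates an operation from after the conditional into both branches.

```python
from itertools import product

def original(a, b, c, cond):
    if cond:
        t = a + b
        y = t - c
    else:
        y = a - c
    z = y + b                # executes after the conditional
    return z

def with_speculation(a, b, c, cond):
    t = a + b                # speculated: computed before cond is evaluated
    if cond:
        y = t - c
    else:
        y = a - c
    z = y + b
    return z

def with_conditional_speculation(a, b, c, cond):
    if cond:
        t = a + b
        y = t - c
        z = y + b            # duplicated into the true branch
    else:
        y = a - c
        z = y + b            # duplicated into the false branch
    return z

# All three versions must agree on every input.
inputs = list(product(range(-2, 3), range(-2, 3), range(-2, 3), (False, True)))
all_equal = all(
    original(*args) == with_speculation(*args) == with_conditional_speculation(*args)
    for args in inputs
)
```

Reverse speculation is the mirror image of the first case: an operation sitting before the conditional is pushed down into the branches that use its result.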

Scheduling Heuristic

[Figure: Hierarchical Task Graph with basic blocks BB 0 through BB 7 and four + operations a, b, c, d. Arrows labeled Speculate, Across HTG, and Conditional Speculation show the code motions that would bring each operation to the block being scheduled. A second panel shows the graph after scheduling.]

- Get available operations: a, b, c, d
- Determine code motions required
- Assign cost to each operation
- Schedule operation with lowest cost
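The four steps of the heuristic (get available operations, determine the code motions required, assign a cost, schedule the lowest-cost operation) can be sketched as one scheduling step in Python. The slides do not give Spark's cost function, so the penalties and the `path_length` priority below are hypothetical, as are the code motions attached to a, b, c, and d.

```python
from collections import namedtuple

# unit: resource type needed; motions: code motions required to reach the
# block being scheduled; path_length: length of the dependency chain it heads.
Op = namedtuple("Op", "name unit motions path_length")

# Hypothetical penalties per code motion (not taken from the slides).
MOTION_PENALTY = {"across_htg": 1, "speculate": 2, "conditional_speculation": 3}

def cost(op):
    """Lower is better: favor long dependency chains, penalize code motions."""
    return sum(MOTION_PENALTY[m] for m in op.motions) - 10 * op.path_length

def schedule_step(available, free_units):
    """Fill the free functional units for one step, lowest cost first."""
    free = dict(free_units)
    scheduled = []
    for op in sorted(available, key=cost):
        if free.get(op.unit, 0) > 0:
            free[op.unit] -= 1
            scheduled.append(op.name)
    return scheduled

available = [
    Op("a", "add", [], 1),
    Op("b", "add", ["speculate"], 2),
    Op("c", "add", ["across_htg"], 1),
    Op("d", "add", ["across_htg", "speculate"], 1),
]
picked = schedule_step(available, {"add": 2})
```

With two adders free, this particular cost function picks `b` and then `a`. A different weighting would change the order, which is the kind of control over code motions the heuristic is meant to expose.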

Dynamic Common Sub-expression Elimination

[Figure: Hierarchical Task Graph with basic blocks BB 0 through BB 7, shown before and after a Speculate code motion. Before: a = b + c and d = b + c sit in different basic blocks. After: dcse = b + c is computed once, and the two operations become a = dcse and d = dcse.]
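The idea in the figure is that a common sub-expression can become eliminable only after a code motion has brought the two operations together, so CSE is re-applied during scheduling. The Python sketch below models this for the a = b + c / d = b + c example. The block names, the tuple encoding of operations, and the assumption that no operand is redefined inside a block are all simplifications for illustration, not Spark's algorithm.

```python
from collections import Counter

def eliminate_common_subexpressions(ops):
    """ops: list of (dest, operator, lhs, rhs) in one block.

    Assumes no operand is redefined within the block. Repeated expressions
    are computed once into a temporary and replaced by copies.
    """
    counts = Counter((op, lhs, rhs) for _, op, lhs, rhs in ops)
    temps, out = {}, []
    for dest, op, lhs, rhs in ops:
        key = (op, lhs, rhs)
        if counts[key] < 2:
            out.append((dest, op, lhs, rhs))
            continue
        if key not in temps:
            temps[key] = f"dcse{len(temps)}"
            out.append((temps[key], op, lhs, rhs))
        out.append((dest, "copy", temps[key], None))
    return out

def speculate_and_dcse(blocks, src, dst, dest_var):
    """Move the op defining dest_var from src to dst, then rerun CSE on dst."""
    op = next(o for o in blocks[src] if o[0] == dest_var)
    blocks[src].remove(op)
    blocks[dst] = eliminate_common_subexpressions(blocks[dst] + [op])
    return blocks

# Hypothetical placement: a = b + c in BB0, d = b + c in a later block BB6.
blocks = {"BB0": [("a", "+", "b", "c")], "BB6": [("d", "+", "b", "c")]}
blocks = speculate_and_dcse(blocks, "BB6", "BB0", "d")
```

Before the move, neither block contains a repeated expression, so a CSE pass run once up front would find nothing to remove.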

Interconnect minimization by resource binding

- Minimize the complexity of steering logic
  - Multiplexors and demultiplexors
- Introduce additional interconnect constraints/costs during resource binding
- Operation and variable binding have been formulated as network flow problems

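Operation Binding

[Figure: two + operations, one over variables a, b, c and one over variables e, b, f, both bound to a single ALU whose ports carry a/e, b, and c/f.]

Bind operations with the same inputs or outputs to the same functional unit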
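The effect of this binding rule on steering logic can be estimated by counting distinct sources at each functional-unit input port. The slides formulate binding as a network flow problem; the Python sketch below swaps in a simple greedy heuristic instead, purely to show the cost being minimized. It assumes two-input operations that execute in different cycles, and the operation names and the extra operation on x, y are invented.

```python
def mux_inputs(unit_ops):
    """Distinct sources feeding the left and right input ports of one unit."""
    left = {lhs for lhs, _ in unit_ops}
    right = {rhs for _, rhs in unit_ops}
    return len(left) + len(right)

def greedy_bind(ops, n_units):
    """Bind each op to the unit where it adds the fewest new mux inputs.

    ops maps an operation name to its (lhs, rhs) operands. Ties go to the
    unit with fewer operations. This is a greedy stand-in for the
    network-flow formulation mentioned in the talk.
    """
    units = [[] for _ in range(n_units)]
    binding = {}
    for name, operands in ops.items():
        best = min(
            range(n_units),
            key=lambda u: (mux_inputs(units[u] + [operands]) - mux_inputs(units[u]),
                           len(units[u])),
        )
        units[best].append(operands)
        binding[name] = best
    return binding

# c = a + b and f = e + b share input b; the third op shares nothing.
ops = {"add_c": ("a", "b"), "add_y": ("x", "y"), "add_f": ("e", "b")}
binding = greedy_bind(ops, n_units=2)
```

The two additions that share operand `b` end up on the same unit, so that unit's right-hand port needs no multiplexer at all.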

Variable Binding

[Figure: the same ALU with variables a, e, b, c, f at its ports.]

Bind variables that are inputs or outputs to the same functional unit to the same registers

Implementation: SPARK High Level Synthesis Framework

Experimental Setup

- Benchmarks derived from several industrial designs
  - MPEG-1 Prediction Block
  - ADPCM Encoder
  - Several image processing passes from GIMP software
- Synthesized using Spark
  - Number of States in FSM
  - Cycles on Longest Path in Design
- RTL VHDL from Spark synthesized using Synopsys
  - Critical Path Length (ns) => dictates Clock Period
  - Unit Area (in terms of synthesis library used)

HLS Results for Code Motions

[Figure: two bar charts with normalized values from 0 to 1 for the ADPCM Encode, MPEG calc_forw, and MPEG pred2 benchmarks. First chart: Number of States in FSM Controller. Second chart: Cycles on Longest Path through Design. Bars correspond to the allowed code motions:
- Within Basic Blocks
- Within BBs, Across Hierarchical Blocks
- Within BBs, Across Hier Blocks, Speculation
- Within BBs, Across Hier Blocks, Speculation, Early Condition Execution
- Within BBs, Across Hier Blocks, Speculation, Early Cond Exec, Conditional Speculation]

Overall performance gains of up to 50 % in controller size and longest path cycles

Logic Synthesis Results for Code Motions

[Figure: two bar charts with normalized values from 0 to 1.2 for Critical Path (c ns), Longest Path (l cycles), Delay (c*l ns), and Unit Area. First chart: synthesis results for the MPEG Pred2 function using the LSI-10K synthesis library. Second chart: synthesis results for the ADPCM Encoder function using the LSI-10K synthesis library. Bars correspond to the allowed code motions:
- Within Basic Blocks
- Within BBs, Across Hierarchical Blocks, Speculation
- Within BBs, Across Hier Blocks, Speculation, Early Condition Execution
- Within BBs, Across Hier Blocks, Speculation, Early Cond Exec, Conditional Speculation]

Enabling all code motions leads to
- Reduced circuit delays: up to 50 %
- Increased area/interconnect costs:
  - Reduced by interconnect-aware resource binding
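The Delay metric in these charts combines the two quantities reported next to it. Written out, with c the critical path length in ns (which sets the clock period) and l the number of cycles on the longest path, and assuming the normalization is taken against a baseline configuration with values c_0 and l_0 (the slides do not say which configuration serves as the baseline):

```latex
\text{Delay} = c \times l \quad [\text{ns}],
\qquad
\text{Delay}_{\text{norm}} = \frac{c \times l}{c_0 \times l_0}
```

This is why a code motion that shortens the schedule (smaller l) can still leave the delay unchanged or worse if it lengthens the critical path (larger c).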

Results after Interconnect Minimization

[Figure: two bar charts with normalized values from 0 to 1 for Critical Path, Total Delay, and Unit Area, comparing Naïve Resource Binding with Interconnect Minimizing Resource Binding. First chart: MPEG Pred2 function synthesized using the LSI-10K library. Second chart: ADPCM Encoder function synthesized using the LSI-10K library.]

- Reductions in area of between 15 % and 32 %
- Fairly constant critical path lengths and circuit delay

Synthesis Results with Dynamic CSE

[Figure: bar chart for the MPEG Pred2 function with normalized values from 0 to 1.2 for Num of States, Longest Path (cycles), Num of Regs, Critical Path (c ns), and Unit Area. Bars: No CSE, With CSE, With Dynamic CSE, With CSE & Dynamic CSE.]

DCSE Synthesis Results: Pred0

[Figure: bar chart for the MPEG Pred0 function with normalized values from 0 to 1.2 for Num of States, Longest Path (cycles), Num of Regs, Critical Path (c ns), and Unit Area. Bars: No CSE, With CSE, With Dynamic CSE, With CSE & Dynamic CSE.]

- Delays reduce by up to 40 %
- Area reduces by up to 35 %
- Register usage reduces!

Summary of Work Done

- Speculative Code Motions
- Code Motion Techniques
  - Trailblazing
- Compiler Passes
  - Copy & Constant Propagation
  - Dead Code Elimination
  - Common Sub-Expression Elimination
- Dynamic Renaming
- Loop Unrolling
  - Loop Index Variable Elimination
- Chaining across Conditional blocks
- Priority-based List Scheduling Heuristic
  - Allows control of code motions employed
  - Dynamic application of CSE and Copy Propagation
- Interconnect Minimizing Resource Binding
- FSM Generation
  - Non-trivial in the presence of chaining across conditionals and multi-cycle operations
- VHDL Generation

Future Directions

- Interactive GUI: ability to
  - Specify scheduling decisions
  - Timing constraints
- Loop Pipelining Heuristic
- Loop Transformations
  - Loop Fusion
  - Loop Invariant Code Motion
- Analysis of effects of code motions on power
- Ability to model complex resources
  - Pipelined resources
- More transformations targeting microprocessor functional blocks

Thank You

Publications

- Dynamic Common Sub-Expression Elimination during Scheduling in High-Level Synthesis, S. Gupta, M. Reshadi, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau, to appear in the International Symposium on System Synthesis, October 2002
- Coordinated Transformations for High-Level Synthesis of High Performance Microprocessor Blocks, S. Gupta, T. Kam, M. Kishinevsky, S. Rotem, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau, Design Automation Conference, June 2002
- Conditional Speculation and its Effects on Performance and Area for High-Level Synthesis, S. Gupta, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau, ISSS 2001
- Speculation Techniques for High Level Synthesis of Control Intensive Designs, S. Gupta, N. Savoiu, S. Kim, N.D. Dutt, R.K. Gupta, A. Nicolau, DAC 2001
- Analysis of High-level Address Code Transformations for Programmable Processors, S. Gupta, M. Miranda, F. Catthoor, R.K. Gupta, DATE 2000
- Book chapter: ASIC Design, S. Gupta, R.K. Gupta, Chapter 64, The VLSI Handbook, edited by Wai-Kai Chen
- Under submission to journal: Using Global Code Motions to Improve the Quality of Results for High-Level Synthesis, S. Gupta, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau, submitted to TCAD

Additional Slides

SPARK Core Strengths

- Focus on
  - Transformations that increase the amount of parallelism available in the source description
  - Tightly integrate with parallelizing compiler transformations
  - Provide an HLS "toolbox" for the micro-architect
- Develop transformations that
  - Limit effects of control flow
    - Generalized code motions
  - Reduce data dependencies
    - Renaming, loop unrolling, loop index variable elimination

SPARK Framework

- Customizable extensible scheduler
  - Range of transformations in modular toolbox
    - Percolation, trailblazing, loop pipelining (RDLP)
  - Selected under heuristics and/or user control
    - Code motion, loop transformations
- Input in C and output to synthesizable RTL VHDL
  - Flow from architecture design to synthesis
- Quality of results measured in terms of
  - Scheduling results: cycles in longest path
  - Controller size: number of states in FSM
  - Logic synthesis results: critical path length, unit area

Summary of Work Done

- Developed a set of code transformations targeted towards HLS
- Implemented in a complete high-level synthesis framework
  - Implemented supporting compiler passes
  - Produce synthesizable VHDL output from input C
- Analyzed effects of transformations on final logic synthesis results
- Applied to moderately complex industrial benchmarks

Ongoing Work

- Loop Transformations
  - Loop Invariant Code Motion
  - Loop Pipelining Heuristics
  - Loop Fusion
- High-level power analysis of transformations
  - Can power consumption be reduced despite increased resource utilization?

Scheduler Heuristic

[Figure: Hierarchical Task Graph with basic blocks BB 0 through BB 7 and + operations a, b, c, d, shown before and after scheduling. Arrows are labeled Speculate, Across HTG, Conditional Speculation (with copies labeled a1 and a2), and Reverse Speculate.]
