
Optimization of VHDL Intermediate Forms

Keith D. Cooper, John Bennett & Linda Torczon

Rice University

[Figure: a three-cell circuit (two cells with ports D, G, Q, QN and one cell with ports S, R, Q, QN) is reduced by optimization to a single cell with ports S, R, Q, QN]

New Ideas

• Work from standard intermediate forms

• Develop benchmarks & metrics

• Develop and apply analogs of classic code optimizations

Impact

• Smaller, faster circuits from software specifications

• Wider range of application for VHDL

• Method for assessing quality of different compilers

Schedule

[Figure: timeline covering 1998, 1999, and 2000, with a Benchmark Activity track and an Optimization Activity track. Milestones: Gather Examples, Develop Metrics, Select IF, Get Tools, IF Measurement Tools, Prototype Optimizations, Distribute Code, Continue Optimization]

Project Vision

Set out to improve the quality of VHDL-derived circuits

• View circuit as a graph

→Nodes perform operations

→Edges represent flow of values

• Resembles compiler internal representations from the 60’s & 70’s

Project relies on analogy to compilation

• Cells in the design resemble operations in code

• Simple set of values

• Derive knowledge & use it to improve implementation

• Adapt classic compiler optimizations to these graphs

Motivating Example (one last time)

This VHDL

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;

    ENTITY rs IS
      PORT (s, r : IN STD_LOGIC := '0';
            q, qn : OUT STD_LOGIC);
    END rs;
    --
    ARCHITECTURE behavior OF rs IS
    BEGIN
      PROCESS (s, r)
      BEGIN -- the ff will not change if r and s are both '0'
        IF s = '0' AND r = '1' THEN
          q <= '0';
          qn <= '1';
        ELSIF s = '1' AND r = '0' THEN
          q <= '1';
          qn <= '0';
        ELSIF s = '1' AND r = '1' THEN
          q <= 'X';
          qn <= 'X';
        END IF;
      END PROCESS;
    END behavior;

Produced this design

[Figure: three cells, two with ports D, G, Q, QN and one with ports S, R, Q, QN]

Instead of this

[Figure: a single cell with ports S, R, Q, QN]

We now know more about this simple flip-flop than anyone should

Project Vision

To try this, we needed a fulcrum

• Commercial VHDL systems export & import standard IRs

• Our idea: transform those IRs and re-insert them into the system

• Using commercial tools lends credibility & realism to experiment

[Figure: tool flow with boxes labeled VHDL Code, Vendor Tools, Circuit Design, Circuit Simulation, Standard IRs, and Prototype Optimizer. Annotation: looks like a classic compiler]

Project Team

Keith D. Cooper

• Compiler-based optimization

• Analysis of programs

John Bennett

• DSM multiprocessing & performance

• Safety & reliability of latex gloves

Linda Torczon

• Back end compiler techniques

• Programming environments & tools

The Big Picture (Chicago PI Meeting, 11/97)

We propose to apply the insights derived in classic scalar compiler research to the problems of compiling the hardware description language VHDL.

Two main subprojects

• Prototype optimizer for some VHDL intermediate representation

→ Choice of IR generated inordinate interest

→ Must be used in both simulation & synthesis

• Collection of VHDL codes to evaluate translation & optimization

→ Small examples to demonstrate correctness & effectiveness

→ Large examples to observe & understand interactions

If we can demonstrate that the analogy to scalar optimization is valid, we will create a connection to forty years of research in code improvement.

Our Approach (Chicago PI Meeting, 11/97)

Two distinct world views

• View circuit as a set of logic equations

→ Rearrange & simplify with Boolean algebra

→ Fewer terms ⇒ smaller circuits

→ Shorter paths ⇒ faster circuits

• View circuit as a graph that specifies a computation

→ Analyze flow of values through the graph

→ Transform the graph to improve it

→ Makes “state” explicit & exposed

We will focus on the latter approach

• Successful optimizer will incorporate both styles

• Transformations provide framework for simplification

Project Schedule

Year 1 Select an IR

Get VHDL & IR tools in house & running

Collect test suite of VHDL Examples

Develop metrics for IR files

Year 2 Build IR measurement tools

Build prototype IR-to-IR transformations

Distribute test suite & measurement tools

Year 3 Continue optimizer development

Distribute prototype optimizer

We are 2.5 years into a 3-year project

Administrative Issues

Fiscal Matters

• Contract began 10/97, scheduled to run through 9/00

• Have received $670,000; committed $620,000 (out of $751,000)

• Under current plan, could spend most of the money by October

Personnel Changes

• John Bennett is leaving Rice in June

→E. Speight & B. Balabanos are also leaving

→The E&CE portion of the team will be gone

• Keith Cooper & Linda Torczon are staying

Road Map for Rest of Talk

Research Progress

• Roughly chronological

• Hit the insights rather than the details

Perspective

• What have we learned?

• Where should we go from here?

Choosing an IR

A Critical Decision, Made in Year 1

• Wanted available commercial tools

• Wanted low-level, detailed view of designs

• Multiple vendors would be an advantage

Examined EDIF, SAVANT/AIRE, Ocean, & Alliance

• None were ideal

• Ocean & Alliance make strong assumptions (& subset EDIF)

• SAVANT/AIRE lacked commercial-strength synthesis tools

• EDIF had the best blend of coverage & tools

Selected EDIF

• Expected to find EDIF tools available (EIA standard)

• Knew it provided pathways in & out of commercial tools

Focus on analysis & improvement from low-level facts, as opposed to source-to-source approach

Reality: available open source tools handled small EDIF subsets

Choosing Design Metrics for Evaluation

Graph builder instantiated designs from EDIF files

• Significant expansion & renaming (EDIF is hierarchical)

• Graph properties measured by walking the graph

We identified several design properties as metrics

    Number of gates     Number of nets      Placer time
    Levels of logic     Outputs per net     Router time
    Gates per level     Number of cycles    Connectedness

Can walk graph and measure most of these ...

From Santa Fe Poster
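
The graph-walking measurements can be sketched as follows. This is a minimal Python illustration, not the project's C graph package: the netlist layout, the gate names, and the function names are all assumptions made for the example. It computes three of the listed metrics (number of gates, levels of logic, gates per level) and only handles acyclic designs.

```python
from collections import defaultdict

# Hypothetical netlist: gate name -> (cell type, list of input names).
# Names that are not gates (here "a", "b", "c") are primary inputs.
CIRCUIT = {
    "g1": ("AND2", ["a", "b"]),
    "g2": ("OR2", ["b", "c"]),
    "g3": ("AND2", ["g1", "g2"]),
    "g4": ("INV", ["g3"]),
}

def logic_levels(circuit):
    """Return {gate: level}; primary inputs sit at level 0 (acyclic only)."""
    levels = {}

    def level_of(name):
        if name not in circuit:          # primary input
            return 0
        if name not in levels:
            _, inputs = circuit[name]
            levels[name] = 1 + max((level_of(i) for i in inputs), default=0)
        return levels[name]

    for gate in circuit:
        level_of(gate)
    return levels

def measure(circuit):
    """Walk the graph once and collect a few design metrics."""
    levels = logic_levels(circuit)
    per_level = defaultdict(int)
    for lvl in levels.values():
        per_level[lvl] += 1
    return {
        "gates": len(circuit),
        "levels_of_logic": max(levels.values(), default=0),
        "gates_per_level": dict(per_level),
    }

metrics = measure(CIRCUIT)
print(metrics)
```

Placer time and router time cannot be derived this way; they have to come from running the vendor tools.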

Acquiring and Building the Tools

Lack of open source tools meant much more development

• Designed & built parser for EDIF 2.0

• Distributed via WWW, used in other projects

• Needed (& built) many other components

→Graph package, pretty-printer, AST→graph, graph→EDIF

→Value numbering, reordering tool, peephole pass

→Tool to understand libraries & build a database

• More development than we anticipated

• If these have value to others, we will distribute them

About 5,000 lines of specs + C code, which expand to 11,000 lines

[Figure: prototype optimizer pipeline with stages Parser, Graph, Opt’ns, Printer]

Building Prototype Optimizer & Measurement Tools

Designed & built a value numbering tool

• Extended Balke’s classic algorithm for Boolean circuits

• Assigns a name to each cell output s.t. same name ⇔ same value

• Finds many value identities

Built a small peephole optimization phase

• Identifies cascaded operations that can be combined

• Common in EDIF generated from behavioral codes

• Broaden the range of inputs that generate good designs

Measurement tools involve graph traversals

• Most built into graph package & infrastructure

• Can pull them out and package them separately

Vendor tools make same measurements

Gathering Examples

Benchmarks for translator performance

• Small examples to illustrate specific principles

• Large examples to get realistic results

Built up a collection of VHDL/EDIF codes

• Some developed in-house, some from outside sources

• Include various ACS Program benchmarks

• Do not have permission to redistribute some of them

• Range from 100 to 40,000 lines of EDIF

Have used these extensively in our debugging efforts

Diversion on Place & Route Time

Dr. Munoz asked us to look at place & route problem

• Did a small study on the impact of order on place & route time

• Reversing generated order showed small, consistent improvements

• Raised some skepticism, but has a simple rational basis

Our examples are not hard enough to be conclusive

• Need larger circuits with very long place & route times

• Need to explore more reordering metrics, along with selective replication to reduce connectivity

• Need tools that handle larger examples (more on this later)

Generating the nodes in hierarchical order places the most constrained nodes last

Diversion on Optimizing Mapped Circuits

We were asked about optimizing mapped circuits (4/99)

Developed technique to address this problem

• Analysis to isolate sequential circuits set off by latches

• Extract equations for these sequential portions

• Use external simplifier & resynthesize circuit

• Hand-simulation removed 1 of 18 CLBs from pipelined DFA

• Balabanos is working to improve methods & broaden their applicability

Diversion on Optimizing Mapped Circuits

[Figure: mapped DFA circuit. Annotation: removed this LUT from the DFA]

Status Today

Several working tools

• Parser, pretty-printer, graph-builder, IR measurements

• Value numbering appears to work

• Still debugging graph→EDIF translation

→Difficulties getting big examples through translation

• Minimal ability to work with larger codes

→Limits place & route work, testing of benchmark codes

Have proved that the ideas, in concept, have merit

• Can optimize original motivating example

• Find redundancy in many of the examples

• Our analogy to programming languages worked

Stymies work on almost all fronts

Perspective

Building a full-scale prototype is a huge task

EDIF Issues

• Level of detail in EDIF (and in the designs) is enormous

• EDIF encodes much of that detail into vendor-specific libraries

→Requires a deep understanding of libraries

→New examples involve new & obscure corners of the library

• Scaling up entails changing devices and libraries (larger FPGAs)

Debugging Issues

• Debugging has been the Achilles’ heel of this project

• Inherent problem in the design of the system

Forced to reimplement many of the internal details of the synthesis tools

Debugging the Optimizer

[Figure: tool flow with boxes labeled VHDL Code, Vendor Tools, Circuit Design, Circuit Simulation, Standard IRs, and Prototype Optimizer]

Building the graph makes the design unrecognizable

If it all works, it works

• New EDIF becomes new design

If the optimizer has bugs, ...

• It destroyed the name space

• It removed any landmarks

• Vendor tools no longer help

Perspective

This was a high-risk effort

• We succeeded in proving the concept

• Prototype works on modest examples

• Prototype does not scale to large examples (details)

To build a full-scale prototype

• Should work with internals of an existing commercial system

• Substantial programming effort (4-5 people)

⇒ Better done by a vendor than by an academic project

Vendors are moving in this direction

• Synopsys, for example, has been hiring classical compiler writers

Our options include

• Continue on our current path until the end of the contract

→Focus on debugging and scaling the prototype

→May succeed, or may not

• Clean up what is there & make it available on the WWW

→Caveats on prototype & scaling

→Focus on publishing algorithms, problems, & solutions

We’re looking for guidance

Where do we go from here?

Do Not Cross This Line

The following slides are the bits and pieces used to put the main talk together. These are not intended for public consumption, and are not part of the main talk. However, they may be useful in any subsequent discussion.

The Really Big Picture

If we succeed

• Behavioral VHDL might become practical

• Larger pool of designers, larger set of applications

• PC-based FPGA board used for low-volume applications

VHDL generated from other tools becomes more attractive

• LabView

• High-level programming languages

• All require better optimization

And, faster place & route times

There are several lessons here.

1) The example turns out to be quite subtle. We need to look at more diverse examples, large and small.

2) The translator should be cautious about mapping indeterminate values to equal values. (In data-flow analysis, a similar issue arises with uninitialized values. We end up representing them with a “top” in lattice theory rather than a “bottom”. Something similar might be useful here.)

3) Hand simulation of value numbering pays off on an appropriate circuit.

4) We need a parser to look at larger circuits, such as the FPU.

Lessons

Lessons We’ve Learned

Need full power of “global” algorithms

• Cyclic graphs abound

• Regional methods are NOT sufficient

Our “language” analogy holds up moderately well

• Value numbering, peephole optimization carry over

• Database on library resembles target-specific knowledge

There is no substitute for studying examples

• Understand the weaknesses of translation

• Recognize opportunities for improvement

An Aside about EDIF

Adopted LISP-like syntax model

• Parentheses as brackets

• Operator immediately follows opening parenthesis

• Should simplify parsing

However

• Keyword options are syntactically constrained

• Many options, many “statements”, each idiosyncratic

• Grammar has roughly 1,000 productions

→Twice the size of Fortran, three times the size of C

• Throws away many of the benefits of LISP-like syntax
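
To show why the LISP-like model "should simplify parsing", here is a minimal generic S-expression reader in Python. It is a sketch under stated assumptions: it knows nothing about the EDIF grammar or its keyword constraints, the sample text is a simplified EDIF-like fragment invented for illustration, and the function names are hypothetical.

```python
import re

def tokenize(text):
    """Split text into '(', ')', double-quoted strings, and bare atoms."""
    return re.findall(r'\(|\)|"[^"]*"|[^\s()]+', text)

def read(tokens):
    """Read one form from the front of the token list (consumes tokens)."""
    token = tokens.pop(0)
    if token == "(":
        form = []
        while tokens[0] != ")":
            form.append(read(tokens))
        tokens.pop(0)                    # drop the closing parenthesis
        return form
    if token == ")":
        raise SyntaxError("unexpected ')'")
    return token

# Simplified, EDIF-like sample (not a complete or valid EDIF file).
SAMPLE = "(cell AND2 (port a (direction INPUT)) (port y (direction OUTPUT)))"
tree = read(tokenize(SAMPLE))
print(tree[0])   # the operator directly follows the opening parenthesis
```

The reader itself is a dozen lines. The difficulty described on this slide lies in everything that comes after it: checking which keywords and options are legal in which positions.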

Library Knowledge Base

Optimizer needs to “understand” the library

• Encoded in a 2-way hash table

• Similar to a relational database

• Records # of ports, their names, commutativity, function code

• Cell with >1 output needs a hash encoding for value numbering

Building the knowledge base

• Version 1 was hand-coded C function (5 to 10 lines per cell type)

• Version 2 automates much of the work

• Tool walks designs & generates the “generic” information

• Still need to generate some custom information

→Commutativity, hash encodings for multi-output cells

→Note which “dead” cells are critical (pads, buffers,…)
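
The kind of record the knowledge base holds can be sketched as below. This is an illustrative Python dictionary, not the project's 2-way hash table: the `CellInfo` fields, the cell and port names, and the `hash_key` helper are assumptions chosen to mirror the items listed above (ports and their names, commutativity, function code, critical "dead" cells).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CellInfo:
    inputs: tuple            # input port names
    outputs: tuple           # output port names
    commutative: bool        # may the inputs be reordered?
    function: str            # function code used when hashing
    critical: bool = False   # keep the cell even if it looks "dead"

# Hypothetical entries; a real library has many more cell types.
LIBRARY = {
    "AND2": CellInfo(("I0", "I1"), ("O",), True, "and"),
    "INV": CellInfo(("I",), ("O",), False, "not"),
    "IBUF": CellInfo(("I",), ("O",), False, "buf", critical=True),
}

def hash_key(cell_type, operand_vns):
    """Build the value-numbering hash key for one single-output cell."""
    info = LIBRARY[cell_type]
    if info.commutative:
        vns = tuple(sorted(operand_vns))   # operand order does not matter
    else:
        vns = tuple(operand_vns)
    return (info.function, vns)

print(hash_key("AND2", [3, 1]))
```

Multi-output cells would need one key per output, for example by adding the output port name to the key; that part is omitted here.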

Backup Slides on Place & Route

Place & Route Material

Place & Route Time

Many things affect place & route time

Some things we are doing may help, too

• Optimization shrinks & simplifies circuits

• Are there other simple tricks we can use?

This suggested a simple experiment

• Examine impact of order on place & route time

• Might lead to preprocessor that speeds up place & route

• Experiment is a proof of concept, not a finished work

Caveat: We are not working on place & route algorithms

Place & Route Time

Many heuristic techniques are order-sensitive

• Simple experiment: try routing in permuted orders

• Built a tool to reorder presentation of circuit in EDIF

• Generates four orders (original, reverse, ascending net size, descending net size)

• Ran a series of codes through test

Goal was to determine efficacy of approach

• Any significant variation => should choose an order carefully

• Experiment was a proof of concept

Results suggest that we should investigate better orderings
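
The four orderings used in the experiment can be sketched as follows. This Python fragment is only an illustration: the net list, the pin names, and the choice of "number of connected pins" as the net size are assumptions, and the real tool rewrote EDIF files rather than Python lists.

```python
# Hypothetical nets: (name, list of connected pins).
NETS = [
    ("n1", ["u1.O", "u2.I0", "u3.I0"]),
    ("n2", ["u2.O", "u4.I"]),
    ("n3", ["u3.O", "u4.I1", "u5.I0", "u6.I0"]),
]

def net_size(net):
    """Size of a net, taken here to be its number of pins."""
    return len(net[1])

def orderings(nets):
    """Return the four presentation orders compared in the experiment."""
    return {
        "Orig": list(nets),
        "Rev": list(reversed(nets)),
        "Asc": sorted(nets, key=net_size),
        "Des": sorted(nets, key=net_size, reverse=True),
    }

names = {label: [name for name, _ in order]
         for label, order in orderings(NETS).items()}
for label, order in names.items():
    print(label, order)
```

Each ordering would then be written back out and run through place & route, with the times compared against the original order.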

[Figure: bar chart. Y-axis: Place + Route Time (Seconds), 0 to 600. X-axis: Codes, including Scalability fft1 through fft4, Timing, Versatility, Interfacing Erode, Interfacing Dilate, Capacity, smallshift, mantmux, mux4_10, and mux4_5. Series: Orig, Asc, Des, Rev]

Optimizing VHDL Intermediate Forms, DARPA/USAFRL/Rice University, February 1999, Page 33

These codes may be too small to matter

Place & Route Time

Summary

• Experiment shows that order matters

• Need to explore other ordering schemes

• Potential improvement is significant

Future work

• Need examples that are hard to place & route

• Need ability to work with different device libraries

→Automatic generation of knowledge base

• Try orders based on locality, on adjacency, on difficulty, …

• Need tools to scale to large examples

Backup Slides on Value Numbering

Value Numbering

Value Numbering Circuits

Have built a prototype value numbering pass

• Based on Balke’s 1968 algorithm

• Walks graph & assigns an integer to each value (a value number)

• Detects redundancies, knowable values

• Need extensions to handle cyclic graphs

• Operates on acyclic graphs (basic blocks)

Discovers identical values in linear-time pass over code

• Natural framework for algebraic identities

• Recognizes value identity, not lexical identity

• Easily handles commutativity

• Encodes equivalence into hashing operation

5,000 lines of code

Value Numbering Circuits


Results:

• Can eliminate one D-latch from our RS flip-flop

• Finds redundancies in other circuits

• Need to make large-scale measurements

Need to build the transformer to rewrite the circuit

Value Numbering Circuits

Balke’s original algorithm

For each instruction i in the block

1. Get vn’s for each operand

2. Hash operand & operators to get i’s vn

3. Already exists => replace i with a reference

4. Operands all constant => evaluate & replace with value

Discovers identical values in linear-time pass over code

• Natural framework for algebraic identities

• Recognizes value identity, not lexical identity

• Easily handles commutativity

• Encodes equivalence into hashing operation
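
Steps 1 through 3 of the algorithm above can be sketched in Python as follows. This is a minimal illustration on a straight-line block, not the project's 5,000-line C implementation: the tuple layout, the operator names, and the `replaced` map are assumptions, and step 4 (constant folding) is left out to keep the sketch short.

```python
COMMUTATIVE = {"AND", "OR", "XOR"}

def value_number(block, inputs):
    """block: list of (dest, op, operands) in topological order.

    Returns (vn, replaced): the value number of every name, and a map
    from each redundant dest to the earlier dest with the same value.
    """
    vn = {name: i + 1 for i, name in enumerate(inputs)}
    table = {}       # (op, operand value numbers) -> value number
    owner = {}       # value number -> first dest that computed it
    replaced = {}
    next_vn = len(inputs) + 1
    for dest, op, operands in block:
        operand_vns = [vn[o] for o in operands]       # step 1
        if op in COMMUTATIVE:
            operand_vns.sort()                        # a AND b == b AND a
        key = (op, tuple(operand_vns))                # step 2
        if key in table:                              # step 3
            vn[dest] = table[key]
            replaced[dest] = owner[table[key]]
        else:
            table[key] = vn[dest] = next_vn
            owner[next_vn] = dest
            next_vn += 1
    return vn, replaced

BLOCK = [
    ("t1", "AND", ["a", "b"]),
    ("t2", "AND", ["b", "a"]),    # same value as t1 by commutativity
    ("t3", "OR", ["t1", "c"]),
    ("t4", "OR", ["c", "t2"]),    # same value as t3, lexically different
]
vn, replaced = value_number(BLOCK, ["a", "b", "c"])
print(replaced)
```

The example finds `t4` redundant even though its text never matches `t3`, which is the "value identity, not lexical identity" point made above.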

Value Numbering Circuits

Required several extensions to classic approach

• Propagation of negation

→Use positive & negative integers

→Invert(x) => -x, rather than a new VN

→Simplify double negatives arithmetically

• Variable number of inputs & outputs

• Features like IN_OUT ports and UNKNOWN ports

Points way to further extensions

• Algebraic identities & simplifications

• Constant (or known) values

• Controlled replication
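
The negation extension described on this slide can be sketched as below. This Python fragment is an assumption-laden illustration: cell and net names are invented, every non-inverter cell is treated as commutative, and value numbers start at 1 so that a number and its negation never coincide.

```python
def number_with_negation(cells, inputs):
    """cells: list of (dest, op, operands) in topological order."""
    vn = {name: i + 1 for i, name in enumerate(inputs)}   # never 0
    table = {}
    next_vn = len(inputs) + 1
    for dest, op, operands in cells:
        if op == "INV":
            # Invert(x) => -x; a double negative cancels arithmetically.
            vn[dest] = -vn[operands[0]]
            continue
        key = (op, tuple(sorted(vn[o] for o in operands)))
        if key not in table:
            table[key] = next_vn
            next_vn += 1
        vn[dest] = table[key]
    return vn

CELLS = [
    ("n1", "INV", ["a"]),
    ("n2", "INV", ["n1"]),       # double inversion of a
    ("x", "AND", ["a", "b"]),
    ("y", "AND", ["n2", "b"]),   # same value as x once n2 == a is known
]
vn = number_with_negation(CELLS, ["a", "b"])
print(vn)
```

Because the inverter never allocates a new number, the pass sees through inverter chains without any extra pattern matching.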

Work should be of interest to compiler community

Value Numbering Circuits

Need for “optimism”

• Allow propagation of values into a cycle

• In classic data-flow analysis, add top to the lattice

• In value numbering, use DFA minimization algorithm (partitioning)

Use Simpson’s SCC technique

• Use two tables, hopeful and proven

• Iterate over cycles, hopefully, then record truth

• Much faster than partitioning algorithm (1:4:10)

• Allows algebraic identities & simplifications (not in partitioning)

• Uses Balke’s algorithm as the base step

Cycles introduced by designs & by libraries (XC7000 vs. XC4000E)
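
The idea of optimism can be illustrated with a simplified sketch. The Python below is not Simpson's SCC algorithm: it iterates over the whole graph with a single hopeful table until nothing changes, instead of working one strongly connected component at a time with separate hopeful and proven tables. The cell names, the `TOP` marker, and the assumption that every cell is commutative are all illustrative choices. The sketch also ignores the question of initial latch state.

```python
TOP = "<top>"   # optimistic initial value: "could equal anything"

def optimistic_value_numbers(cells, inputs):
    """cells: dict name -> (op, operands); cycles are allowed.

    Input names, cell names, and TOP are assumed to be distinct strings.
    """
    vn = {name: name for name in inputs}
    vn.update({name: TOP for name in cells})
    changed = True
    while changed:
        changed = False
        hopeful = {}                        # rebuilt on every pass
        for name, (op, operands) in cells.items():
            key = (op, tuple(sorted(vn[o] for o in operands)))
            new = hopeful.setdefault(key, name)   # first cell names the class
            if vn[name] != new:
                vn[name] = new
                changed = True
    return vn

# Two cross-coupled NOR latches driven by the same inputs.
LATCHES = {
    "q1": ("NOR", ["r", "qn1"]),
    "qn1": ("NOR", ["s", "q1"]),
    "q2": ("NOR", ["r", "qn2"]),
    "qn2": ("NOR", ["s", "q2"]),
}
vn = optimistic_value_numbers(LATCHES, ["r", "s"])
print(vn)
```

A pessimistic pass cannot relate `q1` and `q2`, because each depends on a value inside its own cycle that has no number yet. Starting every cell at `TOP` lets the second latch be matched against the first and then confirmed on the next pass.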

Value Numbering Circuits

The implementation is a prototype

• Uses a hard-coded database for specific target

• Uses a worklist iterative algorithm

• Uses classical (pessimistic) analysis (5,000 lines of code)

It should be refined and (perhaps) rewritten

• Derive database from library specification

• Implement efficient optimism (Simpson’s SCC)

• Build code to re-generate the EDIF

We expect to distribute the code later this year

RS Flip-Flop Example

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;

    ENTITY rs IS
      PORT (s, r : IN STD_LOGIC := '0';
            q, qn : OUT STD_LOGIC);
    END rs;
    --
    ARCHITECTURE behavior OF rs IS
    BEGIN
      PROCESS (s, r)
      BEGIN -- the ff will not change if r and s are both '0'
        IF s = '0' AND r = '1' THEN
          q <= '0';
          qn <= '1';
        ELSIF s = '1' AND r = '0' THEN
          q <= '1';
          qn <= '0';
        ELSIF s = '1' AND r = '1' THEN
          q <= '1';
          qn <= '0';
        END IF;
      END PROCESS;
    END behavior;

Prototype pass reduces this to use a single latch

[Figure: synthesized circuit with each net labeled by its value number; repeated labels mark nets that carry the same value]

Backup Slides on Peephole Optimization

Peephole Optimization

Peephole Optimization Pass

Analogy is to instruction selection

• Discover rough edges in translation

• Clean up leftovers from other passes

• Place to apply limited pattern matching

• Find operations that can be combined (AND2 & AND2)

• May be place to do re-association

Implementation

• Knowledge base about target library

• Linear traversal of circuit graph with limited window

• Uses logical rather than physical adjacency

• Code is quite small and efficient (100s of lines of code)
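
One such pattern, combining cascaded AND2 cells, can be sketched as follows. This Python fragment is a hypothetical illustration rather than the project's pass: the netlist layout is invented, the existence of an `AND3` cell in the target library is an assumption, and only this one pattern is handled. The window is "logical": it follows connections in the graph, not positions in the file.

```python
def combine_and2(cells, outputs):
    """cells: dict output net -> (cell type, input nets).

    Merge an AND2 that feeds exactly one other AND2 into a single AND3.
    Nets listed in `outputs` are design outputs and are never removed.
    """
    fanout = {}
    for _, ins in cells.values():
        for net in ins:
            fanout[net] = fanout.get(net, 0) + 1
    result = dict(cells)
    for out in list(cells):
        if out not in result or result[out][0] != "AND2":
            continue                      # already merged away or rewritten
        ins = result[out][1]
        for i, net in enumerate(ins):
            inner = result.get(net)
            if (inner and inner[0] == "AND2"
                    and fanout.get(net) == 1 and net not in outputs):
                result[out] = ("AND3", [*inner[1], ins[1 - i]])
                del result[net]           # the inner cell had no other user
                break
    return result

CELLS = {
    "t1": ("AND2", ["a", "b"]),
    "y": ("AND2", ["t1", "c"]),
}
merged = combine_and2(CELLS, outputs={"y"})
print(merged)
```

The fanout check matters: if `t1` drove any other cell, deleting it would break that cell, so the pattern is skipped.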

EDIF Parser

We built & distributed an EDIF parser

• At last review, issue of a parser on the EDIF CD

• CD includes an EBNF grammar, not a parser

• We evaluated the publicly available parsers, none was acceptable

• This took about six weeks

Our parser

• Full EDIF 2.0.0 grammar

• Builds an abstract syntax tree, serves as basis for tools

• 3,400 lines of specification => 11,000 lines of code

• Additional 1,500 lines of support code

• Available from our web site (Release 2)

Brent Nelson (BYU) is using our parser in his ACS project (active user)

Design Graph

We designed & implemented a design graph

• Constructed by walking the AST & instantiating

• Serves as basis for analysis & transformation

• Enables “whole design” analysis and transformation

• Have working implementation

• 4,800 lines of code (1,000 interface + 3,800 low-level details)

• Will distribute more robust version this summer

This will serve as basis for transformations

• Used by value numbering & peephole passes

• Used for interim evaluation of design properties

Original Example

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;

    ENTITY rs IS
      PORT (s, r : IN STD_LOGIC := '0';
            q, qn : OUT STD_LOGIC);
    END rs;
    --
    ARCHITECTURE behavior OF rs IS
    BEGIN
      PROCESS (s, r)
      BEGIN -- the ff will not change if r and s are both '0'
        IF s = '0' AND r = '1' THEN
          q <= '0';
          qn <= '1';
        ELSIF s = '1' AND r = '0' THEN
          q <= '1';
          qn <= '0';
        ELSIF s = '1' AND r = '1' THEN
          q <= 'X';
          qn <= 'X';
        END IF;
      END PROCESS;
    END behavior;

This version cannot be optimized down to 1 latch

(It’s not a flip-flop!)

Slight Variation

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;

    ENTITY rs IS
      PORT (s, r : IN STD_LOGIC := '0';
            q, qn : OUT STD_LOGIC);
    END rs;
    --
    ARCHITECTURE behavior OF rs IS
    BEGIN
      PROCESS (s, r)
      BEGIN -- the ff will not change if r and s are both '0'
        IF s = '0' AND r = '1' THEN
          q <= '0';
          qn <= '1';
        ELSIF s = '1' AND r = '0' THEN
          q <= '1';
          qn <= '0';
        ELSIF s = '1' AND r = '1' THEN
          q <= '1';
          qn <= '0';
        END IF;
      END PROCESS;
    END behavior;

Prototype pass reduces this to use a single latch

[Figure: synthesized circuit with each net labeled by its value number; repeated labels mark nets that carry the same value]

The Other Variation

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;

    ENTITY rs IS
      PORT (s, r : IN STD_LOGIC := '0';
            q, qn : OUT STD_LOGIC);
    END rs;
    --
    ARCHITECTURE behavior OF rs IS
    BEGIN
      PROCESS (s, r)
      BEGIN -- the ff will not change if r and s are both '0'
        IF s = '0' AND r = '1' THEN
          q <= '0';
          qn <= '1';
        ELSIF s = '1' AND r = '0' THEN
          q <= '1';
          qn <= '0';
        ELSIF s = '1' AND r = '1' THEN
          q <= '0';
          qn <= '1';
        END IF;
      END PROCESS;
    END behavior;

Value numbering also finds a single-latch version of this circuit