12
Adaptive Computing Systems (ACS) Domain for Implementing DSP Algorithms in Reconfigurable Hardware John Zaino, Eric Pauer, Ken Smith, Paul Fiore, Jairam Ramanathan, Cory Myers {john.c.zaino, ken.smith, paul.d.fiore, cory.s.myers}@lmco.com, [email protected], [email protected] Fourth Biennial Ptolemy Miniconference 22-23 March 2001 03/07/01 Page - 2 BAE SYSTEMS • University of California, Berkeley Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001 ACS Domain DSP Algorithms in Reconfig. HW Automatic Implementation Objective/Approach/Process Reconfigurable computing technology offers significant performance gains, e.g. 10X ops per watt and/or ops per cubic inch, over general purpose programmable solutions without the need to develop custom hardware. Today however, development of a working implementation requires hardware design expertise and generation of a good implementation requires many slow iterations between an algorithm designer and a hardware developer. Objective - reduce the design time for an initial implementation to hours and for an optimized implementation to days, for a range of signal processing applications Approach - provide the algorithm developer with tools to help analyze algorithms, understand their implications for hardware, and rapidly implement their chosen solutions In the process, isolate the algorithm developer from the hardware designer through a set of library elements that provide well-defined interfaces to both communities Direct mapping of algorithm to adaptive computing system implementation.

Adaptive Computing Systems (ACS) Domain for Implementing

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Adaptive Computing Systems (ACS) Domain for Implementing

Adaptive Computing Systems (ACS)Domain for Implementing DSP

Algorithms inReconfigurable Hardware John Zaino, Eric Pauer, Ken Smith, Paul Fiore,

Jairam Ramanathan, Cory Myers{john.c.zaino, ken.smith, paul.d.fiore, cory.s.myers}@lmco.com,

[email protected], [email protected]

Fourth Biennial Ptolemy Miniconference22-23 March 2001

03/07/01 Page - 2BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Automatic Implementation

Objective/Approach/Process• Reconfigurable computing technology offers significant performance gains, e.g. 10X

ops per watt and/or ops per cubic inch, over general purpose programmablesolutions without the need to develop custom hardware. Today however,development of a working implementation requires hardware design expertise andgeneration of a good implementation requires many slow iterations between analgorithm designer and a hardware developer.

• Objective - reduce the design time for an initial implementation to hours and for anoptimized implementation to days, for a range of signal processing applications

• Approach - provide the algorithm developer with tools to help analyze algorithms,understand their implications for hardware, and rapidly implement their chosensolutions

• In the process, isolate the algorithm developer from the hardware designer through aset of library elements that provide well-defined interfaces to both communities

Direct mapping of algorithmto adaptive computing system

implementation.

Page 2: Adaptive Computing Systems (ACS) Domain for Implementing

03/07/01 Page - 3BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Technical Attributes• Development of Adaptive Computing Systems domain under Ptolemy

Classic– Allows alternative implementations from same dataflow graph– Provides floating point simulation, fixed point simulation, C code generation and

VHDL code generation– Released first three versions of ACS domain in Ptolemy Classic

• End-to-end capability to map signal processing dataflow graph to workingreconfigurable computing implementation

• Design space exploration automated– Bit width optimization theory (Markovian modeling) developed for algorithm

analysis– Bit width optimization tool implemented to trade signal to noise ratio versus

hardware complexity

• Pipeline alignment and scheduling algorithms implemented– Automatically generate “algorithm-specific” sequencer and memory control logic

– Uni-rate and multi-rate signal processing– Single and multi-FPGA implementations

• Smart Generators- parameterizable algorithmic blocks

03/07/01 Page - 4BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Analysis and Mapping in ACS Environment

Floating PointSimulation

Fixed PointSimulation

Bit Width AnalysisNoise Distribution Analysis

AlgorithmRearrangement

Dataflow Graph

Algorithm Analysis

PerformanceModeling

AutomaticScheduling

Algorithm Mapping

Smart Generators

Partitioning andMapping

GeneratorSelection

VHDL Interface LibrariesAdaptive Computing

Resource

Precision AnalysisAlternative Implementations

Dataflow Graph

Performance Metrics

Device Programming

Allocated Functions

• SNR analysis• Alternative

implementations• Functional

approximations

• Timing and sizingestimation

• Scheduling• Partitioning

across multipleFPGAs

• Device program• Interface program

CommonDatabase

inPtolemy

Page 3: Adaptive Computing Systems (ACS) Domain for Implementing

03/07/01 Page - 5BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Design Approach

Device Driver

Dataflow Graph

Device Program

HardwareConfiguration Library

Host Processor

Application InterfaceGeneration

Represent in Dataflow

Analysis and Simulation

Bit WidthAnalysis Noise

DistributionAnalysis

AlgorithmRearrangement

Algorithm Analysis

PerformanceModeling

Algorithm Mapping

Smart Generators

Partitioning &Mapping

Precision AnalysisAlternative

ImplementationsSignal Flow Graph

Performance Metrics

Allocated Functions

Floating PointSimulation

Fixed PointSimulation

AutomaticScheduling

CommonDatabase

inPtolemy

Generator Selection

VHDL Interface Libraries

Signal Processing Algorithm

Ptolemy Environment

Compute LibrariesOperating System

Run-Time Manager

ApplicationSoftware

Logic Generation

Floorplanner

RoutingEnhanced or New Capability

Existing Tool or Hardware

Legend

ReconfigurableHardware

Run TimeDesign Time

03/07/01 Page - 6BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Program Progress• Algorithm analysis

– Representation for alternative implementations was incorporatedas part of Ptolemy integration

– Side-by-side simulation capability incorporated as part ofPtolemy integration

– Developed bit width optimization theory for algorithm analysisand extended to include multiple devices and constraints

– Implemented wordlength optimization tool

• Algorithm mapping– Cost analysis included as part of wordlength analysis– Implemented uni-rate and multi-rate pipeline alignment and

scheduling algorithm for signal processing dataflow graphs– One-to-one and one-to-many mapping of functions to blocks

supported

Page 4: Adaptive Computing Systems (ACS) Domain for Implementing

03/07/01 Page - 7BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Program Progress• Smart generators

– Implemented portable logic synthesis methodology with VHDL as first target– Integrated Xilinx Core 4,000-series generators capability within VHDL code generation– Implemented smart generators for state machine sequencer and memory control (address

generator)

• Ptolemy integration– Released initial version of “Adaptive Computing Systems” domain in Ptolemy in June 1998.

Second release April 1999. Third release in August 2000.– ACS domain supports alternative implementations from a common interface. Floating point

simulation, fixed point simulation, C code generation, and VHDL code generation.

• Demonstration– Selected Annapolis Micro Systems WildforceTM board for demonstrations– Established ACS demonstration environment for Solaris– Integrated WildforceTM board under Ptolemy– Demonstrated Winograd-based FSIC receiver and FFT-based signal detector– Procured and installed Annapolis Micro systems WildstarTM board under Solaris– SHARP/HRR (High Range Resolution Radar ATR) algorithm modeled - hardware development

& testing nearing completion

03/07/01 Page - 8BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

ACS Domain• Determined that extending old domains could not be justified• New paradigm for Ptolemy, e.g. multiple implementations of a single star

Page 5: Adaptive Computing Systems (ACS) Domain for Implementing

03/07/01 Page - 9BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

ACS Domain• New ACS domain to facilitate movement among simulation

and code/design generation• Corona contains interface specification• Core contains an implementation• ACS Stars are composed of one corona and multiple cores• Core selection via targeting defines implementation

Fixed_PointSimulation

Core

C CodeGeneration

Core

FPGA DesignGeneration

Core

Floating_PointSimulation

Core

Corona

TargetsCorona Core

03/07/01 Page - 10BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Selecting Among Alternative Implementations

• Alternative implementations are represented as “targets” with coresfor each star/functional block

• Targets can have parameters• Floating point simulation, fixed point simulation, C code generation,

and FPGA design generation are available.

Page 6: Adaptive Computing Systems (ACS) Domain for Implementing

03/07/01 Page - 11BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Algorithm Analysis

ImplementationTrades

...

...

Coeffs Data

Acc1 Acc2 Acc3

Yn =a0 Xn +a1 Xn - 1+a2Xn - 2

Yn + 1=a0 Xn + 1+a1 Xn +a2Xn - 1

Yn + 2=a0 Xn + 2+a1 Xn + 1+a2 Xn

...P ( z )

Q u a n t i ze

L o o p F i l t e r

......

FAFA...

FA

FAFA...

FA

... ...

Basic FIR

Reduce Wordlength

Reduce Taplength

Systolic

Low Power

Bit Serial

Multirate

MultipleRepresentations

Perform Trade-Offs• Precision (float vs.

fixed, wordlengths)• Speed• Size/Area• Latency

1z

...IQ

N/3 DELAYS

...1z

1z

1z

1z

1z

1z

2N/3 DELAYS

Angle

Freq.Est.

IQ

1z

1z

N DELAYS

Angle

Freq.Est.

1z

...

N MultsN Adds

1 Scaling7 Adds

AlgorithmAnalysis

03/07/01 Page - 12BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Wordlength Optimization Analysis

QuantizationNoise(SNR)

Hardware Cost

DynamicRange

OptimalDesign

Choices

Page 7: Adaptive Computing Systems (ACS) Domain for Implementing

03/07/01 Page - 13BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Algorithm Mapping• Objectives

– Performance Modeling – provide feedback on utilization, throughput, efficiency, etc.Feedback should be used by algorithm analysis capabilities.

– Partitioning and Mapping – break large dataflow graphs into groups and map those groupsacross multiple devices and across time

– Automatic Scheduling – automatically determine firing sequence, optimal mappings andsequence of configurations

• Progress– Cost analysis included as part of wordlength analysis– Implemented uni-rate and multi-rate pipeline alignment and scheduling algorithms– Memory allocation support

PerformanceModeling

Partitioning andMapping

AutomaticScheduling

Signal Flow Graph

CommonDatabase

inPtolemy Performance Metrics

Allocated Functions

03/07/01 Page - 14BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

N8

PORT1 PORT2

I2

P=2

I3

P=1

I4

P=1

I5

P=1

I6

P=1

N1

N2

N3

N4

N5

N6

N7

A

B

C

D

E

THE ALGORITHM DATAFLOW GRAPH

I = Instance N=NodeP=Pipeline Delays

DATAPATH AND VARIABLE LOCATIONS

ADDED TO NETLIST BY SEQUENCER GENERATOR

MODIFIED ALGORITHM DATAFLOW GRAPH

FINAL ALGORITHM SCHEDULE

Input Output

Pipeline alignment and schedule determinationrequired for logic synthesis

Node N1 N2 N3 N4 N5 N6 N7 N8 N9 SEL LD1 LD2 LD3 PORT1 PORT2

Activation Sequence

N8

I2

I3P=1

I4P=1

I5P=1

I6P=1

N1

N2

N3

N4

N5

N6

N7

I7P=1

N9

LDEN1

LDEN2

LDEN2

MEM1MEM2

2-1MUX

DELAY

SEL

P=2

FPGA

RAMBANK 1

A

B

C

RAMBANK 2

D

E

Automatic Scheduling

Page 8: Adaptive Computing Systems (ACS) Domain for Implementing

03/07/01 Page - 15BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Processing Model• Well-matched to Ptolemy Synchronous Dataflow (SDF) Domain

– Unit or block token produce and consume amounts– Netlist structure determines execution order constraints– Pipeline delay information required to determine absolute timing

– Delays are set to align pipelines for maximum throughput– Delay can be automatically determined from block parameters

• Combination of fully synchronous model and tagged synchronous models– No handshaking or tags but data is not always valid– Data validity is implicit in timing of latch signals

• Memory access fits same model– Data from common memory demuxed into separate streams running at lower rate– Data to common memory multiplexed to a single port

• Multiple FPGAs introduce additional pipeline delays• Multi-rate parameterized execution

03/07/01 Page - 16BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Smart Generators• Objectives

– Parameterized libraries – generate node implementations for specified bit widths andparameter values

– Hierarchical representations – provide generators that can recursively call other generators– Interface generation – automatically generate software to move data between general-

purpose processor and reconfigurable platform and to manage sequences of configurations– General synthesis – provide device independent representation of implementation

• Progress– Implemented portable logic synthesis methodology with VHDL as first target– Integrated Xilinx Core Generators (4000 Series) capability within VHDL code generation– Implemented smart generators for state machine and memory control– Hierarchical generation

GeneratorSelection

VHDL Interface LibrariesAdaptive

ComputingResource

DeviceProgramming

Allocated Functions

CommonDatabase

inPtolemy

Page 9: Adaptive Computing Systems (ACS) Domain for Implementing

03/07/01 Page - 17BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Multi-FPGA CapabilityDesign Generation for Single or Multiple FPGAs

FPGA 1Logic

FPGA 3Logic

FPGA 1Routing

FPGA 2Routing

FPGA 3Routing

FPGA 4Routing

SingleFPGA

Implementation

FPGA 1Logic

Multi-FPGAImplementation

03/07/01 Page - 18BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

FPGAImplementation

Winograd DFT-Based FSK Communications Receiver

Page 10: Adaptive Computing Systems (ACS) Domain for Implementing

03/07/01 Page - 19BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Results from FPGA-target / Back-end Tools

Generated VHDL Generated Schedule FPGA Design

03/07/01 Page - 20BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

SDF WildforceTM Star executes complete FPGA designin hardware on Annapolis Wildforce FPGA board

SDF Galaxy

Hardware-in-the-Loop

Page 11: Adaptive Computing Systems (ACS) Domain for Implementing

03/07/01 Page - 21BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

Processing Results

03/07/01 Page - 22BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

SHARP*/HRR Algorithm

Algorithm• Given test vector

– For each template• For each shift

– Compute least squares error

• Select template with minimum error

Complexity• 70 data points per vector• Number of shifts = 11 (in range)• Number templates = 3,600/class• 186 sec/class for 11 shifts on a Sun

Ultra 5 (360 MHZ) workstation• Expect 30x improvement

Non-Linearity

LeastSquares Fit

TestVector

ModelingError

ShiftTemplate Vector

(one per target, per azimuth,per elevation)

Can be done with correlation if templates aresuitably pre-processed

* System-oriented High Range Resolution(HRR) Automatic Recognition Program

Page 12: Adaptive Computing Systems (ACS) Domain for Implementing

03/07/01 Page - 23BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

SHARP/HRR Algorithm

FPGA DesignFPGA DesignSchedule Schedule

NormalizationResults

(HW vs. SW)

Correlation ResultsAcross Range Shifts

(typical expected)

NORMALIZATIONCORRELATION

03/07/01 Page - 24BAE SYSTEMS • University of California, Berkeley

Fourth Biennial Ptolemy Miniconference – 22-23 Mar 2001

ACS Domain DSP Algorithms in Reconfig.HW

ACS Tools - Facts and Figures• 23 Functional Blocks (ACS stars) developed• ~100 lines of code needed for new block/star• Two ACS Architectures supported

– WildforceTM (4062XL)– WildstarTM (XCV1000) (in progress)

• ~16,000 lines of C++ code developed• ~15 min to generate VHDL for five FPGA design• Explore ~1000 bit-width combinations in 1 minute• Ptolemy Classic runs under Solaris and Linux