
Design Automation for Streaming Systems

Eylon Caspi
University of California, Berkeley

12/2/05


Outline

♦ Streaming for Hardware

♦ From Programming Model to Hardware Model

♦ Synthesis Methodology for FPGA
  • Streams, Queues, SFSM Logic

♦ Characterization of 7 Multimedia Apps

♦ Optimizations
  • Pipelining, Placement, Queue Sizing, Decomposition


Large System Design Challenges

♦ Devices growing with Moore’s Law

• AMD Opteron dual core CPU: ~230M transistors

• Xilinx Virtex 4 / Altera Stratix-II FPGAs: ~200K LUTs

♦ Problems of deep submicron (DSM), large systems

• Growing interconnect delay, timing closure
    ♦ “Routing delays typically account for 45% to 65% of the total path delays” (Xilinx Constraints Guide)

• Slow place-and-route

• Design complexity

• Designs do not scale well on next gen. device; must redesign

♦ Same problems in FPGAs


Limitations of RTL

♦ RTL = Register Transfer Level

♦ Fully exposed timing behavior
  • always @(posedge clk) ...
♦ Laborious, error prone
♦ Unpredictable interconnect delay
  • How deep to pipeline?
  • Redesign on next-gen device
♦ Undermines reuse
♦ Existing solutions
  • Modular design
  • Floorplanning
  • Physical synthesis
  • Hierarchical CAD
  • Latency insensitive communication


Streams

♦ A better communication abstraction

♦ Streams connect modules
  • FIFO buffered channel (queue)
  • Blocking read
  • Timing independent (deterministic)
♦ Robust to communication delay
  • Pipeline across long distances
  • Robust to unknown delay
    ♦ Post-placement pipelining
    ♦ Alternate transport (packet switched NOC)
♦ Flexibly timed module interfaces
  • Robust to module optimization (pipeline, reschedule, etc.)
♦ Enhances modular design + reuse

[Figure: memory and compute modules connected by streams]


Streaming Applications

♦ Persistent compute structure (infrequent changes)

♦ Large data sets, mostly sequential access

♦ Limited feedback

♦ Implement with deep, system-level pipelining

♦ E.g. DSP, multimedia

♦ JPEG Encoder: [figure not reproduced]


Ad Hoc Streaming

♦ Every module needs streaming flow control
  • Block if inputs not available, output not ready to receive
♦ Every stream needs queueing
  • Pipeline to match interconnect delay
  • Queue to absorb delay mismatch, dynamic rates
♦ Manual implementation, in HDL
  • Laborious (flow control, queues)
  • Error prone (deadlock if protocol is violated or queue is too small)
  • No automation (pipeline depth, queue choice / width / depth)
♦ Interconnect / queue IP (e.g. OCP / Sonics Bus)
  • Still no automation


Systematic Streaming

♦ Strong stream semantics: Process Networks
  • Stream = FIFO channel with (flavor of) blocking read
  • E.g. Kahn Process Networks, Dataflow Process Networks (E. A. Lee)
♦ Streams as programming primitive
  • Language support hides flow control
♦ Compiler support
  • Compiler generated flow control
  • Compiler controlled pipelining, queue depth, queue impl.
  • Compiler optimizations (e.g. module merging, partitioning)
♦ Benefits
  • Easy, correct, high performance
  • Portable
  • Paging / Virtualization is a logical extension (automatic page partitioning)









SCORE Model

♦ Application = Graph of stream-connected operators

♦ Operator = Process with local state

♦ Stream = FIFO channel, unbounded capacity, blocking read

♦ Segment = Memory, accessed via streams

♦ Dynamics:
  • Dynamic I/O rates
  • Dynamic graph construction (omitted in this work)

♦ SCORE = Stream Computations Organized for Reconfigurable Execution

[Figure: graph of operators (SFSMs) and memory segments connected by streams]


SCORE Programming Model: TDF

♦ TDF = behavioral language for
  • SFSM operators (Streaming Extended Finite State Machine)
  • Static operator graphs
♦ State machine for
  • Firing control
  • Sequencing, branching
♦ Firing semantics
  • In state X, wait for X’s inputs, then evaluate X’s action

    state foo (i, j): o = i + j; goto bar;

[Figure: operator with input streams i, j and output stream o]
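The firing semantics above can be sketched as a small behavioral model in Python. This is only an illustration, not TDF or the compiler's output: the `SFSM` class, the dictionary-of-deques stream representation, and the body of state `bar` (the slide only names it as a goto target) are assumptions made for the example.

```python
from collections import deque


class SFSM:
    """Toy SFSM: each state names the inputs it waits for and an action."""

    def __init__(self, states, initial):
        self.states = states          # name -> (input names, action)
        self.state = initial          # present state

    def step(self, inputs, outputs):
        """Fire the present state once if all of its inputs hold a token."""
        needed, action = self.states[self.state]
        if any(not inputs[name] for name in needed):
            return False              # blocking read: stall, consume nothing
        args = [inputs[name].popleft() for name in needed]
        self.state = action(outputs, *args)   # action returns the next state
        return True


def foo(outputs, i, j):               # state foo (i, j): o = i + j; goto bar;
    outputs["o"].append(i + j)
    return "bar"


def bar(outputs, i):                  # hypothetical state, not in the slides
    outputs["o"].append(-i)
    return "foo"


op = SFSM({"foo": (("i", "j"), foo), "bar": (("i",), bar)}, "foo")
ins = {"i": deque([1, 2, 3]), "j": deque([10])}
outs = {"o": deque()}
while op.step(ins, outs):
    pass
print(list(outs["o"]), op.state)      # [11, -2] foo
```

The run stops in state `foo` with one token left on `i`, because `foo` also needs a token on `j`; a state fires only when every input in its signature is present.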


SCORE / TDF Process Networks

♦ A process from M inputs to N outputs, unified stream type S (i.e. $S^M \to S^N$)

♦ SFSM = $(\Sigma, \sigma_0, \sigma, R, f_{NS}, f_O)$
  • $\Sigma$ = set of states
  • $\sigma_0 \in \Sigma$ = initial state
  • $\sigma \in \Sigma$ = present state
  • $R \subseteq (\Sigma \times S^M)$ = set of firing rules
  • $f_{NS} : R \to \Sigma$ = next state function
  • $f_O : R \to S^N$ = output function

♦ Similar to dataflow process networks [Lee+Parks, IEEE May ‘95], but with stateful actors


Related Streaming Models

♦ Streaming Models
  • Kahn PN, DFPN, BDF, SDF, CSDF, HDF, StreamsC, YAPI, Catapult C, SHIM
♦ Streaming Platforms
  • Pleiades, Philips VSP, Imagine, TRIPS
♦ How do we differ?
  • Stateful processes
  • Deterministic
  • Dynamic dataflow rates (FSM nodes)
  • Direct synthesis to hardware
  • Bounded buffers


Streaming Platforms

♦ FPGA (this work)

♦ Paged FPGA
  • Page = fixed size partition, connected by streams
  • Stylized page-to-page interconnect
  • Hierarchical PAR

♦ Paged, Virtual FPGA (SCORE)
  • Time shared pages
  • Area abstraction (virtually large)

♦ Multiprocessor on Chip

♦ Heterogeneous


The Compilation Problem

Programming Model: TDF | Execution Model: FPGA
• Communicating SFSMs (unrestricted size, # IOs, timing) | • Single circuit / configuration (one or more clocks)
• Unbounded stream buffering | • Fixed size queues

♦ Compile TDF to FPGA: big semantic gap

[Figure: TDF operators, streams, and memory segments compiled onto an FPGA]


The Semantic Gap

♦ Semantic gap between TDF, HW

♦ Need to bind:

• Stream protocol

• Stream pipelining

• Queue implementation

• Queue depths

• SFSM synthesis style (behavioral synthesis)

• Memory allocation

• Primary I/O

♦ SCORE device binds some implementation decisions (custom hardware), raw FPGA does not

♦ Want to characterize cost of implementation decisions









Compilation Tool Flow

Application (TDF)
  → tdfc → Verilog
  → Synplify → EDIF (unplaced LUTs, etc.)
  → Xilinx ISE → device configuration bits

♦ tdfc
  • Local optimization
  • System optimization
  • Queue sizing
  • Pipeline extraction
  • SFSM partitioning / merging
  • Pipelining
  • Generate flow ctl, streams, queues
♦ Synplify
  • Behavioral synthesis
  • Retiming
♦ Xilinx ISE
  • Slice packing
  • Place and route


Wire Protocol for Streams

♦ D = Data, V = Valid, B = Backpressure

♦ Synchronous transaction protocol
  • Producer asserts V when D ready; consumer deasserts B when ready
  • Transaction commits if (¬B ∧ V) at clock edge
  • Encode EOS E as extra D bit (out of band, easy to enqueue)

[Figure: producer drives D (Data), E (EOS), and V (Valid) to the consumer; consumer drives B (Backpressure) back to the producer; timing diagram of Clk, D, V, B]
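As a rough illustration of the commit rule (¬B ∧ V), the following Python sketch steps one stream cycle by cycle. The function name and the per-cycle `consumer_ready` list are assumptions made for the example; EOS is left out.

```python
from collections import deque


def simulate_link(tokens, consumer_ready):
    """Cycle-by-cycle model of one D/V/B stream.

    tokens: values the producer wants to send, in order.
    consumer_ready: one boolean per cycle; B is asserted when it is False.
    Returns (received values, cycles on which a transaction committed).
    """
    pending = deque(tokens)
    received, commit_cycles = [], []
    for cycle, ready in enumerate(consumer_ready):
        v = bool(pending)             # producer asserts V while it has data
        b = not ready                 # consumer deasserts B when ready
        if v and not b:               # transaction commits at this clock edge
            received.append(pending.popleft())
            commit_cycles.append(cycle)
    return received, commit_cycles


received, cycles = simulate_link([10, 20, 30], [False, True, True, False, True])
print(received, cycles)               # [10, 20, 30] [1, 2, 4]
```

Data is held by the producer on every cycle where B is asserted, so no token is lost or duplicated regardless of when the consumer becomes ready.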


Operator Firing

♦ In state X, fire if
  • Inputs desired by X are ready (Valid, EOS)
  • Outputs emitted by X are ready (Backpressure)
♦ Firing guard / control flow
  • if (iv && !ie && !ob) begin ib=0; ov=1; ... end
♦ Subtlety: master, slave
  • Operator is slave
    ♦ To synchronize streams, (1) wait for flow control in, (2) fire / emit out
  • Connecting two slaves would deadlock
  • Need master (queue) between every pair of operators

[Figure: operator Op with input signals id, ie, iv, ib and output signals od, oe, ov, ob]

SFSM Synthesis

♦ Implemented as behavioral Verilog, using state ‘case’ in FSM and DP

♦ FSM handles firing control, branching

♦ FSM sends state to DP

♦ DP sends boolean flags to FSM

[Figure: FSM (control) and datapath (DP) with data registers and separate datapaths for state 1 and state 2; stream I/O carries V, E, D, B]


FSM Module, Firing Control

TDF:

    foo (input unsigned[16] x, input unsigned[16] y, output unsigned[16] o)
    {
      state one (x, eos(y)) : o=x+1;
      ...
    }

Verilog FSM module:

    module foo_fsm (clock, reset, x_e, x_v, x_b, y_e, y_v, y_b,
                    o_e, o_v, o_b, state, statecase);
      ...
      always @* begin
        // Default is stall
        x_b_=1; y_b_=1; o_e_=0; o_v_=0;
        state_reg_ = state_reg;
        statecase_ = statecase_stall;
        did_goto_ = 0;
        case (state_reg)
          state_one: begin
            // Firing condition(s) for state one
            if (x_v && !x_e && y_v && y_e && !o_b) begin
              statecase_ = statecase_1;
              // Stream flow ctl for state one
              x_b_=0; y_b_=0; o_v_=1; o_e_=0;
            end
            ...
      end // always @*
    endmodule // foo_fsm


Data-Path Module

TDF:

    foo (input unsigned[16] x, input unsigned[16] y, output unsigned[16] o)
    {
      state one (x, eos(y)) : o=x+1;
      ...
    }

Verilog data-path module:

    module foo_dp (clock, reset, x_d, y_d, o_d, state, statecase);
      ...
      always @* begin
        // Default is stall
        o_d_=16'bx;
        did_goto_ = 0;
        case (state)
          state_one: begin
            // Firing condition(s) for state one
            if (statecase_ == statecase_1) begin
              // Data-path for state one
              o_d_ = (x_d + 1'd1);
            end
            ...
      end // always @*
    endmodule // foo_dp


Stream Buffers (Queues)

♦ Systolic
  • Cascade of depth-1 stages (or depth-N)

♦ Shift register
  • Put: shift all entries
  • Get: tail pointer

♦ Circular buffer
  • Memory with head / tail pointers


Enabled Register Queue

[Figure: enabled register stage with inputs iD, iV, iB, outputs oD, oV, oB, and register enable en]

♦ Systolic, depth-1 stage

♦ 1 state bit (empty/full) = V

♦ Shift in data unless:
  • Full and downstream not ready to consume queued element

♦ Area: 1 FF per data bit
  • On FPGA: 1 LUT cell per data bit
  • Depth-1 (single stage) nearly free, since FFs pack with logic

♦ Speed: as fast as FF
  • But combinationally connects producer + consumer via B
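A minimal behavioral sketch of such a stage in Python, assuming the enable rule stated above (load unless full and downstream not ready). The class and method names are invented for the example; they are not taken from the generated Verilog.

```python
class EnabledRegisterQueue:
    """Depth-1 systolic queue stage: one data register plus a valid bit."""

    def __init__(self):
        self.data = None
        self.valid = False            # the single state bit (empty/full)

    def outputs(self, o_b):
        """Combinational outputs for this cycle: (oD, oV, iB)."""
        i_b = self.valid and o_b      # B passes through combinationally
        return self.data, self.valid, i_b

    def clock(self, i_d, i_v, o_b):
        """Clock edge: load unless full and downstream is not ready."""
        enable = not (self.valid and o_b)
        if enable:
            self.data, self.valid = i_d, i_v


q = EnabledRegisterQueue()
q.clock("a", True, o_b=True)          # empty stage loads "a"
print(q.outputs(o_b=True))            # ('a', True, True): full and blocked
q.clock("b", True, o_b=True)          # stalled: "b" is not loaded
q.clock("b", True, o_b=False)         # "a" leaves, "b" loads in the same cycle
print(q.outputs(o_b=False))           # ('b', True, False)
```

Note that `iB` depends directly on `oB` within the same cycle, which is the combinational producer-to-consumer path the slide warns about.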


Xilinx SRL16

♦ SRL16 = Shift register of depth 16 in one 4-LUT cell
  • Shift register of arbitrary width: parallel SRL16; arbitrary depth: cascade SRL16

♦ Improve queue density by 16x

[Figure: 4-LUT mode vs. SRL16 mode]


Shift Register Queue

♦ State: empty bit + capacity counter

♦ Data stored in shift register
  • In at position 0
  • Out at position Address

♦ Address = number of stored elements minus 1

♦ Synplify infers SRL16E from Verilog array
  • Parameterized depth, width

♦ Flow control
  • ov = (State==Non-Empty)
  • ib = (Address==Depth-1)

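♦ Performance improvements
  • Registered data out
  • Registered flow control
  • Specialized, pre-computed fullness

[Figure: shift register with enable, Address counter (+1, -1, 0) with comparators =Depth-1 (full) and =0 (zero), and a two-state FSM (Empty, Non-Empty); ports iB, iV, iD,E and oB, oV, oD,E]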
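The queue described on this slide can be modeled behaviorally as below. This is a sketch under stated assumptions: backpressure is asserted when the queue is full (Address == Depth-1), and the extra `not self.empty` guard on `i_b` is an addition so that a depth-1 queue also works. Class and method names are invented for the example.

```python
class ShiftRegisterQueue:
    """Behavioral model of a shift-register queue with an address counter."""

    def __init__(self, depth):
        self.depth = depth
        self.regs = [None] * depth    # position 0 holds the newest entry
        self.empty = True             # FSM state: Empty / Non-Empty
        self.address = 0              # stored elements minus 1 when non-empty

    def flow_control(self):
        """Return (ov, ib) as functions of the current state only."""
        o_v = not self.empty
        i_b = (not self.empty) and self.address == self.depth - 1
        return o_v, i_b

    def clock(self, i_d, i_v, o_b):
        """One clock edge; returns the element consumed downstream, if any."""
        o_v, i_b = self.flow_control()
        put = i_v and not i_b
        get = o_v and not o_b
        out = self.regs[self.address] if get else None
        if put:
            self.regs = [i_d] + self.regs[:-1]    # put: shift all entries
        if put and not get:
            if self.empty:
                self.empty = False                # first element sits at 0
            else:
                self.address += 1
        elif get and not put:
            if self.address == 0:
                self.empty = True
            else:
                self.address -= 1
        return out                                # put and get: address holds


q = ShiftRegisterQueue(depth=3)
for value in (1, 2, 3):
    q.clock(value, i_v=True, o_b=True)            # fill while blocked
print(q.flow_control())                           # (True, True)
print([q.clock(None, i_v=False, o_b=False) for _ in range(3)])  # [1, 2, 3]
```

A simultaneous put and get leaves Address unchanged, because the shift moves the next-oldest element into the slot the head just vacated.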


SRL Queue with Registered Data Out

♦ Registered data out
  • od (clock-to-Q delay)
  • Non-retimable

♦ Data output register extends shift register

♦ Bypass shift register when queue empty

♦ 3 states (Empty, One, More)

♦ Address = number of stored elements minus 2

♦ Flow control
  • ov = !(State==Empty)
  • ib = (Address==Depth-2)

[Figure: shift register with enable, data output register, Address counter (+1, -1, 0) with comparators =Depth-2 (full) and =0 (zero), and a three-state FSM (Empty, One, More); ports iB, iV, iD,E and oB, oV, oD,E]


SRL Queue with Registered Flow Ctl.

♦ Registered flow ctl.
  • ov (clock-to-Q delay)
  • ib (clock-to-Q delay)
  • Non-retimable

♦ Flow control
  • ov_next = !(State_next==Empty)
  • ib_next = (Address_next==Depth-2)

♦ Based on pre-computed fullness
  • full_next = (Address_next==Depth-2)

[Figure: queue of the previous slide with comparators =Depth-2 (full) and =0 (zero) and an added full_next signal; three-state FSM (Empty, One, More)]


SRL Queue with Specialized,Pre-Computed Fullness

♦ Speed up critical full pre-computation by special-casing full_next for each state

♦ Flow control
  • ov_next = !(State_next==Empty)
  • ib_next = full_next

♦ zero pre-computation is less critical

♦ Result
  • >200MHz unless very large (e.g. 128 x 128)
  • All output delays are clock-to-Q
  • Area ≈ 3 x (SRL16E area)

[Figure: queue with comparators =Depth-3 (full) and =0 (zero), data output register, and a three-state FSM (Empty, One, More); ports iB, iV, iD,E and oB, oV, oD,E]


SRL Queue Speed


SRL Queue Area


Page Synthesis

♦ Page = Cluster of Operator(s) + Queues

♦ SFSMs
  • One or more per page
  • Further decomposed into FSM, data-path

♦ Page Input Queues
  • Deep
  • Drain pipelined page-to-page streams before reconfiguration

♦ In-page Queues
  • Shallow

♦ Separately Synthesizable Modules
  • Separate characterization
  • Consider custom resources

[Figure: page containing page input queue(s), an in-page queue, and two operators (Op 1, Op 2), each split into FSM and datapath]


Page Synthesis

♦ Module Hierarchy
  • Local / output queues
  • Individual SFSMs (combinational cores)
  • Operators and local / output queues
  • Input queues
  • Page

[Figure: same page diagram as the previous slide, annotated with the module hierarchy]









Tool Flow, Revisited

Application → tdfc → Verilog → Synplify → EDIF (unplaced LUTs, etc.) → Xilinx ISE → device configuration bits

♦ Separate compilation for application, SFSMs
  • Page
  • SFSM
  • Datapath
  • FSM

Tool Options

♦ tdfc
  • Identical queuing for every stream (SRL16 based, depth 16)
  • I/O boundary regs (for Xilinx static timing analysis)
♦ Synplify 8.0
  • Target 200MHz
  • Optimize: FSM, retiming, pipelining
  • Retain monolithic FSM encodings
♦ ISE 6.3i
  • Constrain to minimum square area, at least max slice packing + 20%, expand if fail PAR
♦ Device: XC2VP70 -7


PAR Flow for Minimum Area

♦ EDIF netlist from Synplify

♦ Constraints file
  • Page area
  • Target period

♦ ngdbuild:
  • Convert netlist EDIF → NGD

♦ map:
  • Pack LUTs, MUXes, etc. into slices

♦ trce: (pre-PAR)
  • Static timing analysis, logic only

♦ par:
  • Place and route

♦ trce: (post-PAR)
  • Static timing analysis

[Figure: flowchart EDIF + constraints → ngdbuild → map → trce → par → trce, with an "Ok?" check after map and after par; flowchart labels: target packed slices, target 1 extra row/col, target packed timing]


SCORE Applications

♦ 7 Multimedia Applications / 279 Operators
  • MPEG, JPEG, Wavelet, IIR
  • Written by Joe Yeh
  • Mostly feed-forward streaming
  • Constant consumption / production ratios, except compressors (ZLE, Huffman)

Application | SFSMs | Segments | Streams In | Streams Local | Streams Out | Speed (MHz) | Area (4-LUT cells) | % Area FSMs | % Area Queues
IIR | 8 | 0 | 1 | 7 | 1 | 166 | 1,922 | 3.4% | 27.7%
JPEG Decode | 9 | 1 | 1 | 41 | 8 | 47 | 7,442 | 7.0% | 28.7%
JPEG Encode | 11 | 4 | 8 | 42 | 1 | 57 | 6,728 | 7.5% | 36.9%
MPEG Encode IP | 80 | 16 | 6 | 231 | 1 | 47 | 41,472 | 5.5% | 39.7%
MPEG Encode IPB | 114 | 17 | 3 | 313 | 1 | 50 | 65,772 | 5.2% | 40.5%
Wavelet Encode | 30 | 6 | 1 | 50 | 7 | 106 | 8,320 | 10.1% | 32.0%
Wavelet Decode | 27 | 6 | 7 | 49 | 1 | 109 | 8,712 | 8.5% | 29.6%
Total | 279 | 50 | 27 | 733 | 20 | - | 140,368 | 5.9% | 38.3%


Page Area

♦ 87% of SFSMs are smaller than 512 LUTs — by design

♦ FSMs small

♦ Datapaths dominate in most large pages

[Figure: page area breakdown; largest pages labeled DCT, IDCT]


Page Speed

♦ FSMs (flow control) are fast, never critical

♦ Queues are critical for the fastest 1/3 of pages

♦ Datapaths dominate

[Figure: page speed breakdown; chart labels 43%, 47%, 10%]









Improving Performance, Area

♦ Local (module) Optimization
  • Traditional compiler optimization (const folding, CSE, etc.)
  • Datapath pipelining / loop scheduling
  • Granularity transforms (composition / decomposition)

♦ System Level Optimization
  • Interconnect pipelining
  • Shrink / remove queues
  • Area-time transformations (rate matching, serialization, parallelization)


Pipelining With Streams

♦ Datapath pipelining
  • Add registers at output (or input)
  • Retime into place

♦ Harder in practice (FSM, cycles)
  • Add registers at strategic locations
  • Rewrite control
  • Avoid violating communication protocol

♦ Stream pipelining
  • Add registers on streams
  • Retime into datapath
  • Modify queues, not processes

[Figure: FSM and datapath (DP) blocks before and after pipelining]


Logic Pipelining

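[Figure: producer → L pipeline registers on D (Data) and V (Valid) → queue with L reserve → consumer; B (Backpressure) returns to the producer; registers retime backwards into the producer]

♦ Add L pipeline registers to D, V

♦ Retime backwards
  • This pipelines feed-forward parts of producer’s data-path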

♦ Stale flow control may overflow queue (by L)

♦ Modify queue to emit back-pressure when empty slots ≤ L

♦ No manual modification of processes


Logic Relaying + Retiming

♦ Break-up deep logic in a process

♦ Relay through enabled register queue(s)

♦ Retime registers into adjacent process
  • This pipelines feed-forward parts of process’s datapath

• Can retime into producer or consumer

♦ No manual modification of processes

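[Figure: producer → original queue → enabled register queue (en) → consumer, with D, V, B on each link; the enabled registers retime into the adjacent process]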


Benefits, Limitations

♦ Benefits
  • Simple to implement, relies only on retiming
  • Sufficient for many cases, e.g. DCT, IDCT

♦ Limitations
  • Feed-forward only (weaker than loop sched.)
  • Resource sharing obfuscates retiming opportunities

♦ Extends to interconnect pipelining
  • Do not retime into logic; register placement only
  • Also pipeline B, modify queue


Pipelining Configuration

♦ Pipeline depth parameters: Li+Lp+Lr

♦ Uniform pipelining: same depths for every stream

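[Figure: one stream into and out of an SFSM, with D, V, B on every link; input-side logic relaying (Li enabled registers), logic pipelining (Lp registers feeding a queue with reserve), and output-side logic relaying (Lr enabled registers); all three retime into the SFSM]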


Speedup from Logic Pipelining

Enabled Regs (Lr) D FFs (Lp)


Expansion from Logic Pipelining



Some Things Are Better Left Unpipelined

♦ Page speedup: [plot not reproduced]

♦ Page expansion: [plot not reproduced]

♦ Initially fast pages should not be pipelined


Page Specific Logic Pipelining

♦ Separate pipelining of each SFSM

♦ Assumption: application speed = slowest page speed
  • Critical page

♦ Repeatedly improve slowest page until no further improvement is possible

♦ Page improvement heuristics
  • Greedy Lr : Add one level of pipelining in 0+0+Lr
  • Greedy Lp : Add one level of pipelining in 1+Lp+0
  • Max : Pipeline to best page speed (brute force)

♦ Greedy heuristics may end early
  • Non-monotonicity: adding a level of pipelining may slow page
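The greedy loop can be sketched as follows. This is an illustration only: `page_speed` stands in for a full synthesis and place-and-route run, the page names and speed numbers are made up, and the stopping rule (stop as soon as the critical page fails to improve) is one reading of "until no further improvement is possible".

```python
def pipeline_pages(depths, page_speed, max_depth=4):
    """Greedy loop: deepen the slowest page until it stops improving.

    depths: dict page -> current pipelining depth (mutated in place).
    page_speed: hypothetical callback (page, depth) -> speed in MHz.
    Returns the application speed, taken as the slowest page speed.
    """
    speeds = {p: page_speed(p, d) for p, d in depths.items()}
    while True:
        critical = min(speeds, key=speeds.get)    # slowest page
        trial = depths[critical] + 1
        if trial > max_depth:
            break
        new_speed = page_speed(critical, trial)
        if new_speed <= speeds[critical]:
            break                                 # greedy ends early
        depths[critical], speeds[critical] = trial, new_speed
    return min(speeds.values())


# Made-up speeds (MHz) indexed by pipelining depth 0..3.
TABLE = {"dct": [50, 80, 120, 110], "huff": [90, 100, 95, 130]}


def lookup_speed(page, depth):
    return TABLE[page][depth]


depths = {"dct": 0, "huff": 0}
result = pipeline_pages(depths, lookup_speed, max_depth=3)
print(result, depths)                 # 100 {'dct': 2, 'huff': 1}
```

In this made-up data, `huff` would reach 130 MHz at depth 3, but the greedy loop stops at depth 1 because depth 2 is slower. That is the non-monotonicity effect named above, and it is what the brute-force Max heuristic avoids.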


Speedup from Page Specific



Expansion from Page Specific



Interconnect Delay

♦ Critical routing delay grows with circuit size
  • Routing delay for an application: avg. 45% - 56%
  • Routing delay for its slowest page: avg. 40% - 50%
  • Ratio (appl. to slowest page): avg. 0.99x - 1.34x
    ♦ Averaged over 7 apps / varies with logic pipelining

♦ Modular design helps
  • Retain critical routing delay of page, not application
  • Page-to-page delays (streams) can be pipelined


Interconnect Pipelining

[Figure: producer → long-distance stream with W pipeline registers on each of D (Data), V (Valid), and B (Backpressure) → queue with 2W reserve → consumer]

♦ Add W pipeline registers to D, V, B
  • Mobile registers for placer
  • Not retimable

♦ Stale flow control may overflow queue (by 2W)
  • Staleness = total delay on B-V feedback loop = 2W

♦ Modify downstream queue to emit back-pressure when empty slots ≤ 2W
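The reserve rule can be checked with a small worst-case simulation. The sketch below is an assumption-laden model, not the generated hardware: the producer sends on every cycle in which the (stale) B it sees is clear, the consumer never drains, and `fwd` / `back` are the register counts on the D,V path and on the B path. Setting `back=0` corresponds to the logic pipelining case with a reserve of L.

```python
from collections import deque


def max_occupancy(depth, reserve, fwd, back, cycles=100):
    """Peak number of tokens the queue would have to hold.

    The queue asserts B when its empty slots are <= reserve. D/V take `fwd`
    cycles to reach the queue and B takes `back` cycles to reach the producer.
    """
    v_pipe = deque([False] * fwd)     # V bits in flight toward the queue
    b_pipe = deque([False] * back)    # B bits in flight toward the producer
    occupancy = peak = 0
    for _ in range(cycles):
        b_pipe.append((depth - occupancy) <= reserve)
        b_seen = b_pipe.popleft()     # producer sees stale backpressure
        v_pipe.append(not b_seen)     # producer sends whenever B looks clear
        if v_pipe.popleft():          # a token arrives at the queue
            occupancy += 1
            peak = max(peak, occupancy)
    return peak


W = 2
print(max_occupancy(depth=16, reserve=2 * W, fwd=W, back=W))      # 16: fits
print(max_occupancy(depth=16, reserve=2 * W - 1, fwd=W, back=W))  # 17: overflow
```

In this model a reserve of exactly 2W fills the queue without overflowing, and one slot less overflows by one token, matching the staleness argument on the slide.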


Speedup from Interconnect Pipelining


Speedup from Interconnect Pipelining, No Area Constraint


Expansion from Interconnect Pipelining, No Area Constraint


Interconnect Register Allocation

♦ Commercial FPGAs / tool flows
  • No dedicated interconnect registers
  • Allocation: add to netlist, slice pack, place-and-route
  • If registers are packed with logic: limited register mobility
  • If registers are packed alone: area overhead

♦ Better: Post-placement register allocation
  • Weaver et al., “Post-Placement C-Slow Retiming for the Xilinx Virtex FPGA,” FPGA 2003
  • Allocation: PAR, c-slow, retime, scavenge registers, reroute
  • No area overhead (scavenge registers from existing placement)
  • Better performance, since routing delay is known
  • Modification for streaming:
    ♦ PAR, pipeline, retime, scavenge registers, reroute, modify queue depths (configuration specialization)


Throughput Modeling

♦ Pipelining feedback loops may reduce throughput (tokens per clock period)
  • Which loops / streams are critical?

♦ Throughput model for PN
  • Feedback cycle C with M tokens, N pipe delays, has token period (clock periods per token): $T_C = N/M$
  • Overall token period: $T = \max_C \{T_C\}$
  • Available slack: $\mathrm{CycleSlack}_C = T - T_C$
  • Generalize to multi-rate, dynamic rate by unfolding equivalent single-rate PN

[Figure: process network with two feedback cycles, $T_{C1} = 3$ and $T_{C2} = 2$]
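A worked example of the throughput model, as a sketch. It assumes the token period is measured in clock periods per token, i.e. pipe delays divided by tokens on the cycle, so that the slowest cycle has the largest period. The token and delay counts below are made up to reproduce the periods 3 and 2 shown in the figure.

```python
from fractions import Fraction


def token_periods(cycles):
    """cycles: name -> (tokens M, pipe delays N). Returns N/M per cycle."""
    return {name: Fraction(n, m) for name, (m, n) in cycles.items()}


def cycle_slack(cycles):
    """Return (overall token period T, slack T - T_C for each cycle)."""
    periods = token_periods(cycles)
    overall = max(periods.values())
    return overall, {name: overall - t for name, t in periods.items()}


# Hypothetical counts: C1 has 1 token and 3 delays, C2 has 2 tokens and 4 delays.
overall, slack = cycle_slack({"C1": (1, 3), "C2": (2, 4)})
print(overall, slack["C1"], slack["C2"])      # 3 0 1
```

Here C1 is critical (zero slack), while C2 could absorb additional pipelining worth one clock period per token before it starts limiting throughput.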


Throughput Aware Optimizations

♦ Throughput aware placement
  • Adapt [Singh+Brown, FPGA 2002]
  • Stream slack: $T_e = \max_{C \,:\, e \in C} \{T_C\}$
  • Stream net criticality: $\mathrm{crit} = 1 - (T - T_e)/T$

♦ Throughput aware pipelining
  • Pipeline stream w/o exceeding slack
  • Pipeline module s.t. depth does not exceed any output stream slack

♦ Pipeline balancing (by retiming)

♦ Process Serialization
  • Serial arithmetic for process with low throughput, high slack


Stream Buffer Sizing

♦ Fixed size buffers in hardware
  • For minimum area (want smallest feasible queue)
  • For performance (want deep enough to avoid stalls from producer-consumer timing mismatch)

♦ Semantic gap
  • Buffers are unbounded in TDF, bounded in HW
  • Small buffer may create artificial deadlock (bufferlock)
  • Theorem: memory bound is undecidable for a Turing complete process network
  • In practice, our buffering requirements are small

[Figure: two example process networks communicating over streams x and y, one with bounded and one with unbounded buffer requirements]


Dealing with Undecidability

♦ Handle unbounded streams
  • Buffer expansion [Parks ‘95]
    ♦ Detect bufferlock, expand buffers
  • Hardware implementation
    ♦ Buffer expansion = rewire to another queue
    ♦ Storage in off-chip memory or queue bank

♦ Guarantee depth bound for some cases
  • User depth annotation
  • Analysis
    ♦ Identify compatible SFSMs with balanced schedules

♦ Detect bufferlock and fail


Interface Automata

♦ A finite state machine that transitions on I/O actions
  • Not input-enabled (not every I/O on every cycle)

♦ $G = (V, E, A_i, A_o, A_h, V_{start})$
  • $A_i$ = input actions x? (in CSP notation)
  • $A_o$ = output actions y! (in CSP notation)
  • $A_h$ = internal actions z; (in CSP notation)
  • $E \subset V \times (A_i \cup A_o \cup A_h) \times V$ (transition on action)

♦ Execution trace = (v, a, v, a, …) (non-deterministic branching)

[Figure: select operator with inputs s, t, f and output o, and its interface automaton with states S, S’, T, F, T’, F’ and actions s?, st;, sf;, t?, f?, o!]

de Alfaro + Henzinger, Symp. Found. SW Eng. (FSE) 2001


Automata Composition

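♦ Composition ~ product FSM with synchronization (rendezvous) on common actions

[Figure: automaton A (states A, A’; actions x?, y!) feeding automaton B (states B, B’; actions y?, z!) over stream y; their direct product over states AB, A’B, AB’, A’B’; and the resulting composition, in which the shared action becomes internal (y;)]

♦ Composition edges:
  (i) step A on unshared action
  (ii) step B on unshared action
  (iii) step both on shared action

♦ Compatible composition → bounded memory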
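The three composition edge rules can be sketched as a reachability search over the product state space. This is a simplified illustration, not the interface-automata formalism in full: actions are bare names without the ?/!/; direction markers, and the edge lists for A and B are one reading of the figure.

```python
def compose(a_edges, b_edges, shared, start):
    """Product of two automata with rendezvous on shared actions.

    Edges are (state, action, next_state) triples.
    Returns the set of edges reachable from the composite state `start`.
    """
    edges, frontier, seen = set(), [start], {start}
    while frontier:
        sa, sb = frontier.pop()
        steps = []
        for s, act, t in a_edges:
            if s == sa and act not in shared:
                steps.append((act, (t, sb)))          # (i) step A alone
        for s, act, t in b_edges:
            if s == sb and act not in shared:
                steps.append((act, (sa, t)))          # (ii) step B alone
        for s, act, t in a_edges:
            for s2, act2, t2 in b_edges:
                if s == sa and s2 == sb and act == act2 and act in shared:
                    steps.append((act, (t, t2)))      # (iii) step both
        for act, nxt in steps:
            edges.add(((sa, sb), act, nxt))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return edges


A = [("A", "x", "A'"), ("A'", "y", "A")]      # reads x, then writes y
B = [("B", "y", "B'"), ("B'", "z", "B")]      # reads y, then writes z
edges = compose(A, B, shared={"y"}, start=("A", "B"))
for edge in sorted(edges):
    print(edge)
```

In the composite state (A', B') there is no y edge: A wants to write y but B is not ready to read it, so the only way forward is B's z action. States like this are where a queue between the two processes matters.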


Stream Buffer Bounds Analysis

♦ Given a process network, find minimum buffer sizes to avoid bufferlock

♦ Buffer (queue) is also an automaton

♦ Symbolic Parks’ algorithm
  • Compose network using arbitrary buffer sizes
  • If deadlock, try larger sizes

♦ Practical considerations: avoiding state explosion
  • Multi-action automata
  • Know which streams to expand first
  • Compose pairwise in clever order
    ♦ Composition is associative
  • Cull states reaching deadlock
  • Partition system

[Figure: processes A and B connected by streams x and y, with input i and output o; the same network with queue Q inserted on x; queue automaton with states 0, 1, 2 and actions i?, o!]
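The "if deadlock, try larger sizes" loop can be illustrated with a simulation-based stand-in. This sketch does not compose automata symbolically as the slide describes; it runs fixed cyclic put/get scripts against bounded queues and, on bufferlock, grows the smallest full queue that blocks a writer (the expansion rule of Parks' algorithm). The process scripts, function names, and the bounded number of rounds are all assumptions, and a clean run of `rounds` rounds is evidence of liveness, not a proof.

```python
def run(scripts, capacity, rounds=1000):
    """Simulate cyclic put/get scripts over bounded queues.

    Returns None if the network is still live after `rounds` rounds, else
    the set of streams on which some process is blocked writing.
    """
    fill = {s: 0 for s in capacity}
    pc = {p: 0 for p in scripts}
    for _ in range(rounds):
        progressed = False
        for p, script in scripts.items():
            op, s = script[pc[p] % len(script)]
            if op == "put" and fill[s] < capacity[s]:
                fill[s] += 1
            elif op == "get" and fill[s] > 0:
                fill[s] -= 1
            else:
                continue                      # blocked this round
            pc[p] += 1
            progressed = True
        if not progressed:                    # every process is blocked
            pending = (sc[pc[p] % len(sc)] for p, sc in scripts.items())
            return {s for op, s in pending if op == "put"}
    return None


def size_buffers(scripts, streams, limit=64):
    """Grow queue capacities from 1 until the simulation no longer locks."""
    capacity = {s: 1 for s in streams}
    while True:
        full = run(scripts, capacity)
        if full is None:
            return capacity
        if not full:
            raise RuntimeError("deadlock with no blocked writer")
        smallest = min(full, key=capacity.get)
        if capacity[smallest] >= limit:
            raise RuntimeError("capacity limit reached")
        capacity[smallest] += 1


scripts = {
    "A": [("put", "x"), ("put", "x"), ("put", "y")],   # writes x, x, y
    "B": [("get", "y"), ("get", "x"), ("get", "x")],   # reads y first
}
print(size_buffers(scripts, ["x", "y"]))      # {'x': 2, 'y': 1}
```

With depth 1 on x, A blocks on its second write to x while B waits for y, which is bufferlock; a depth of 2 on x removes it.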


SFSM Decomposition (Partitioning)

♦ Why decompose
  • To improve locality
  • To fit into custom page resources

♦ Decomposition by state clustering
  • 1 state (i.e. 1 cluster) active at a time

♦ Cluster states to contain transitions
  • Fast local transitions, slow external transitions
  • Formulation: minimize cut of transition probability under area, I/O constraints

♦ Similar to:
  • VLIW trace scheduling [Fisher ‘81]
  • FSM decomp. for low power [Benini/DeMicheli ISCAS ‘98]
  • GarpCC HW/SW partitioning [Callahan ‘00]
  • VM/cache code placement

[Figure: clustered state graph showing state flow and data flow]


Early SFSM Decomposition Results

♦ Approach 1: Balanced, multi-way, min-cut
  • Modified Wong FBB [Yang+Wong, ACM ‘94]
  • Edge weight is mix: c*(transition probability) + (1-c)*(wire bits)
  • Poor at simultaneous I/O constraint + cut optimization

♦ Approach 2: Spectral order + Extent cover
  • Spectral ordering clusters connected components in 1D
    ♦ Minimize squared weighted distance, weight is mix (as above)
  • Then choose area + I/O feasible extents [start, end] using dynamic programming
  • Effective for partitioning to custom page resources

♦ Under 2% external transitions
  • Amdahl’s law: few slow transitions ⇒ small performance loss

• Achievable with either approach


Summary

♦ Streaming addresses large system design challenges
  • Growing interconnect delay
  • Design complexity
  • Flexibly timed module interfaces
  • Reuse

♦ Methodology to compile SCORE applications to Virtex-II Pro
  • Language + compiler support for streaming

♦ Characterized 7 applications on Virtex-II Pro
  • Queue area ~38%
  • Flow control FSM area ~6%
    ♦ Improve by merging SFSMs, eliminating queues

♦ Stream pipelining
  • For logic
  • For interconnect

♦ Stream based optimizations
  • Pipelining
  • Queue sizing
  • Module merging
  • Partitioning
  • Placement
  • Serialization


Supplemental Material


TDF ∈ Dataflow Process Networks

♦ Dataflow Process Networks [Lee+Parks, IEEE May ‘95]
  • Process enabled by set of firing rules: $R = \{ r_1, r_2, \ldots, r_K \}$
  • Firing rule = set of input patterns: $r_i = ( r_{i,1}, r_{i,2}, \ldots, r_{i,M} )$

♦ DF process for a TDF operator:
  • Feedback arc for state
  • Firing rule(s) per state
    ♦ Patterns match state + input presence
    ♦ E.g. for state $\sigma$ : $r_\sigma = ( [\sigma], r_{\sigma,1}, r_{\sigma,2}, \ldots )$
    ♦ Patterns: $r_{\sigma,j} = [*]$ if input j is in the input signature of state $\sigma$; $r_{\sigma,j} = \bot$ if input j is not in the input signature of state $\sigma$
  • Single firing rule per state = DFPN sequential firing rules
  • Multiple firing rules per state translate the same way, with restrictions to retain determinism

[Figure: dataflow process with a feedback arc carrying the state]


SFSM Partitioning Transform

♦ Only 1 partition active at a time
  • Transform to activate via streams

♦ New state in each partition: “wait”
  • Used when not active
  • Waits for activation from other partition(s)
  • Has one input signature (firing rule) per activator

♦ Firing rules are not sequential, but determinism guaranteed
  • Only 1 possible activator

♦ Activation streams from given source to given dest. partitions can be merged + binary-encoded

[Figure: SFSM with states A, B, C, D split into partition {A,B} with added state WaitAB and partition {C,D} with added state WaitCD, linked by activation streams]

Virtual Paged Hardware (SCORE)

♦ Compute model has unbounded resources
  • Programmer does not target a particular device size

♦ Paging
  • “Compute pages” swapped in/out (like virtual memory)
  • Page context = thread (FSM to block on stream access)

♦ Efficient virtualization
  • Amortize reconfiguration cost over an entire input buffer
  • Requires “working sets” of tightly-communicating pages to fit on device

[Figure: compute pages Transform, Quantize, RLE, Encode connected through buffers]
