Design Automation for Streaming Systems
Eylon Caspi, University of California, Berkeley
12/2/05
IA IB
OA OB
12/2/05 Eylon Caspi 2
Outline
♦ Streaming for Hardware
♦ From Programming Model to Hardware Model
♦ Synthesis Methodology for FPGA
• Streams, Queues, SFSM Logic
♦ Characterization of 7 Multimedia Apps
♦ Optimizations
• Pipelining, Placement, Queue Sizing, Decomposition
12/2/05 Eylon Caspi 3
Large System Design Challenges
♦ Devices growing with Moore’s Law
• AMD Opteron dual core CPU: ~230M transistors
• Xilinx Virtex 4 / Altera Stratix-II FPGAs: ~200K LUTs
♦ Problems of DSM, large systems
• Growing interconnect delay, timing closure
♦ “Routing delays typically account for 45% to 65% of the total path delays” (Xilinx Constraints Guide)
• Slow place-and-route
• Design complexity
• Designs do not scale well on next gen. device; must redesign
♦ Same problems in FPGAs
12/2/05 Eylon Caspi 4
Limitations of RTL
♦ RTL = Register Transfer Level
♦ Fully exposed timing behavior
• always @(posedge clk) ...
♦ Laborious, error prone
♦ Unpredictable interconnect delay
• How deep to pipeline?
• Redesign on next-gen device
♦ Undermines reuse
♦ Existing solutions
• Modular design
• Floorplanning
• Physical synthesis
• Hierarchical CAD
• Latency insensitive communication
12/2/05 Eylon Caspi 5
Streams
♦ A better communication abstraction
♦ Streams connect modules
• FIFO buffered channel (queue)
• Blocking read
• Timing independent (deterministic)
♦ Robust to communication delay
• Pipeline across long distances
• Robust to unknown delay
♦ Post-placement pipelining
♦ Alternate transport (packet switched NOC)
♦ Flexibly timed module interfaces
• Robust to module optimization (pipeline, reschedule, etc.)
♦ Enhances modular design + reuse
[Figure: memory and compute modules connected by streams]
12/2/05 Eylon Caspi 6
Streaming Applications
♦ Persistent compute structure (infrequent changes)
♦ Large data sets, mostly sequential access
♦ Limited feedback
♦ Implement with deep, system level pipelining
♦ E.g. DSP, multimedia
♦ JPEG Encoder:
12/2/05 Eylon Caspi 7
Ad Hoc Streaming
♦ Every module needs streaming flow control
• Block if inputs not available or output not ready to receive
♦ Every stream needs queueing
• Pipeline to match interconnect delay
• Queue to absorb delay mismatch, dynamic rates
♦ Manual implementation, in HDL
• Laborious (flow control, queues)
• Error prone (deadlock if protocol is violated or a queue is too small)
• No automation (pipeline depth, queue choice / width / depth)
♦ Interconnect / queue IP (e.g. OCP / Sonics Bus)
• Still no automation
12/2/05 Eylon Caspi 8
Systematic Streaming
♦ Strong stream semantics: Process Networks
• Stream = FIFO channel with (flavor of) blocking read
• E.g. Kahn Process Networks, Dataflow Process Networks (E.A. Lee)
♦ Streams as programming primitive
• Language support hides flow control
♦ Compiler support
• Compiler generated flow control
• Compiler controlled pipelining, queue depth, queue impl.
• Compiler optimizations (e.g. module merging, partitioning)
♦ Benefits
• Easy, correct, high performance
• Portable
• Paging / Virtualization is a logical extension (Automatic page partitioning)
12/2/05 Eylon Caspi 9
Outline
♦ Streaming for Hardware
♦ From Programming Model to Hardware Model
♦ Synthesis Methodology for FPGA
• Streams, Queues, SFSM Logic
♦ Characterization of 7 Multimedia Apps
♦ Optimizations
• Pipelining, Placement, Queue Sizing, Decomposition
12/2/05 Eylon Caspi 10
SCORE Model
♦ Application = Graph of stream-connected operators
♦ Operator = Process with local state
♦ Stream = FIFO channel, unbounded capacity, blocking read
♦ Segment = Memory, accessed via streams
♦ Dynamics:
• Dynamic I/O rates
• Dynamic graph construction (omitted in this work)
[Figure: graph of operators (SFSMs) and memory segments connected by streams]
SCORE = Stream Computations Organized for Reconfigurable Execution
12/2/05 Eylon Caspi 11
SCORE Programming Model: TDF
♦ TDF = behavioral language for
• SFSM Operators (Streaming Extended Finite State Machine)
• Static operator graphs
♦ State machine for
• Firing control
• Sequencing, branching
♦ Firing semantics
• In state X, wait for X’s inputs, then evaluate X’s action
state foo (i, j): o = i + j; goto bar;}
[Figure: operator with input streams i, j and output stream o]
12/2/05 Eylon Caspi 12
SCORE / TDF Process Networks
♦ A process from M inputs to N outputs, unified stream type S (i.e. S^M → S^N)
♦ SFSM = (Σ, σ0, σ, R, fNS, fO)
• Σ = Set of states
• σ0 ∈ Σ = Initial state
• σ ∈ Σ = Present state
• R ⊆ (Σ × S^M) = Set of firing rules
• fNS : R → Σ = Next state function
• fO : R → S^N = Output function
♦ Similar to dataflow process networks [Lee+Parks, IEEE May ‘95], but with stateful actors
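The firing semantics above can be sketched as a small software model. This is only an illustration, assuming streams are modeled as Python deques and a firing rule is the tuple of input names a state reads; the class `Sfsm`, its `step` method, and the behavior of state `bar` are hypothetical and not part of TDF.

```python
from collections import deque

class Sfsm:
    """Toy SFSM: each state names the inputs it must read before firing."""

    def __init__(self, rules, actions, start):
        self.rules = rules      # state -> tuple of input stream names
        self.actions = actions  # state -> f(tokens) -> (next_state, outputs)
        self.state = start

    def step(self, inputs, outputs):
        """Fire once if every input required by the present state has a token."""
        needed = self.rules[self.state]
        if any(len(inputs[name]) == 0 for name in needed):
            return False  # blocking read: stall without consuming anything
        tokens = {name: inputs[name].popleft() for name in needed}
        self.state, emitted = self.actions[self.state](tokens)
        for name, value in emitted.items():
            outputs[name].append(value)
        return True

# Mirrors the slide's "state foo (i, j): o = i + j; goto bar;".
# State "bar" is an invented placeholder that forwards i and returns to foo.
adder = Sfsm(
    rules={"foo": ("i", "j"), "bar": ("i",)},
    actions={
        "foo": lambda t: ("bar", {"o": t["i"] + t["j"]}),
        "bar": lambda t: ("foo", {"o": t["i"]}),
    },
    start="foo",
)

ins = {"i": deque([1, 10]), "j": deque()}
outs = {"o": deque()}
stalled = adder.step(ins, outs)   # j is empty, so foo cannot fire
ins["j"].append(2)
fired = adder.step(ins, outs)     # fires: emits 1 + 2 and moves to bar
```

The stall path consumes nothing, which is what makes the result independent of when tokens arrive.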
12/2/05 Eylon Caspi 13
Related Streaming Models
♦ Streaming Models
• Kahn PN, DFPN, BDF, SDF, CSDF, HDF, StreamsC, YAPI, Catapult C, SHIM
♦ Streaming Platforms
• Pleiades, Philips VSP, Imagine, TRIPS
♦ How do we differ?
• Stateful processes
• Deterministic
• Dynamic dataflow rates (FSM nodes)
• Direct synthesis to hardware
• Bounded Buffers
12/2/05 Eylon Caspi 14
Streaming Platforms
♦ FPGA (this work)
♦ Paged FPGA
• Page = fixed size partition, connected by streams
• Stylized page-to-page interconnect
• Hierarchical PAR
♦ Paged, Virtual FPGA (SCORE)
• Time shared pages
• Area abstraction (virtually large)
♦ Multiprocessor on Chip
♦ Heterogeneous
12/2/05 Eylon Caspi 15
The Compilation Problem
Programming Model: TDF
• Communicating SFSMs
- unrestricted size, # IOs, timing
• Unbounded stream buffering
Execution Model: FPGA
• Single circuit / configuration
- one or more clocks
• Fixed size queues
[Figure: TDF operators, streams, and memory segments compiled onto an FPGA; big semantic gap]
12/2/05 Eylon Caspi 16
The Semantic Gap
♦ Semantic gap between TDF, HW
♦ Need to bind:
• Stream protocol
• Stream pipelining
• Queue implementation
• Queue depths
• SFSM synthesis style (behavioral synthesis)
• Memory allocation
• Primary I/O
♦ SCORE device binds some implementation decisions (custom hardware), raw FPGA does not
♦ Want to characterize cost of implementation decisions
[Figure: compile step targeting the FPGA]
12/2/05 Eylon Caspi 17
Outline
♦ Streaming for Hardware
♦ From Programming Model to Hardware Model
♦ Synthesis Methodology for FPGA
• Streams, Queues, SFSM Logic
♦ Characterization of 7 Multimedia Apps
♦ Optimizations
• Pipelining, Placement, Queue Sizing, Decomposition
12/2/05 Eylon Caspi 18
Compilation Tool Flow
[Figure: compilation tool flow]
Application (TDF) → tdfc → Verilog → Synplify → EDIF (unplaced LUTs, etc.) → Xilinx ISE → Device configuration bits
tdfc:
• Local optimization
• System optimization: queue sizing, pipeline extraction, SFSM partitioning / merging, pipelining
• Generate flow ctl, streams, queues
Synplify:
• Behavioral synthesis
• Retiming
Xilinx ISE:
• Slice packing
• Place and route
12/2/05 Eylon Caspi 19
Wire Protocol for Streams
♦ D = Data, V = Valid, B = Backpressure
♦ Synchronous transaction protocol
• Producer asserts V when D ready, Consumer deasserts B when ready
• Transaction commits if (¬B ∧ V) at clock edge
• Encode EOS E as extra D bit (out of band, easy to enqueue)
[Figure: producer and consumer connected by D (Data) with E (EOS), V (Valid), and B (Backpressure); timing diagram of D, V, B against Clk]
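The commit rule can be illustrated with a cycle-by-cycle replay. The sketch below is a software stand-in, not hardware; the function names `commits` and `transfer` and the per-cycle traces are assumptions made for the example.

```python
def commits(valid, backpressure):
    """A transaction commits at a clock edge when V is high and B is low."""
    return bool(valid) and not backpressure

def transfer(data, valid_trace, backpressure_trace):
    """Replay per-cycle V and B traces; return tokens the consumer accepted."""
    pending = list(data)
    received = []
    for v, b in zip(valid_trace, backpressure_trace):
        if pending and commits(v, b):
            received.append(pending.pop(0))
    return received

# EOS is modeled as an extra bit carried alongside each data word.
tokens = [(0xA, 0), (0xB, 0), (0x0, 1)]          # (D, E) pairs
got = transfer(tokens,
               valid_trace=[1, 1, 0, 1, 1],
               backpressure_trace=[0, 1, 0, 0, 0])
```

In the trace above, cycle 1 stalls on backpressure and cycle 2 stalls because the producer has nothing valid, yet every token still arrives in order.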
12/2/05 Eylon Caspi 20
Operator Firing
♦ In state X, fire if
• Inputs desired by X are ready (Valid, EOS)
• Outputs emitted by X are ready (Backpressure)
♦ Firing guard / control flow
• if (iv && !ie && !ob) begin ib=0; ov=1; ... end
♦ Subtlety: master, slave
• Operator is slave
♦ To synchronize streams, (1) wait for flow control in, (2) fire / emit out
• Connecting two slaves would deadlock
• Need master (queue) between every pair of operators
[Figure: operator Op with input signals id,e / iv / ib and output signals od,e / ov / ob]
12/2/05 21
SFSM Synthesis
♦ Implemented as behavioral Verilog, using state ‘case’ in FSM and DP
♦ FSM handles firing control, branching
♦ FSM sends state to DP
♦ DP sends bool. flags to FSM
[Figure: SFSM split into an FSM (control) and a datapath (data registers, datapath for state 1, datapath for state 2), each with V, E, D, B stream I/O]
12/2/05 Eylon Caspi 22
FSM Module, Firing Control
TDF:
foo (input unsigned[16] x, input unsigned[16] y, output unsigned[16] o)
{
  state one (x, eos(y)) : o=x+1;
  ...
}

Verilog FSM module:
module foo_fsm (clock, reset, x_e, x_v, x_b, y_e, y_v, y_b, o_e, o_v, o_b, state, statecase);
  ...
  always @* begin
    // Default is stall
    x_b_=1; y_b_=1; o_e_=0; o_v_=0;
    state_reg_ = state_reg;
    statecase_ = statecase_stall;
    did_goto_ = 0;
    case (state_reg)
      state_one: begin
        // Firing condition(s) for state one
        if (x_v && !x_e && y_v && y_e && !o_b) begin
          statecase_ = statecase_1;
          // Stream flow ctl for state one
          x_b_=0; y_b_=0; o_v_=1; o_e_=0;
        end
        ...
  end // always @*
endmodule // foo_fsm
12/2/05 Eylon Caspi 23
Data-Path Module
TDF:
foo (input unsigned[16] x, input unsigned[16] y, output unsigned[16] o)
{
  state one (x, eos(y)) : o=x+1;
  ...
}

Verilog data-path module:
module foo_dp (clock, reset, x_d, y_d, o_d, state, statecase);
  ...
  always @* begin
    // Default is stall
    o_d_=16'bx;
    did_goto_ = 0;
    case (state)
      state_one: begin
        // Firing condition(s) for state one
        if (statecase_ == statecase_1) begin
          // Data-path for state one
          o_d_ = (x_d + 1'd1);
        end
        ...
  end // always @*
endmodule // foo_dp
12/2/05 Eylon Caspi 24
Stream Buffers (Queues)
♦ Systolic
• Cascade of depth-1 stages (or depth-N)
♦ Shift register
• Put: shift all entries
• Get: tail pointer
♦ Circular buffer
• Memory with head / tail pointers
12/2/05 Eylon Caspi 25
Enabled Register Queue
[Figure: enabled register with enable en between ports iD, iV, iB and oD, oV, oB]
♦ Systolic, depth-1 stage
♦ 1 state bit (empty/full) = V
♦ Shift in data unless:
• Full and downstream not ready to consume queued element
♦ Area: 1 FF per data bit
• On FPGA: 1 LUT cell per data bit
• Depth-1 (single stage) nearly free, since FFs pack with logic
♦ Speed: as fast as FF
• But combinationally connects producer + consumer via B
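A behavioral sketch of one such stage is given below, under the assumption that the register loads on every clock edge except when it is full and the consumer is blocked. The class name `EnabledRegisterStage` and its methods are hypothetical.

```python
class EnabledRegisterStage:
    """Depth-1 queue stage: one data register plus a valid (full) bit."""

    def __init__(self):
        self.data = None
        self.valid = False

    def input_backpressure(self, out_b):
        # Combinational path from the consumer's B to the producer's B.
        return bool(self.valid and out_b)

    def clock(self, in_d, in_v, out_b):
        """Advance one clock edge; return the token consumed downstream, if any."""
        taken = self.data if (self.valid and not out_b) else None
        enable = not (self.valid and out_b)   # hold only when full and blocked
        if enable:
            self.data, self.valid = in_d, bool(in_v)
        return taken

stage = EnabledRegisterStage()
log = []
log.append(stage.clock("a", 1, out_b=1))     # empty stage loads "a"
log.append(stage.clock("b", 1, out_b=1))     # full and blocked: holds "a"
blocked = stage.input_backpressure(out_b=1)  # producer must keep offering "b"
log.append(stage.clock("b", 1, out_b=0))     # consumer takes "a", stage loads "b"
log.append(stage.clock(None, 0, out_b=0))    # consumer takes "b", stage empties
```

Because `input_backpressure` depends directly on `out_b`, a cascade of these stages forms one long combinational chain on B, which is the speed concern the slide raises.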
12/2/05 Eylon Caspi 26
Xilinx SRL16
♦ SRL16 = Shift register of depth 16 in one 4-LUT cell
• Shift register of arbitrary width: parallel SRL16
• Arbitrary depth: cascade SRL16
♦ Improve queue density by 16x
[Figure: 4-LUT mode vs. SRL16 mode]
12/2/05 Eylon Caspi 27
Shift Register Queue
♦ State: empty bit + capacity counter
♦ Data stored in shift register
• In at position 0
• Out at position Address
♦ Address = number of stored elements minus 1
♦ Synplify infers SRL16E from Verilog array
• Parameterized depth, width
♦ Flow control
• ov = (State==Non-Empty)
• ib = (Address==Depth-1)
♦ Performance improvements
• Registered data out
• Registered flow control
• Specialized, pre-computed fullness
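[Figure: shift register queue: Address counter (+1 / -1 / 0) driving the shift register read position, comparators =Depth-1 (full) and =0 (zero), FSM with states Empty and Non-Empty, ports iB, iV, iD,E and oB, oV, oD,E]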
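The addressing scheme can be checked with a small behavioral model. This is a sketch only: the class `ShiftRegisterQueue` is hypothetical, a Python list stands in for the SRL16 column, and backpressure is asserted when the queue is full (Address equal to Depth-1 while non-empty), which is an assumption consistent with the full comparator in the figure.

```python
class ShiftRegisterQueue:
    """Behavioral model: data enters at position 0 and leaves at position Address."""

    def __init__(self, depth=16):
        self.depth = depth
        self.shreg = [None] * depth   # stands in for a column of SRL16 cells
        self.empty = True             # the "empty bit" of the FSM
        self.address = 0              # stored elements minus 1, when non-empty

    def out_valid(self):
        return not self.empty

    def in_backpressure(self):
        return (not self.empty) and self.address == self.depth - 1

    def head(self):
        return None if self.empty else self.shreg[self.address]

    def clock(self, in_d, in_v, out_b):
        """One clock edge; returns the token handed to the consumer, if any."""
        put = bool(in_v) and not self.in_backpressure()
        get = self.out_valid() and not out_b
        taken = self.head() if get else None
        count = (0 if self.empty else self.address + 1) + put - get
        if put:                       # shift every entry one position down
            self.shreg = [in_d] + self.shreg[:-1]
        self.empty = count == 0
        self.address = max(count - 1, 0)
        return taken

q = ShiftRegisterQueue(depth=4)
for word in "abcd":
    q.clock(word, 1, out_b=1)         # fill while the consumer is blocked
full = q.in_backpressure()
drained = [q.clock(None, 0, out_b=0) for _ in range(4)]
```

A put moves every stored word one position deeper, so the oldest word stays at index Address only if the counter is incremented in the same cycle; the model does both in `clock`.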
12/2/05 Eylon Caspi 28
SRL Queue with Registered Data Out
♦ Registered data out
• od (clock-to-Q delay)
• Non-retimable
♦ Data output register extends shift register
♦ Bypass shift register when queue empty
♦ 3 States
♦ Address = number of stored elements minus 2
♦ Flow control
• ov = !(State==Empty)
• ib = (Address==Depth-2)
[Figure: shift register queue with a Data Out register: Address counter (+1 / -1 / 0), comparators =Depth-2 (full) and =0 (zero), FSM with states Empty, One, More, ports iB, iV, iD,E and oB, oV, oD,E]
12/2/05 Eylon Caspi 29
SRL Queue with Registered Flow Ctl.
♦ Registered flow ctl.
• ov (clock-to-Q delay)
• ib (clock-to-Q delay)
• Non-retimable
♦ Flow control
• ov_next = !(State_next==Empty)
• ib_next = (Address_next==Depth-2)
♦ Based on pre-computed fullness
• full_next = (Address_next==Depth-2)
[Figure: shift register queue with Data Out register and registered flow control: Address counter (+1 / -1 / 0), comparators =Depth-2 (full, full_next) and =0 (zero), FSM with states Empty, One, More, ports iB, iV, iD,E and oB, oV, oD,E]
12/2/05 Eylon Caspi 30
SRL Queue with Specialized,Pre-Computed Fullness
♦ Speed up critical full pre-computation by special-casing full_next for each state
♦ Flow control
• ov_next = !(State_next==Empty)
• ib_next = full_next
♦ zero pre-computation is less critical
♦ Result
• >200MHz unless very large (e.g. 128 x 128)
• All output delays are clock-to-Q
• Area ≈ 3 x (SRL16E area)
[Figure: shift register queue with specialized fullness: Address counter (+1 / -1 / 0), comparators =Depth-3 (full) and =0 (zero), Data Out register, FSM with states Empty, One, More, ports iB, iV, iD,E and oB, oV, oD,E]
12/2/05 Eylon Caspi 31
SRL Queue Speed
12/2/05 Eylon Caspi 32
SRL Queue Area
12/2/05 Eylon Caspi 33
Page Synthesis
♦ Page = Cluster of Operator(s) + Queues
♦ SFSMs
• One or more per page
• Further decomposed into FSM, data-path
♦ Page Input Queues
• Deep
• Drain pipelined page-to-page streams before reconfiguration
♦ In-page Queues
• Shallow
♦ Separately Synthesizable Modules
• Separate characterization
• Consider custom resources
[Figure: page containing page input queue(s), Op 1 and Op 2 (each an FSM plus datapath), and a queue between them]
12/2/05 Eylon Caspi 34
Page Synthesis
[Figure: page containing page input queue(s), Op 1 and Op 2 (each an FSM plus datapath), and a queue between them]
♦ Module Hierarchy
• Local / output queues
• Individual SFSMs (combinational cores)
• Operators and local / output queues
• Input queues
• Page
12/2/05 Eylon Caspi 35
Outline
♦ Streaming for Hardware
♦ From Programming Model to Hardware Model
♦ Synthesis Methodology for FPGA
• Streams, Queues, SFSM Logic
♦ Characterization of 7 Multimedia Apps
♦ Optimizations
• Pipelining, Placement, Queue Sizing, Decomposition
12/2/05 Eylon Caspi 36
Tool Flow, Revisited
tdfc
Synplify
Xilinx ISE
Application
Bits
Verilog
EDIF(Unplaced LUTs, etc.)
Device Configuration
♦ Separate compilation for application, SFSMs
• Page
• SFSM
• Datapath
• FSM
Tool Options
• Identical queuing for every stream (SRL16 based, depth 16)
• I/O boundary regs (for Xilinx static timing analysis)
• Synplify 8.0
• Target 200MHz
• Optimize: FSM, retiming, pipelining
• Retain monolithic FSM encodings
• ISE 6.3i
• Constrain to minimum square area, at least max slice packing + 20%, expand if fail PAR
• Device: XC2VP70 -7
37
PAR Flow for Minimum Area
♦ EDIF netlist from Synplify
♦ Constraints file
• Page area
• Target period
♦ ngdbuild:
• Convert netlist EDIF → NGD
♦ map:
• Pack LUTs, MUXes, etc. into slices
♦ trce: (pre-PAR)
• Static timing analysis, logic only
♦ par:
• Place and route
♦ trce: (post-PAR)
• Static timing analysis
[Figure: flowchart from EDIF and constraints through ngdbuild, map, trce, par, and trce, with an Ok? check (yes / no) after map and after par; boxes labeled target packed slices, target packed timing, and target 1 extra row/col]
12/2/05 Eylon Caspi 38
SCORE Applications
♦ 7 Multimedia Applications / 279 Operators
• MPEG, JPEG, Wavelet, IIR • Written by Joe Yeh
• Mostly feed-forward streaming
• Constant consumption / production ratios,except compressors (ZLE, Huffman)
                                       Streams        Speed           Area   %Area    %Area
Application       SFSMs  Segments   In  Local  Out  (MHz)  (4-LUT cells)    FSMs   Queues
IIR                   8         0    1      7    1    166          1,922    3.4%    27.7%
JPEG Decode           9         1    1     41    8     47          7,442    7.0%    28.7%
JPEG Encode          11         4    8     42    1     57          6,728    7.5%    36.9%
MPEG Encode IP       80        16    6    231    1     47         41,472    5.5%    39.7%
MPEG Encode IPB     114        17    3    313    1     50         65,772    5.2%    40.5%
Wavelet Encode       30         6    1     50    7    106          8,320   10.1%    32.0%
Wavelet Decode       27         6    7     49    1    109          8,712    8.5%    29.6%
Total               279        50   27    733   20      -        140,368    5.9%    38.3%
12/2/05 Eylon Caspi 39
Page Area
♦ 87% of SFSMs are smaller than 512 LUTs — by design
♦ FSMs small
♦ Datapaths dominate in most large pages
[Figure: page area breakdown, with DCT and IDCT pages labeled]
12/2/05 Eylon Caspi 40
Page Speed
♦ FSMs (flow control) are fast, never critical
♦ Queues are critical for 1/3 fastest pages
♦ Datapaths dominate
43% 47% 10%
12/2/05 Eylon Caspi 41
Outline
♦ Streaming for Hardware
♦ From Programming Model to Hardware Model
♦ Synthesis Methodology for FPGA
• Streams, Queues, SFSM Logic
♦ Characterization of 7 Multimedia Apps
♦ Optimizations
• Pipelining, Placement, Queue Sizing, Decomposition
12/2/05 Eylon Caspi 42
Improving Performance, Area
♦ Local (module) Optimization
• Traditional compiler optimization (const folding, CSE, etc.)
• Datapath pipelining / loop scheduling
• Granularity transforms (composition / decomposition)
♦ System Level Optimization
• Interconnect pipelining
• Shrink / remove queues
• Area-time transformations (rate matching, serialization, parallelization)
12/2/05 Eylon Caspi 43
Pipelining With Streams
♦ Datapath pipelining
• Add registers at output (or input)
• Retime into place
♦ Harder in practice (FSM, cycles)
• Add registers at strategic locations
• Rewrite control
• Avoid violating communication protocol
♦ Stream pipelining
• Add registers on streams
• Retime into datapath
• Modify queues, not processes
[Figure: FSM and DP blocks with pipeline registers added to the datapath and to the stream]
12/2/05 Eylon Caspi 44
Logic Pipelining
[Figure: producer, L pipeline registers on D (Data) and V (Valid) retimed backwards into the producer, queue with L reserve, consumer; B (Backpressure) returns unpipelined]
♦ Add L pipeline registers to D, V
♦ Retime backwards
• This pipelines feed-forward parts of producer’s data-path
♦ Stale flow control may overflow queue (by L)
♦ Modify queue to emit back-pressure when empty slots ≤ L
♦ No manual modification of processes
12/2/05 Eylon Caspi 45
Logic Relaying + Retiming
♦ Break up deep logic in a process
♦ Relay through enabled register queue(s)
♦ Retime registers into adjacent process
• This pipelines feed-forward parts of process’s datapath
• Can retime into producer or consumer
♦ No manual modification of processes
[Figure: producer, original queue, enabled register (en) relay stage retimed into the adjacent process, consumer; every link carries D, V, B]
12/2/05 Eylon Caspi 46
Benefits, Limitations
♦ Benefits
• Simple to implement, relies only on retiming
• Sufficient for many cases, e.g. DCT, IDCT
♦ Limitations
• Feed-forward only (weaker than loop sched.)
• Resource sharing obfuscates retiming opportunities
♦ Extends to interconnect pipelining
• Do not retime into logic; register placement only
• Also pipeline B, modify queue
12/2/05 Eylon Caspi 47
Pipelining Configuration
♦ Pipeline depth parameters: Li+Lp+Lr
♦ Uniform pipelining: same depths for every stream
[Figure: SFSM with a queue with reserve, Li enabled registers for input side logic relaying, Lr enabled registers for output side logic relaying, and Lp registers for logic pipelining; the registers are retimed and every link carries D, V, B]
12/2/05 Eylon Caspi 48
Speedup from Logic Pipelining
Enabled Regs (Lr) D FFs (Lp)
12/2/05 Eylon Caspi 49
Expansion from Logic Pipelining
Enabled Regs (Lr) D FFs (Lp)
12/2/05 Eylon Caspi 50
Some Things Are Better Left Unpipelined
♦ Page speedup:
♦ Page expansion:
♦ Initially fast pages should not be pipelined
12/2/05 Eylon Caspi 51
Page Specific Logic Pipelining
♦ Separate pipelining of each SFSM
♦ Assumption: application speed = slowest page speed
• Critical Page
♦ Repeatedly improve slowest page until no further improvement is possible
♦ Page improvement heuristics
• Greedy Lr : Add one level of pipelining in 0+0+Lr
• Greedy Lp : Add one level of pipelining in 1+Lp+0
• Max : Pipeline to best page speed (brute force)
♦ Greedy heuristics may end early
• Non-monotonicity: adding a level of pipelining may slow page
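The greedy loop can be sketched as follows. Everything here is illustrative: `speed_of` is a hypothetical oracle standing in for a full synthesis and place-and-route run, the page names and MHz figures are invented, and a single depth number stands in for one of the Lr or Lp parameters.

```python
def greedy_page_pipelining(speed_of, pages, max_depth=8):
    """Deepen the pipeline of the slowest page until it stops improving.

    speed_of(page, depth) is a hypothetical oracle that reports the page
    clock rate in MHz at a given pipelining depth.
    """
    depth = {p: 0 for p in pages}
    while True:
        critical = min(pages, key=lambda p: speed_of(p, depth[p]))
        current = speed_of(critical, depth[critical])
        if depth[critical] >= max_depth:
            break
        if speed_of(critical, depth[critical] + 1) <= current:
            break   # greedy stop: one more level does not help the critical page
        depth[critical] += 1
    app_speed = min(speed_of(p, depth[p]) for p in pages)
    return depth, app_speed

# Invented speeds (MHz) indexed by pipelining depth, for illustration only.
table = {
    "dct":   [50, 80, 110, 105],
    "quant": [90, 95, 97, 98],
    "zle":   [120, 115, 125, 130],
}
oracle = lambda page, d: table[page][min(d, len(table[page]) - 1)]
depths, speed = greedy_page_pipelining(oracle, list(table))
```

The invented `zle` row dips at depth 1 before recovering, which is the kind of non-monotonicity that would stop a greedy search early if that page ever became critical.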
12/2/05 Eylon Caspi 52
Speedup from Page Specific
Enabled Regs (Lr) D FFs (Lp)
12/2/05 Eylon Caspi 53
Expansion from Page Specific
Enabled Regs (Lr) D FFs (Lp)
12/2/05 Eylon Caspi 54
Interconnect Delay
♦ Critical routing delay grows with circuit size
• Routing delay for an application: avg. 45% - 56%
• Routing delay for its slowest page: avg. 40% - 50%
• Ratio (appl. to slowest page): avg. 0.99x - 1.34x
♦ Averaged over 7 apps / varies with logic pipelining
♦ Modular design helps
• Retain critical routing delay of page, not application
• Page-to-page delays (streams) can be pipelined
12/2/05 Eylon Caspi 55
Interconnect Pipelining
[Figure: producer, W pipeline registers on D (Data), V (Valid), and B (Backpressure) across a long distance, queue with 2W reserve, consumer]
♦ Add W pipeline registers to D, V, B
• Mobile registers for placer
• Not retimable
♦ Stale flow control may overflow queue (by 2W)
• Staleness = total delay on B-V feedback loop = 2W
♦ Modify downstream queue to emit back-pressure when empty slots ≤ 2W
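The 2W figure can be reproduced with a worst-case simulation in which the consumer never drains and the producer sends whenever it sees B low. This is a sketch under those assumptions; the function `max_occupancy` and its parameters are hypothetical.

```python
from collections import deque

def max_occupancy(depth, w, reserve, cycles=64):
    """Peak tokens the downstream queue must hold in the worst case.

    D/V and B each cross w pipeline registers. The queue asserts B when
    its empty slots are at or below `reserve`.
    """
    fwd = deque([0] * w)      # V bits in flight toward the queue
    back = deque([0] * w)     # B bits in flight toward the producer
    occupancy = peak = 0
    for _ in range(cycles):
        b_at_queue = int(depth - occupancy <= reserve)
        back.append(b_at_queue)
        b_at_producer = back.popleft()     # B as it looked w cycles ago
        fwd.append(int(not b_at_producer))
        occupancy += fwd.popleft()         # token sent w cycles ago arrives
        peak = max(peak, occupancy)
    return peak

safe = max_occupancy(depth=16, w=2, reserve=4)     # reserve = 2W
overflow = max_occupancy(depth=16, w=2, reserve=0) # no reserve
```

With a reserve of 2W the peak equals the queue depth; with no reserve the peak exceeds it by exactly 2W, matching the staleness of the B-V loop.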
12/2/05 Eylon Caspi 56
Speedup from Interconnect Pipelining
12/2/05 Eylon Caspi 57
Speedup from Interconnect Pipelining,No Area Constraint
12/2/05 Eylon Caspi 58
Expansion from InterconnectPipelining, No Area Constraint
12/2/05 Eylon Caspi 59
Interconnect Register Allocation
♦ Commercial FPGAs / tool flows
• No dedicated interconnect registers
• Allocation: add to netlist, slice pack, place-and-route
• If registers are packed with logic: limited register mobility
• If registers are packed alone: area overhead
♦ Better: Post-placement register allocation
• Weaver et al., “Post-Placement C-Slow Retiming for the Xilinx Virtex FPGA,” FPGA 2003
• Allocation: PAR, c-slow, retime, scavenge registers, reroute
• No area overhead (scavenge registers from existing placement)
• Better performance, since routing delay is known
• Modification for streaming:
♦ PAR, pipeline, retime, scavenge registers, reroute, modify queue depths (configuration specialization)
12/2/05 Eylon Caspi 60
Throughput Modeling
♦ Pipelining feedback loops may reduce throughput (tokens per clock period)
• Which loops / streams are critical?
♦ Throughput model for PN
• Feedback cycle C with M tokens, N pipe delays, has token period (clock periods per token): TC = N/M
• Overall token period: T = maxC {TC}
• Available slack: CycleSlackC = (T - TC)
• Generalize to multi-rate, dynamic rate by unfolding equivalent single-rate PN
[Figure: process network with two feedback cycles, TC1 = 3 and TC2 = 2]
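A short calculation illustrates the model. It takes the token period of a cycle to be clock periods per token, that is pipe delays divided by tokens, which is the reading that agrees with the figure's values of 3 and 2 and with taking the maximum over cycles. The token and delay counts below are invented to reproduce those values, and the function names are hypothetical.

```python
from fractions import Fraction

def token_period(tokens, pipe_delays):
    """Clock periods per token around one feedback cycle."""
    return Fraction(pipe_delays, tokens)

def analyze(cycles):
    """cycles: name -> (tokens M, pipe delays N). Returns (T, slack per cycle)."""
    periods = {name: token_period(m, n) for name, (m, n) in cycles.items()}
    overall = max(periods.values())
    slack = {name: overall - t for name, t in periods.items()}
    return overall, slack

# Invented counts chosen to give T_C1 = 3 and T_C2 = 2.
T, slack = analyze({"C1": (1, 3), "C2": (2, 4)})
```

Here C1 sets the overall period and has no slack, while C2 could absorb additional pipelining without slowing the network.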
12/2/05 Eylon Caspi 61
Throughput Aware Optimizations
♦ Throughput aware placement
• Adapt [Singh+Brown, FPGA 2002]
• Stream slack: Te = maxC s.t. e∈C {TC}
• Stream net criticality: crit = 1 - ((T - Te) / T)
♦ Throughput aware pipelining
• Pipeline stream w/o exceeding slack
• Pipeline module s.t. depth does not exceed any output stream slack
♦ Pipeline balancing (by retiming)
♦ Process Serialization
• Serial arithmetic for process with low throughput, high slack
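The criticality formula can be evaluated directly once each cycle's token period is known. The sketch below assumes integer token periods and an invented mapping from cycles to the streams they contain; `stream_criticality` is a hypothetical helper, and streams that lie on no feedback cycle are simply omitted.

```python
from fractions import Fraction

def stream_criticality(cycle_periods, cycle_streams):
    """Te = max T_C over cycles containing stream e; crit = 1 - (T - Te) / T."""
    overall = max(cycle_periods.values())
    te = {}
    for cycle, streams in cycle_streams.items():
        for e in streams:
            te[e] = max(te.get(e, 0), cycle_periods[cycle])
    return {e: 1 - Fraction(overall - t, overall) for e, t in te.items()}

# Invented example: stream b lies on both cycles, c only on the faster one.
crit = stream_criticality(
    {"C1": 3, "C2": 2},
    {"C1": ["a", "b"], "C2": ["b", "c"]},
)
```

Streams on the slowest cycle get criticality 1, so a placer would keep them short, while stream c at 2/3 can tolerate a longer route or extra pipelining.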
12/2/05 Eylon Caspi 62
Stream Buffer Sizing
♦ Fixed size buffers in hardware
• For minimum area (want smallest feasible queue)
• For performance (want deep enough to avoid stalls from producer-consumer timing mismatch)
♦ Semantic gap
• Buffers are unbounded in TDF, bounded in HW
• Small buffer may create artificial deadlock (bufferlock)
• Theorem: memory bound is undecidable for a Turing complete process network
• In practice, our buffering requirements are small
[Figure: two producer / consumer pairs communicating over streams x and y, one labeled Bounded and one labeled Unbounded]
12/2/05 Eylon Caspi 63
Dealing with Undecidability
♦ Handle unbounded streams
• Buffer expansion [Parks ‘95]
♦ Detect bufferlock, expand buffers
• Hardware implementation
♦ Buffer expansion = rewire to another queue
♦ Storage in off-chip memory or queue bank
♦ Guarantee depth bound for some cases
• User depth annotation
• Analysis
♦ Identify compatible SFSMs with balanced schedules
♦ Detect bufferlock and fail
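The detect-and-expand idea can be sketched in software. This is a simplified illustration in the spirit of the buffer expansion approach cited above, not the hardware mechanism: processes are modeled as Python generators that yield `("get", q)` or `("put", q, value)`, and the names `run_with_expansion` and `_advance` are hypothetical.

```python
from collections import deque

def _advance(proc, value):
    """Resume a process; return its next request, or None when it finishes."""
    try:
        return proc.send(value)
    except StopIteration:
        return None

def run_with_expansion(processes, capacity, max_steps=1000):
    """Run processes over bounded queues, growing a queue on bufferlock.

    capacity maps queue name to its current bound and is updated in place.
    """
    queues = {q: deque() for q in capacity}
    pending = {name: next(proc, None) for name, proc in processes.items()}
    for _ in range(max_steps):
        progressed = False
        for name, proc in processes.items():
            op = pending[name]
            if op is None:
                continue
            if op[0] == "get" and queues[op[1]]:
                pending[name] = _advance(proc, queues[op[1]].popleft())
                progressed = True
            elif op[0] == "put" and len(queues[op[1]]) < capacity[op[1]]:
                queues[op[1]].append(op[2])
                pending[name] = _advance(proc, None)
                progressed = True
        if progressed:
            continue
        full = [op[1] for op in pending.values() if op and op[0] == "put"]
        if not full:
            break                       # finished, or a true deadlock
        smallest = min(full, key=lambda q: capacity[q])
        capacity[smallest] += 1         # bufferlock: expand the smallest full queue
    return capacity

def producer():
    for i in range(3):
        yield ("put", "x", i)
    yield ("put", "y", "done")

def consumer(seen):
    seen.append((yield ("get", "y")))
    for _ in range(3):
        seen.append((yield ("get", "x")))

seen = []
bounds = run_with_expansion(
    {"p": producer(), "c": consumer(seen)}, {"x": 1, "y": 1})
```

The consumer reads y before x, so queue x must hold all three tokens; starting from depth 1, the run stalls twice on a blocked write and grows x to depth 3.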
12/2/05 Eylon Caspi 64
Interface Automata
♦ A finite state machine that transitions on I/O actions
• Not input-enabled (not every I/O on every cycle)
♦ G = (V, E, Ai, Ao, Ah, Vstart)
• Ai = input actions x? (in CSP notation)
• Ao = output actions y! (in CSP notation)
• Ah = internal actions z; (in CSP notation)
• E ⊂ V x (Ai ∪ Ao ∪ Ah) x V (transition on action)
♦ Execution trace = (v, a, v, a, …) (non-deterministic branching)
[Figure: select operator (inputs s, t, f; output o) and its interface automaton with states S, S’, T, F, T’, F’ and actions s?, st;, sf;, t?, f?, o!]
de Alfaro + Henzinger, Symp. Found. SW Eng. (FSE) 2001
12/2/05 65
Automata Composition
♦ Composition ~ product FSM with synchronization (rendezvous) on common actions
♦ Composition edges:
• (i) step A on unshared action
• (ii) step B on unshared action
• (iii) step both on shared action
♦ Compatible composition → bounded memory
[Figure: automaton A (x?, y!) feeding automaton B (y?, z!) over stream y; direct product over states AB, A’B, AB’, A’B’; compatible composition with actions x?, y;, z!]
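The three composition edge rules can be sketched as a reachable-product construction. The encoding is an assumption made for the example: an automaton is a start state plus `(src, action, dst)` triples, actions are strings ending in `?`, `!`, or `;`, and both example automata are assumed to loop back to their start state.

```python
def compose(a, b, shared):
    """Product of two automata that synchronize on shared action names."""
    (a0, a_edges), (b0, b_edges) = a, b
    name = lambda act: act[:-1]
    edges, seen, todo = set(), {(a0, b0)}, [(a0, b0)]
    while todo:
        sa, sb = todo.pop()
        steps = []
        for src, act, dst in a_edges:
            if src == sa and name(act) not in shared:
                steps.append((act, (dst, sb)))          # (i) step A alone
        for src, act, dst in b_edges:
            if src == sb and name(act) not in shared:
                steps.append((act, (sa, dst)))          # (ii) step B alone
        for s1, act1, d1 in a_edges:
            for s2, act2, d2 in b_edges:
                if ((s1, s2) == (sa, sb) and name(act1) == name(act2)
                        and name(act1) in shared and act1 != act2):
                    # (iii) output meets input: becomes an internal action
                    steps.append((name(act1) + ";", (d1, d2)))
        for act, nxt in steps:
            edges.add(((sa, sb), act, nxt))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return (a0, b0), edges

A = ("A", [("A", "x?", "A'"), ("A'", "y!", "A")])
B = ("B", [("B", "y?", "B'"), ("B'", "z!", "B")])
start, edges = compose(A, B, shared={"y"})
```

For this pair the composition reaches all four product states through five edges, with the shared `y` appearing only as the internal action `y;`.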
12/2/05 Eylon Caspi 66
Stream Buffer Bounds Analysis
♦ Given a process network, find minimum buffer sizes to avoid bufferlock
♦ Buffer (queue) is also automaton
♦ Symbolic Parks’ algorithm
• Compose network using arbitrary buffer sizes
• If deadlock, try larger sizes
♦ Practical considerations: avoiding state explosion
• Multi-action automata
• Know which streams to expand first
• Compose pairwise in clever order
♦ Composition is associative
• Cull states reaching deadlock
• Partition system
[Figure: processes A and B connected by streams x and y, with a queue automaton Q inserted on x; queue automaton with states 0, 1, 2 and actions i?, o!]
12/2/05 Eylon Caspi 67
SFSM Decomposition (Partitioning)
♦ Why decompose
• To improve locality
• To fit into custom page resources
♦ Decomposition by state clustering
• 1 state (i.e. 1 cluster) active at a time
♦ Cluster states to contain transitions
• Fast local transitions, slow external trans.
• Formulation: minimize cut of transition probability under area, I/O constraints
♦ Similar to:
• VLIW trace scheduling [Fisher ‘81]
• FSM decomp. for low power [Benini/DeMicheli ISCAS ‘98]
• GarpCC HW/SW partitioning [Callahan ‘00]
• VM/cache code placement
[Figure: state clusters connected by state flow and data flow edges]
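The objective being minimized can be made concrete with a small metric. The sketch assumes a transition profile given as observed counts per state pair; the profile, the state names, and the function `external_transition_fraction` are invented for illustration, and area and I/O constraints are left out.

```python
def external_transition_fraction(transitions, cluster_of):
    """Share of transition weight that crosses cluster boundaries.

    transitions: (src_state, dst_state) -> probability or observed count.
    cluster_of:  state -> cluster id.
    """
    total = sum(transitions.values())
    cut = sum(p for (s, d), p in transitions.items()
              if cluster_of[s] != cluster_of[d])
    return cut / total

# Invented profile for a four-state SFSM.
profile = {("A", "B"): 49, ("B", "A"): 49, ("B", "C"): 1,
           ("C", "D"): 51, ("D", "C"): 49, ("D", "A"): 1}
good = external_transition_fraction(profile, {"A": 0, "B": 0, "C": 1, "D": 1})
bad = external_transition_fraction(profile, {"A": 0, "C": 0, "B": 1, "D": 1})
```

Clustering {A,B} and {C,D} leaves 1% of transitions external, whereas the alternative grouping makes every transition cross a partition boundary.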
12/2/05 Eylon Caspi 68
Early SFSM Decomposition Results
♦ Approach 1: Balanced, multi-way, min-cut
• Modified Wong FBB [Yang+Wong, ACM ‘94]
• Edge weight is mix: c*(transition probability) + (1-c)*(wire bits)
• Poor at simultaneous I/O constraint + cut optimization
♦ Approach 2: Spectral order + Extent cover
• Spectral ordering clusters connected components in 1D
♦ Minimize squared weighted distance, weight is mix (as above)
• Then choose area + I/O feasible extents [start, end] using dynamic programming
• Effective for partitioning to custom page resources
♦ Under 2% external transitions
• Amdahl’s law: few slow transitions ⇒ small performance loss
• Achievable with either approach
12/2/05 Eylon Caspi 69
Summary
♦ Streaming addresses large system design challenges
• Growing interconnect delay
• Design complexity
• Flexibly timed module interfaces
• Reuse
♦ Methodology to compile SCORE applications to Virtex-II Pro
• Language + compiler support for streaming
♦ Characterized 7 applications on Virtex-II Pro
• Queue area ~38%
• Flow control FSM area ~6%
♦ Improve by merging SFSMs, eliminating queues
♦ Stream pipelining
• For logic
• For interconnect
♦ Stream based optimizations
• Pipelining
• Queue sizing
• Module Merging
• Partitioning
• Placement
• Serialization
12/2/05 Eylon Caspi 70
Supplemental Material
12/2/05 Eylon Caspi 71
TDF ∈ Dataflow Process Networks
♦ Dataflow Process Networks [Lee+Parks, IEEE May ‘95]
• Process enabled by set of firing rules: R = { r1, r2, … , rK }
• Firing rule = set of input patterns: ri = ( ri,1, ri,2 , … , ri,M )
♦ DF process for a TDF operator:
• Feedback arc for state
• Firing rule(s) per state
♦ Patterns match state + input presence
♦ E.g. for state σ : rσ = ( [σ], rσ,1, rσ,2 , … )
♦ Patterns:
rσ,j = [*] if input j is in input signature of state σ
rσ,j = ⊥ if input j is not in input signature of state σ
• Single firing rule per state = DFPN sequential firing rules
• Multiple firing rules per state translate the same way, with restrictions to retain determinism
[Figure: process with a feedback arc carrying its state]
12/2/05 Eylon Caspi 72
SFSM Partitioning Transform
♦ Only 1 partition active at a time
• Transform to activate via streams
♦ New state in each partition: “wait”
• Used when not active
• Waits for activation from other partition(s)
• Has one input signature (firing rule) per activator
♦ Firing rules are not sequential, but determinism guaranteed
• Only 1 possible activator
♦ Activation streams from given source to given dest. partitions can be merged + binary-encoded
[Figure: SFSM with states A, B, C, D split into partition {A,B} with added state WaitAB and partition {C,D} with added state WaitCD, linked by activation streams]
73
Virtual Paged Hardware (SCORE)
♦ Compute model has unbounded resources
• Programmer does not target a particular device size
♦ Paging
• “Compute pages” swapped in/out (like virtual memory)
• Page context = thread (FSM to block on stream access)
♦ Efficient virtualization
• Amortize reconfiguration cost over an entire input buffer
• Requires “working sets” of tightly-communicating pages to fit on device
[Figure: compute pages Transform, Quantize, RLE, Encode connected through buffers]