CprE / ComS 583 – Reconfigurable Computing
Prof. Joseph Zambreno, Department of Electrical and Computer Engineering, Iowa State University
Lecture #23 – Function Unit Architectures
Quick Points
• Next week Thursday: project status updates
• 10 minute presentations per group + questions
• Upload to WebCT by the previous evening
• Expected that you’ve made some progress!
Allowable Schedules
Active LUTs (NA) = 3
Sequentialization
• Adding time slots
• More sequential (more latency)
• Adds slack
• Allows better balance
L = 4, NA = 2 (4 or 3 contexts)
Multicontext Scheduling
• “Retiming” for multicontext
• Goal: minimize peak resource requirements
• NP-complete
• List schedule, anneal (a greedy sketch follows below)
• How do we accommodate intermediate data?
• Effects?
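To make the list-schedule idea concrete, here is a minimal greedy sketch (not the lecture's tool): it places each LUT in the earliest context its inputs allow and reports the peak active-LUT count, the quantity the scheduler tries to minimize. The netlist is a made-up example.

```python
# Minimal list-scheduling sketch (illustrative only): place each LUT in the
# earliest context its predecessors allow, then report the peak number of
# active LUTs in any one context. The netlist below is hypothetical.
from collections import defaultdict

def list_schedule(netlist):
    """netlist: {lut: [predecessor luts]} (a DAG); returns {lut: context index}."""
    context = {}
    remaining = dict(netlist)
    while remaining:
        for lut, preds in list(remaining.items()):
            if all(p in context for p in preds):
                # Earliest context that respects all dependences (ASAP).
                context[lut] = 1 + max((context[p] for p in preds), default=-1)
                del remaining[lut]
    return context

def peak_active_luts(context):
    per_ctx = defaultdict(int)
    for c in context.values():
        per_ctx[c] += 1
    return max(per_ctx.values())

# Hypothetical diamond-shaped dependence graph.
netlist = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["c"], "f": ["d", "e"]}
ctx = list_schedule(netlist)
print(ctx, "peak active LUTs:", peak_active_luts(ctx))
```

A real scheduler would also exploit the slack created by extra contexts (previous slide) and could anneal assignments to lower the peak further; this ASAP version only respects dependences.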
Signal Retiming
• Non-pipelined
• Hold value on LUT output (wire) from production through consumption
• Wastes wires and switches by occupying them
• For the entire critical path delay L
• Not just for the 1/L’th of the cycle it takes to cross the wire segment
• How will it show up in multicontext?
Signal Retiming
• Multicontext equivalent
• Need a LUT to hold the value for each intermediate context (a small counting sketch follows)
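A rough way to see this retiming cost, under the assumption that a value must occupy a LUT in every context strictly between its production and its consumption:

```python
# Counts holding slots for multicontext retiming: a value produced in context
# i and consumed in context j is assumed to occupy a LUT in contexts
# i+1 .. j-1. Edge list and context assignment are hypothetical.

def retiming_slots(edges, context):
    """edges: (producer, consumer) pairs; context: {lut: context index}."""
    total = 0
    for prod, cons in edges:
        total += max(context[cons] - context[prod] - 1, 0)
    return total

context = {"a": 0, "b": 0, "c": 1, "f": 3}
edges = [("a", "c"), ("b", "c"), ("a", "f")]   # "a" -> "f" skips contexts 1 and 2
print(retiming_slots(edges, context))           # -> 2 holding slots
```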
Full ASCII→Hex Circuit
• Logically three levels of dependence
• Single context: 21 LUTs @ 880 Kλ² = 18.5 Mλ²
Multicontext Version
• Three contexts: 12 LUTs @ 1,040 Kλ² = 12.5 Mλ²
• Pipelining needed for dependent paths
ASCII→Hex Example
• All retiming on wires (active outputs)
• Saturation based on inputs to the largest stage
• With enough contexts only one LUT is needed
• Increased LUT area due to additional stored configuration information
• Eventually the additional interconnect savings are taken up by LUT configuration overhead
• Ideal = perfect scheduling spread + no retime overhead
ASCII→Hex Example (cont.)
• @ depth = 4, c = 6: 5.5 Mλ² (compare 18.5 Mλ²)
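The slide's area figures follow from a one-line model, total area = active LUTs × per-LUT area, where the per-LUT areas already include the extra configuration SRAM; the numbers below are taken directly from the slides.

```python
# Reproduces the ASCII->Hex area arithmetic from the slides.
K = 1e3   # Klambda^2
M = 1e6   # Mlambda^2

single_context = 21 * 880 * K     # 21 LUTs @ 880 Klambda^2
three_contexts = 12 * 1040 * K    # 12 LUTs @ 1040 Klambda^2 (larger: stored contexts)

print(single_context / M)   # ~18.5 Mlambda^2
print(three_contexts / M)   # ~12.5 Mlambda^2
```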
General Throughput Mapping
• If we only want to achieve limited throughput
• Target: produce a new result every t cycles
• Spatially pipeline every t stages (cycle = t)
• Retime to minimize register requirements
• Multicontext evaluation within a spatial stage
• Retime (list schedule) to minimize resource usage
• Map for depth (i) and contexts (c) (sketch below)
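A small sketch of the recipe above (retiming omitted, level sizes made up): cut the circuit into spatial stages of at most t logic levels, and evaluate the levels inside each stage across contexts.

```python
# Sketch of general throughput mapping: one new result every t cycles means
# spatial pipeline stages of up to t logic levels, with the levels inside a
# stage evaluated in successive contexts. Level sizes are hypothetical.

def throughput_map(levels, t):
    """levels: LUT count per logic level (input to output); t: cycles per result."""
    stages = [levels[i:i + t] for i in range(0, len(levels), t)]
    plan = []
    for s, stage_levels in enumerate(stages):
        plan.append({"stage": s,
                     "contexts": len(stage_levels),        # one context per level
                     "active_luts": max(stage_levels)})    # peak within the stage
    return plan

# Hypothetical 8-level circuit, new input accepted every t = 4 cycles.
print(throughput_map([6, 9, 7, 4, 5, 8, 3, 2], t=4))
```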
Benchmark Set
• 23 MCNC circuits
• Area mapped with SIS and Chortle
Area v. Throughput
Area v. Throughput (cont.)
Reconfiguration for Fault Tolerance
• Embedded systems require high reliability in the presence of transient or permanent faults
• FPGAs contain substantial redundancy
• Possible to dynamically “configure around” problem areas
• Numerous on-line and off-line solutions
Column Based Reconfiguration
• Huang and McCluskey
• Assume that each FPGA column is equivalent in terms of logic and routing
• Preserve empty columns for future use
• Somewhat wasteful
• Precompile and compress differences in bitstreams
Column Based Reconfiguration (cont.)
• Create multiple copies of the same design with different unused columns
• Only requires different inter-block connections
• Can lead to an unreasonable configuration count (selection sketch below)
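A toy illustration of the multiple-copies idea: precompile one bitstream variant per spare column and, once a faulty column is diagnosed, load the variant that leaves that column unused. The file names are placeholders.

```python
# Precompiled variants of the same design, each leaving a different column
# unused (placeholder file names).
configs = {
    0: "design_spare_col0.bit",
    1: "design_spare_col1.bit",
    2: "design_spare_col2.bit",
    3: "design_spare_col3.bit",
}

def select_config(faulty_column):
    """Pick the variant whose spare column matches the diagnosed fault."""
    try:
        return configs[faulty_column]
    except KeyError:
        raise ValueError(f"no precompiled variant spares column {faulty_column}")

print(select_config(2))   # -> design_spare_col2.bit
```

The "unreasonable configuration count" concern is simply that this table grows with the number of columns; the next slide's approach stores only the compressed differences between variants instead.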
Column Based Reconfiguration (cont.)
• Determining differences and compressing the results leads to “reasonable” overhead
• Scalability and fault diagnosis are issues
Summary
• In many cases we cannot profitably reuse logic at the device cycle rate
• Cycles, no data parallelism
• Low throughput, unstructured
• Dissimilar, data-dependent computations
• These cases benefit from having more than one instruction/operation per active element
• Economical retiming becomes important here to achieve active LUT reduction
• For c = [4, 8], I = [4, 6], automatically mapped designs are 1/2 to 1/3 of the single-context size
Outline
• Continuation
• Function Unit Architectures
   • Motivation
   • Various architectures
   • Device trends
Coarse-grained Architectures
• DP-FPGA
   • LUT-based
   • LUTs share configuration bits
• RaPiD
   • Specialized ALUs, multipliers
   • 1-D pipeline
• MATRIX
   • 2-D array of ALUs
• Chess
   • Augmented, pipelined matrix
• RAW
   • Full RISC core as basic block
   • Static scheduling used for communication
DP-FPGA
• Break FPGA into datapath and control sections
• Save storage for LUTs and connection transistors
• Key issue is grain size
• Cherepacha/Lewis – U. Toronto
Configuration Sharing
[Figure: configuration sharing in the DP-FPGA datapath. MC = LUT SRAM bits, CE = connection block pass transistors; design point: set MC = 2-3× CE.]
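A back-of-the-envelope model of what the figure is getting at, under my reading that the LUT SRAM bits (MC) are shared across an N-bit datapath slice while the connection-block configuration (CE) stays per bit:

```python
# Configuration bits for an N-bit slice, with and without sharing the LUT
# SRAM bits (MC) across the bit positions. CE stays per bit in both cases.
# The sharing assumption is my reading of the figure, not a quoted formula.

def config_bits(N, MC, CE, shared):
    return CE * N + (MC if shared else MC * N)

MC, CE = 3.0, 1.0      # slide's rule of thumb: MC = 2-3x CE
for N in (1, 4, 8):
    print(N, config_bits(N, MC, CE, shared=False), config_bits(N, MC, CE, shared=True))
```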
Two-dimensional Layout
• Control network supports distributed signals
• Data routed as four-bit values
DP-FPGA Technology Mapping
• Ideal case would be if all datapath widths were divisible by 4, with no “irregularities”
• Area improvement includes logic values only
• Shift logic included
RaPiD
• Reconfigurable Pipelined Datapath
• Ebeling – University of Washington
• Uses hard-coded functional units (ALU, memory, multiplier)
• Good for signal processing
• Linear array of processing elements
RaPiD Datapath
• Segmented linear architecture
• All RAMs and ALUs are pipelined
• Bus connectors also contain registers
RaPiD Control Path
• In addition to static control, control pipeline allows dynamic control
• LUTs provide simple programmability
• Cells can be chained together to form a continuous pipe
FIR Filter Example
• Measures the system response to an input impulse
• Coefficients used to scale the input
• Running sum determines the total
FIR Filter Example (cont.)
• Chain multiple taps together (one multiplier per tap)
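A minimal software model of the computation the two FIR slides describe: each tap scales a delayed copy of the input by its coefficient, and the running sum over the tap chain is the output. Coefficients and samples are made-up values.

```python
# Direct-form FIR: y[n] = sum_k c[k] * x[n - k], one multiplier per tap.

def fir(samples, coeffs):
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):       # walk the tap chain
            if n - k >= 0:
                acc += c * samples[n - k]    # running sum of scaled inputs
        out.append(acc)
    return out

# Impulse followed by other data: the first outputs trace out the coefficients.
print(fir([1, 0, 0, 0, 2, 3], [4, 3, 2, 1]))
```

In the RaPiD mapping each tap becomes one cell in the linear array, with the partial sum pipelined from cell to cell.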
MATRIX
• DeHon and Mirsky – MIT
• 2-dimensional array of ALUs
• Each Basic Functional Unit contains a “processor” (ALU + SRAM)
• Ideal for systolic and VLIW computation
• 8-bit computation
• Forerunner of the Silicon Spice product
Basic Functional Unit
• Two inputs from adjacent blocks
• Local memory for instructions, data
MATRIX Interconnect
• Near-neighbor and quad connectivity
• Pipelining registers at ALU inputs
• Data transferred in 8-bit groups
• Interconnect itself is not pipelined
Functional Unit Inputs
• Each ALU input can come from several sources
• Note that source is locally configurable based on data values
FIR Filter Example
• For a K-weight filter, 4K cells are needed (figure shows K = 8)
• One result every 2 cycles
• K/2 8×8 multiplies per cycle
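The slide's resource/throughput numbers as a tiny helper (directly restating the bullets, nothing derived beyond them):

```python
# MATRIX FIR cost from the slide: 4K cells for a K-weight filter,
# K/2 8x8 multiplies per cycle, one result every 2 cycles.

def matrix_fir_cost(K):
    return {"cells": 4 * K,
            "multiplies_per_cycle": K / 2,
            "cycles_per_result": 2}

print(matrix_fir_cost(8))   # the slide's K = 8 instance
```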
Chess
• HP Labs – Bristol, England
• 2-D array – similar to MATRIX
• Contains more “FPGA-like” routing resources
• No reported software or application results
• Doesn’t support incremental compilation
Chess Interconnect
• More like an FPGA
• Takes advantage of near-neighbor connectivity
Chess Basic Block
• Switchbox memory can be used as storage
• ALU core for computation
Reconfigurable Architecture Workstation
• MIT Computer Architecture Group
• Full RISC processor used as the processing element
• Routing decoupled into switch mode
• Parallelizing compiler used to distribute the workload
• Large amount of memory per tile
RAW Tile
• Full functionality in each tile
• Static router handles near-neighbor communication
RAW Datapath
Raw Compiler
• Parallelizes computation across multiple tiles
• Orchestrates communication between tiles
• Some dynamic (data-dependent) routing possible
Summary
• Architectures moving in the direction of coarse-grained blocks
• Latest trend is the functional pipeline
• Communication determined at compile time
• Software support still a major issue