CprE / ComS 583 – Reconfigurable Computing
Prof. Joseph Zambreno, Department of Electrical and Computer Engineering, Iowa State University
Lecture #23 – Function Unit Architectures
Quick Points
• Next week Thursday: project status updates
• 10 minute presentations per group + questions
• Upload to WebCT by the previous evening
• Expected that you’ve made some progress!
Allowable Schedules
Active LUTs (NA) = 3
Sequentialization
• Adding time slots
• More sequential (more latency)
• Adds slack
• Allows better balance
L = 4, NA = 2 (4 or 3 contexts)
Multicontext Scheduling
• “Retiming” for multicontext
• Goal: minimize peak resource requirements
• NP-complete
• List schedule, anneal (a greedy sketch follows below)
• How do we accommodate intermediate data?
• Effects?
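To make the list-schedule idea concrete, here is a minimal greedy sketch (not the lecture's tool): it places each LUT in the earliest context its inputs allow and reports the peak active-LUT count, the quantity the scheduler tries to minimize. The netlist is a made-up example.

```python
# Minimal list-scheduling sketch (illustrative only): place each LUT in the
# earliest context its predecessors allow, then report the peak number of
# active LUTs in any one context. The netlist below is hypothetical.
from collections import defaultdict

def list_schedule(netlist):
    """netlist: {lut: [predecessor luts]} (a DAG); returns {lut: context index}."""
    context = {}
    remaining = dict(netlist)
    while remaining:
        for lut, preds in list(remaining.items()):
            if all(p in context for p in preds):
                # Earliest context that respects all dependences (ASAP).
                context[lut] = 1 + max((context[p] for p in preds), default=-1)
                del remaining[lut]
    return context

def peak_active_luts(context):
    per_ctx = defaultdict(int)
    for c in context.values():
        per_ctx[c] += 1
    return max(per_ctx.values())

# Hypothetical diamond-shaped dependence graph.
netlist = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["c"], "f": ["d", "e"]}
ctx = list_schedule(netlist)
print(ctx, "peak active LUTs:", peak_active_luts(ctx))
```

A real scheduler would also exploit the slack created by extra contexts (previous slide) and could anneal assignments to lower the peak further; this ASAP version only respects dependences.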
Signal Retiming
• Non-pipelined
• Hold value on LUT output (wire) from production through consumption
• Wastes wires and switches by occupying them
• For the entire critical path delay L
• Not just for the 1/L’th of the cycle it takes to cross the wire segment
• How will it show up in multicontext?
Signal Retiming
• Multicontext equivalent
• Need a LUT to hold the value for each intermediate context (a small counting sketch follows)
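A rough way to see this retiming cost, under the assumption that a value must occupy a LUT in every context strictly between its production and its consumption:

```python
# Counts holding slots for multicontext retiming: a value produced in context
# i and consumed in context j is assumed to occupy a LUT in contexts
# i+1 .. j-1. Edge list and context assignment are hypothetical.

def retiming_slots(edges, context):
    """edges: (producer, consumer) pairs; context: {lut: context index}."""
    total = 0
    for prod, cons in edges:
        total += max(context[cons] - context[prod] - 1, 0)
    return total

context = {"a": 0, "b": 0, "c": 1, "f": 3}
edges = [("a", "c"), ("b", "c"), ("a", "f")]   # "a" -> "f" skips contexts 1 and 2
print(retiming_slots(edges, context))           # -> 2 holding slots
```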
Full ASCII→Hex Circuit
• Logically three levels of dependence
• Single context: 21 LUTs @ 880 Kλ² = 18.5 Mλ²
Multicontext Version
• Three contexts: 12 LUTs @ 1,040 Kλ² = 12.5 Mλ²
• Pipelining needed for dependent paths
ASCII→Hex Example
• All retiming on wires (active outputs)
• Saturation based on inputs to the largest stage
• With enough contexts only one LUT is needed
• Increased LUT area due to additional stored configuration information
• Eventually the additional interconnect savings are taken up by LUT configuration overhead
• Ideal = perfect scheduling spread + no retime overhead
ASCII→Hex Example (cont.)
• @ depth = 4, c = 6: 5.5 Mλ² (compare 18.5 Mλ²)
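The slide's area figures follow from a one-line model, total area = active LUTs × per-LUT area, where the per-LUT areas already include the extra configuration SRAM; the numbers below are taken directly from the slides.

```python
# Reproduces the ASCII->Hex area arithmetic from the slides.
K = 1e3   # Klambda^2
M = 1e6   # Mlambda^2

single_context = 21 * 880 * K     # 21 LUTs @ 880 Klambda^2
three_contexts = 12 * 1040 * K    # 12 LUTs @ 1040 Klambda^2 (larger: stored contexts)

print(single_context / M)   # ~18.5 Mlambda^2
print(three_contexts / M)   # ~12.5 Mlambda^2
```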
General Throughput Mapping
• If we only want to achieve limited throughput
• Target: produce a new result every t cycles
• Spatially pipeline every t stages (cycle = t)
• Retime to minimize register requirements
• Multicontext evaluation within a spatial stage
• Retime (list schedule) to minimize resource usage
• Map for depth (i) and contexts (c) (sketch below)
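A small sketch of the recipe above (retiming omitted, level sizes made up): cut the circuit into spatial stages of at most t logic levels, and evaluate the levels inside each stage across contexts.

```python
# Sketch of general throughput mapping: one new result every t cycles means
# spatial pipeline stages of up to t logic levels, with the levels inside a
# stage evaluated in successive contexts. Level sizes are hypothetical.

def throughput_map(levels, t):
    """levels: LUT count per logic level (input to output); t: cycles per result."""
    stages = [levels[i:i + t] for i in range(0, len(levels), t)]
    plan = []
    for s, stage_levels in enumerate(stages):
        plan.append({"stage": s,
                     "contexts": len(stage_levels),        # one context per level
                     "active_luts": max(stage_levels)})    # peak within the stage
    return plan

# Hypothetical 8-level circuit, new input accepted every t = 4 cycles.
print(throughput_map([6, 9, 7, 4, 5, 8, 3, 2], t=4))
```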
Benchmark Set
• 23 MCNC circuits
• Area mapped with SIS and Chortle
Area v. Throughput
Area v. Throughput (cont.)
Reconfiguration for Fault Tolerance
• Embedded systems require high reliability in the presence of transient or permanent faults
• FPGAs contain substantial redundancy
• Possible to dynamically “configure around” problem areas
• Numerous on-line and off-line solutions
Column Based Reconfiguration
• Huang and McCluskey
• Assume that each FPGA column is equivalent in terms of logic and routing
• Preserve empty columns for future use
• Somewhat wasteful
• Precompile and compress differences in bitstreams
Column Based Reconfiguration (cont.)
• Create multiple copies of the same design with different unused columns
• Only requires different inter-block connections
• Can lead to an unreasonable configuration count (selection sketch below)
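A toy illustration of the multiple-copies idea: precompile one bitstream variant per spare column and, once a faulty column is diagnosed, load the variant that leaves that column unused. The file names are placeholders.

```python
# Precompiled variants of the same design, each leaving a different column
# unused (placeholder file names).
configs = {
    0: "design_spare_col0.bit",
    1: "design_spare_col1.bit",
    2: "design_spare_col2.bit",
    3: "design_spare_col3.bit",
}

def select_config(faulty_column):
    """Pick the variant whose spare column matches the diagnosed fault."""
    try:
        return configs[faulty_column]
    except KeyError:
        raise ValueError(f"no precompiled variant spares column {faulty_column}")

print(select_config(2))   # -> design_spare_col2.bit
```

The "unreasonable configuration count" concern is simply that this table grows with the number of columns; the next slide's approach stores only the compressed differences between variants instead.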
Column Based Reconfiguration (cont.)
• Determining differences and compressing the results leads to “reasonable” overhead
• Scalability and fault diagnosis are issues
Summary
• In many cases we cannot profitably reuse logic at the device cycle rate
• Cycles, no data parallelism
• Low throughput, unstructured
• Dissimilar, data-dependent computations
• These cases benefit from having more than one instruction/operation per active element
• Economical retiming becomes important here to achieve active LUT reduction
• For c = [4, 8], I = [4, 6], automatically mapped designs are 1/2 to 1/3 of the single-context size
Outline
• Continuation
• Function Unit Architectures
   • Motivation
   • Various architectures
   • Device trends
Coarse-grained Architectures
• DP-FPGA
   • LUT-based
   • LUTs share configuration bits
• RaPiD
   • Specialized ALUs, multipliers
   • 1-D pipeline
• MATRIX
   • 2-D array of ALUs
• Chess
   • Augmented, pipelined matrix
• RAW
   • Full RISC core as basic block
   • Static scheduling used for communication
DP-FPGA
• Break FPGA into datapath and control sections
• Save storage for LUTs and connection transistors
• Key issue is grain size
• Cherepacha/Lewis – U. Toronto
Configuration Sharing
[Figure: configuration sharing in the DP-FPGA datapath. MC = LUT SRAM bits, CE = connection block pass transistors; design point: set MC = 2-3× CE.]
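A back-of-the-envelope model of what the figure is getting at, under my reading that the LUT SRAM bits (MC) are shared across an N-bit datapath slice while the connection-block configuration (CE) stays per bit:

```python
# Configuration bits for an N-bit slice, with and without sharing the LUT
# SRAM bits (MC) across the bit positions. CE stays per bit in both cases.
# The sharing assumption is my reading of the figure, not a quoted formula.

def config_bits(N, MC, CE, shared):
    return CE * N + (MC if shared else MC * N)

MC, CE = 3.0, 1.0      # slide's rule of thumb: MC = 2-3x CE
for N in (1, 4, 8):
    print(N, config_bits(N, MC, CE, shared=False), config_bits(N, MC, CE, shared=True))
```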
Two-dimensional Layout
• Control network supports distributed signals
• Data routed as four-bit values
DP-FPGA Technology Mapping
• Ideal case would be if all datapath widths were divisible by 4, with no “irregularities”
• Area improvement includes logic values only
• Shift logic included
RaPiD
• Reconfigurable Pipelined Datapath
• Ebeling – University of Washington
• Uses hard-coded functional units (ALU, memory, multiplier)
• Good for signal processing
• Linear array of processing elements
RaPiD Datapath
• Segmented linear architecture
• All RAMs and ALUs are pipelined
• Bus connectors also contain registers
RaPiD Control Path
• In addition to static control, control pipeline allows dynamic control
• LUTs provide simple programmability
• Cells can be chained together to form a continuous pipe
FIR Filter Example
• Measures the system response to an input impulse
• Coefficients used to scale the input
• Running sum determines the total
FIR Filter Example (cont.)
• Chain multiple taps together (one multiplier per tap)
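A minimal software model of the computation the two FIR slides describe: each tap scales a delayed copy of the input by its coefficient, and the running sum over the tap chain is the output. Coefficients and samples are made-up values.

```python
# Direct-form FIR: y[n] = sum_k c[k] * x[n - k], one multiplier per tap.

def fir(samples, coeffs):
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):       # walk the tap chain
            if n - k >= 0:
                acc += c * samples[n - k]    # running sum of scaled inputs
        out.append(acc)
    return out

# Impulse followed by other data: the first outputs trace out the coefficients.
print(fir([1, 0, 0, 0, 2, 3], [4, 3, 2, 1]))
```

In the RaPiD mapping each tap becomes one cell in the linear array, with the partial sum pipelined from cell to cell.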
MATRIX
• DeHon and Mirsky – MIT
• 2-dimensional array of ALUs
• Each Basic Functional Unit contains a “processor” (ALU + SRAM)
• Ideal for systolic and VLIW computation
• 8-bit computation
• Forerunner of the Silicon Spice product
Basic Functional Unit
• Two inputs from adjacent blocks
• Local memory for instructions, data
MATRIX Interconnect
• Near-neighbor and quad connectivity
• Pipelining registers at ALU inputs
• Data transferred in 8-bit groups
• Interconnect itself is not pipelined
Functional Unit Inputs
• Each ALU input can come from several sources
• Note that source is locally configurable based on data values
FIR Filter Example
• For a K-weight filter, 4K cells are needed (figure shows K = 8)
• One result every 2 cycles
• K/2 8×8 multiplies per cycle
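The slide's resource/throughput numbers as a tiny helper (directly restating the bullets, nothing derived beyond them):

```python
# MATRIX FIR cost from the slide: 4K cells for a K-weight filter,
# K/2 8x8 multiplies per cycle, one result every 2 cycles.

def matrix_fir_cost(K):
    return {"cells": 4 * K,
            "multiplies_per_cycle": K / 2,
            "cycles_per_result": 2}

print(matrix_fir_cost(8))   # the slide's K = 8 instance
```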
Chess
• HP Labs – Bristol, England
• 2-D array – similar to MATRIX
• Contains more “FPGA-like” routing resources
• No reported software or application results
• Doesn’t support incremental compilation
Chess Interconnect
• More like an FPGA
• Takes advantage of near-neighbor connectivity
Chess Basic Block
• Switchbox memory can be used as storage
• ALU core for computation
Reconfigurable Architecture Workstation
• MIT Computer Architecture Group
• Full RISC processor used as the processing element
• Routing decoupled into switch mode
• Parallelizing compiler used to distribute the workload
• Large amount of memory per tile
RAW Tile
• Full functionality in each tile
• Static router handles near-neighbor communication
RAW Datapath
Raw Compiler
• Parallelizes computation across multiple tiles
• Orchestrates communication between tiles
• Some dynamic (data-dependent) routing possible
Summary
• Architectures moving in the direction of coarse-grained blocks
• Latest trend is the functional pipeline
• Communication determined at compile time
• Software support still a major issue