EECC722 - Shaaban #1 lec # 9 Fall 2004 10-13-2004 Computing System Element Choices Specialization, Development cost/time Performance/Chip Area Programmability

Computing System Element Choices

[Figure: spectrum of computing element choices]

• General Purpose Processors (GPPs): Superscalar, VLIW
• Application Specific Processors: DSPs, Network Processors, Graphics Processors
• Co-Processors
• Re-configurable Hardware (Reconfigurable Computing): FPGAs, Micro-coded Arrays
• ASICs

Axes: Programmability / Flexibility is highest at the GPP end; Specialization, Development cost/time, and Performance/Chip Area are highest at the ASIC end.

Computing Element Choices Observation

• Generality and efficiency are in some sense inversely related to one another:

– The more general-purpose a computing element is and thus the greater the number of tasks it can perform, the less efficient it will be in performing any of those specific tasks.

– Design decisions are therefore almost always compromises; designers identify key features or applications for which competitive efficiency is a must.

• To counter the problem of computationally intense problems for which general purpose machines cannot achieve the necessary performance:

– Special-purpose processors, attached processors, and coprocessors have been built for many years, especially in such areas as image or signal processing (for which many of the computational tasks can be very well defined).

– The problem with such machines is that they are special-purpose; as problems change or new ideas and techniques develop, their lack of flexibility makes them problematic as long-term solutions.

• Reconfigurable computing using FPGAs (Field Programmable Gate Arrays, first introduced in 1986 by Xilinx) or other reconfigurable hardware can offer an attractive alternative to other computing element choices.

What is Reconfigurable Computing?

• Utilizes reconfigurable hardware devices (spatially-programmed connections of hardware processing elements) tailored to the application:

• Customizing hardware computation to a particular application by changing hardware functionality on the fly.

• Reconfigurable Computing Goal: Using reconfigurable hardware devices to build systems with advantages over conventional computing solutions in terms of:

- Flexibility - Performance - Power - Time-to-market - Life cycle cost

“Hardware” customized to specifics of problem.

Direct map of problem specific dataflow, control.

Circuits “adapted” as problem requirements change.

Spatial vs. Temporal Computing

[Figure: a computation implemented spatially vs. temporally; the temporal side shows a processor and its instructions]

Computing Element Programmability: Defining Terms

Fixed Function:
• Computes one function (e.g. FP-multiply, divider, DCT)
• Function defined at fabrication time

Programmable:
• Computes “any” computable function (e.g. Processor, DSPs, FPGAs)
• Function defined after fabrication

Parameterizable Hardware:
• Performs limited “set” of functions (e.g. Co-Processors)

Conventional Programmable Processors vs. Configurable Devices

Conventional Programmable Processors:

• Moderately wide datapaths which have been growing larger over time (e.g. 16, 32, 64, 128 bits).

• Support for large on-chip instruction caches which have also been growing larger over time and can now hold hundreds to thousands of instructions.

• High bandwidth instruction distribution so that several instructions may be issued per cycle, at the cost of dedicating considerable die area to instruction distribution.

• A single thread of computation control. (SMT changes this)

Configurable devices (such as FPGAs):

• Narrow datapath (e.g. almost always one bit),

• On-chip space for only one instruction per compute element -- i.e. the single instruction which tells the FPGA array cell what function to perform and how to route its inputs and outputs.

• Minimal die area dedicated to instruction distribution such that it takes hundreds of thousands of compute cycles to change the active set of array instructions.

• Can handle regular and bit-level computations more efficiently than a processor.

Why Reconfigurable Computing?

• To improve performance over a software implementation.
  – e.g. signal processing apps in configurable hardware.

• Provide powerful, application-specific operations.

• To improve product flexibility and development cost/time compared to hardware (ASIC)
  – e.g. encryption, compression or network protocols handling in configurable hardware

• To use the same hardware for different purposes at different points in the computation (lowers cost).
  – Given sufficient use of each configuration to tolerate reconfiguration time/overheads
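
The last point (a configuration must be used often enough to tolerate its reconfiguration overhead) can be made concrete with a small break-even calculation. This is only a sketch: the function name and the timing numbers are illustrative assumptions, not measurements from any real system.

```python
def break_even_uses(t_software, t_hardware, t_reconfig):
    """Smallest n with n * t_hardware + t_reconfig < n * t_software.

    All times are integers in the same unit (microseconds here).
    """
    if t_hardware >= t_software:
        raise ValueError("configured hardware must be faster per use than software")
    # n * (t_software - t_hardware) must strictly exceed t_reconfig.
    return t_reconfig // (t_software - t_hardware) + 1


# Hypothetical numbers: 1000 us per use in software, 100 us per use in
# configured hardware, and 1 second (1,000,000 us) to load the configuration.
uses_needed = break_even_uses(t_software=1000, t_hardware=100, t_reconfig=1_000_000)
print(uses_needed)  # 1112
```

With these assumed numbers, a configuration that is used fewer than 1112 times before being swapped out would have been cheaper to run in software.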

Benefits of Reconfigurable Logic Devices

• Non-permanent customization and application development after fabrication
  – “Late Binding”

• Economies of scale (amortize large, fixed design costs)

• Time-to-market (dealing with evolving requirements and standards, new ideas)

Disadvantages

• Efficiency penalty (area, performance, power)

• Correctness Verification

Spatial/Configurable Hardware Benefits

• 10x raw density advantage over processors

• Potential for fine-grained (bit-level) control --- can offer another order of magnitude benefit.

• Locality.

Spatial/Configurable Drawbacks

• Each compute/interconnect resource dedicated to single function

• Must dedicate resources for every computational subtask

• Infrequently needed portions of a computation sit idle --> inefficient use of resources

Configurable Computing Application Areas

• Digital signal processing (except FFT?)
• Encryption
• Image processing
• Telemetry Data processing (remote-sensing)
• Data/Image/Video compression/decompression
• Low-power (through hardware "sharing")
• Scientific/Engineering physical system modeling (e.g. finite-element computations).
• Network applications (e.g. reconfigurable routers)
• Variable precision arithmetic
• Logic-intensive applications
• In-the-field hardware enhancements
• Adaptive (learning) hardware elements
• Rapid system prototyping
• Verification of processor and ASIC designs
• ...

Technology Trends Driving Configurable Computing

• Increasing gap between "peak" performance of general-purpose processors and "average actually achieved" performance.
  – Most programmers don't write code that gets anywhere near the peak performance of current superscalar CPUs
• Improvements in FPGA hardware: capacity and speed:
  – FPGAs use standard SRAM processes and "ride the commodity technology" curve
  – Volume pricing even though customized solution
• Improvements in synthesis and FPGA mapping/routing software
• Increasing number of transistors on a (processor) chip: How to use them all efficiently?
  – Bigger caches (Most popular)?
  – Multiple processor cores? (Chip Multiprocessors - CMPs)
  – SMT support?
  – IRAM-style vector/memory?
  – DSP cores or other application specific processors?
  – Reconfigurable logic (FPGA or other reconfigurable logic)?
  – A Combination of the above choices?

Configurable Computing Architectures

• Configurable Computing architectures combine elements of general-purpose computing and application-specific integrated circuits (ASICs).

– The general-purpose processor operates with fixed circuits that perform multiple tasks under the control of software.

– An ASIC contains circuits specialized to a particular task and thus needs little or no software to instruct it.

• The configurable computer can execute software commands that alter its configurable devices (e.g. FPGA circuits) as needed to perform a variety of jobs.

Hybrid-Architecture Computer

• Combines general-purpose processors (GPPs) and reconfigurable devices (commonly FPGA chips, or micro-coded arrays of simple processors).
• A controller FPGA loads circuit configurations stored in the memory onto the processor FPGA in response to the requests of the operating program.
• If the memory does not contain a requested circuit, the processor FPGA sends a request to the PC host, which then loads the configuration for the desired circuit.
• Common Hybrid Configurable Architecture Today:
  – One or more FPGAs on board connected to host via I/O bus (e.g. PCI)
• Possible Future Hybrid Configurable Architecture:
  – Integrate a region of configurable hardware (FPGA or something else) onto the processor chip itself as reconfigurable functional units or coprocessors
  – Integrate configurable hardware onto DRAM chip => Flexible computing without memory bottleneck
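
The configuration-loading behaviour described above (serve a requested circuit from board memory, fall back to the PC host when it is missing) can be sketched in software. Everything here is a hypothetical model: the class names, the byte strings standing in for bitstreams, and the choice to keep a fetched configuration in board memory afterwards are assumptions made for illustration. The sketch also folds the host request into the controller for brevity.

```python
class HostPC:
    """Stands in for the PC host, which holds every configuration."""

    def __init__(self, library):
        self.library = dict(library)
        self.requests = 0  # how many times the board had to ask the host

    def fetch(self, name):
        self.requests += 1
        return self.library[name]


class ControllerFPGA:
    """Loads a named circuit configuration onto the processor FPGA."""

    def __init__(self, host, board_memory=None):
        self.host = host
        self.board_memory = dict(board_memory or {})
        self.active = None  # name of the circuit currently loaded

    def load(self, name):
        if name not in self.board_memory:
            # Requested circuit is not in board memory: ask the PC host.
            self.board_memory[name] = self.host.fetch(name)
        self.active = name
        return self.board_memory[name]


host = HostPC({"fir": b"\x01\x02", "fft": b"\x03\x04"})
controller = ControllerFPGA(host, board_memory={"fir": b"\x01\x02"})
controller.load("fir")  # served from board memory
controller.load("fft")  # miss: fetched from the host
controller.load("fft")  # now served from board memory
print(host.requests)    # 1
```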

Hybrid-Reconfigurable Computer: Levels of Coupling

[Figure: Different levels of coupling in a hybrid reconfigurable system. Reconfigurable logic is shaded.]

• Reconfigurable functional units (on chip)
• Reconfigurable coprocessor (on or off chip)
• Attached (e.g. via PCI) reconfigurable processing unit (most common today)
• External standalone processing unit (e.g. via network/IO interface)

Other figure labels: Future direction; ISA Support; Function Calls

Sample Configurable Computing Application: Prototype Video Communications System

• Uses a single FPGA to perform four functions that typically require separate chips.

• A memory chip stores the four circuit configurations and loads them sequentially into the FPGA.

• Initially, the FPGA's circuits are configured to acquire digitized video data.

• The chip is then rapidly reconfigured to transform the video information into a compressed form and reconfigured again to prepare it for transmission.

• Finally, the FPGA circuits are reconfigured to modulate and transmit the video information.

• At the receiver, the four configurations are applied in reverse order to demodulate the data, uncompress the image and then send it to a digital-to-analog converter so it can be displayed on a television screen.

Early Configurable Computing Successes

• DEC Programmable Active Memories, PAM (1992):
  – A universal hardware FPGA-based co-processor closely coupled to a standard host computer, developed at DEC's Paris Research Laboratory
  – Fast RSA decryption implementation on a reconfigurable machine (10x faster than the fastest ASIC at the time)

• Splash2 (1993):
  – Attached Processor System using Xilinx FPGAs as processing elements, developed at Center for Computing Sciences.
  – Performs DNA sequence matching at 300x Cray2 speed, and 200x the speed of a 16K Thinking Machines CM2

• Many modern processors and ASICs are verified using FPGA emulation systems

• For many digital signal processing/filtering (e.g. FIR, IIR) algorithms, single chip FPGAs outperform DSPs by 10-100x.

Programmable Circuitry: FPGAs

• Field-Programmable Gate Array (FPGA) introduced by Xilinx (1986)

• Programmable circuits can be created or removed by sending signals to gates in the logic elements.

• A built-in grid of circuits arranged in columns and rows allows the designer to connect a logic element to other logic elements or to an external memory or microprocessor.

• The logic elements are grouped in blocks that perform basic binary operations such as AND, OR and NOT

• Several firms, including Xilinx and Altera, have developed devices with the capability of 4,000,000 or more equivalent gates.

Field Programmable Gate Arrays (FPGAs)

• Chip contains many small building blocks that can be configured to implement different functions.
  – These building blocks are known as CLBs (Configurable Logic Blocks)

• FPGAs typically "programmed" by having them read in a stream of configuration information from off-chip
  – Typically in-circuit programmable (as opposed to EPLDs, Electrically Programmable Logic Devices, which are typically programmed by removing them from the circuit and using a PROM programmer)

• 25% of an FPGA's gates are application-usable
  – The rest control the configurability, interconnects, etc.

• As much as 10X clock rate degradation compared to fully custom hardware implementations (ASICs)

• Typically built using SRAM fabrication technology
• Since FPGAs "act" like SRAM or logic, they lose their program when they lose power.
• Configuration bits need to be reloaded on power-up.
• Usually reloaded from a PROM, or downloaded from memory via an I/O bus.

Look-Up Table (LUT)

In   Out
00   0
01   1
10   1
11   0

[Figure: 2-LUT; a memory (Mem) addressed by inputs In1 and In2 produces Out]

• K-LUT -- K input lookup table

• Any function of K inputs by programming table
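
A K-LUT can be modelled in a few lines: the programmed table holds one output bit per input combination, and the inputs simply form the address. The sketch below is an illustrative model (the `LUT` class and its interface are assumptions, not a vendor API); the XOR contents match the 2-LUT truth table on this slide.

```python
class LUT:
    """K-input lookup table: 2**k stored bits, addressed by the inputs."""

    def __init__(self, k, table):
        if len(table) != 2 ** k:
            raise ValueError("a K-LUT needs exactly 2**K table entries")
        self.k = k
        self.table = list(table)

    def __call__(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:  # first input is the most significant address bit
            address = (address << 1) | (bit & 1)
        return self.table[address]


# Same hardware structure, different programmed contents.
xor2 = LUT(2, [0, 1, 1, 0])  # rows 00, 01, 10, 11 of the truth table
and2 = LUT(2, [0, 0, 0, 1])

print([xor2(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
print([and2(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
```

Changing the function means rewriting the table, not the structure, which is why any function of K inputs fits in the same block.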

Conventional FPGA Tile

K-LUT (typical k=4) w/ optional output Flip-Flop

A Generic Island-style FPGA Routing Architecture

XC4000 Configurable Logic Blocks (CLB)

Cascaded 4 LUTs (2 4-LUTs -> 1 3-LUT)

XC4000 Interconnect

FPGAs vs. RISC Processors: Computational Density Comparison

Processor vs. FPGA Area

[Figure: area comparison of an FPGA and a processor]

Programming/Configuring FPGAs

• Software (vendor supplied device-specific tools) converts a design to netlist format.
  – Partition the design into logic blocks
  – Then find a good placement for each block and routing between them

• Then a serial bitstream is generated and fed down to the FPGAs themselves

• The configuration bits are loaded into a "long shift register" on the FPGA.

• The output lines from this shift register are control wires that control the behavior of all the CLBs on the chip.
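
The "long shift register" idea can be illustrated with a toy model: bits arrive serially, and once loading is complete the register contents are read out in parallel as per-block configuration. This is a sketch under simplifying assumptions (the class name, the 4-bits-per-LUT grouping, and the bit ordering are all hypothetical; real devices have their own bitstream formats).

```python
class ConfigShiftRegister:
    """Serial-in, parallel-out register holding all configuration bits."""

    def __init__(self, length):
        self.bits = [0] * length

    def shift_in(self, bit):
        # New bit enters at position 0; the oldest bit falls off the far end.
        self.bits = [bit & 1] + self.bits[:-1]

    def load(self, bitstream):
        for bit in bitstream:
            self.shift_in(bit)

    def lut_tables(self, bits_per_lut):
        # Parallel outputs, grouped into one truth table per logic block.
        return [self.bits[i:i + bits_per_lut]
                for i in range(0, len(self.bits), bits_per_lut)]


# Desired contents: block 0 = XOR table, block 1 = AND table (two 2-LUTs).
desired = [0, 1, 1, 0] + [0, 0, 0, 1]
register = ConfigShiftRegister(len(desired))
# The first bit sent travels furthest, so the stream is the reverse
# of the desired register contents.
register.load(reversed(desired))
print(register.lut_tables(4))  # [[0, 1, 1, 0], [0, 0, 0, 1]]
```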

Programming/Configuring FPGAs

[Figure: tool flow] RTL -> Tech. Indep. Optimization -> LUT Mapping -> Placement -> Routing -> Bitstream Generation -> Config. Data

Reconfigurable Processor Tools Flow

[Figure: tools flow diagram. Boxes shown: Customer Application / IP (C code); C Compiler; ARC Object Code; Linker; Executable; C Model Simulator; C Debugger; RTL HDL; Synthesis & Layout; Configuration Bits; Development Board]

Programming/Configuring FPGAs: Starting Point

• RTL
  – t=A+B
  – Reg(t,C,clk);

• Logic
  – $O_i = A_i \oplus B_i \oplus C_i$
  – $C_{i+1} = A_i B_i \lor B_i C_i \lor A_i C_i$
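
The logic-level equations on this slide are the sum and carry of a one-bit full adder: the output bit is the XOR of the two operand bits and the carry-in, and the carry-out is their majority. As a quick check that chaining these bit equations reproduces the RTL statement t=A+B, here is a small sketch; the function names and the 4-bit width are illustrative assumptions.

```python
def full_adder(a, b, c):
    """One bit slice: O = A xor B xor C, C_next = AB or BC or AC."""
    o = a ^ b ^ c
    c_next = (a & b) | (b & c) | (a & c)
    return o, c_next


def ripple_add(a, b, width):
    """Chain `width` full adders; returns (sum bits as an int, final carry)."""
    carry = 0
    total = 0
    for i in range(width):
        bit, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= bit << i
    return total, carry


# Exhaustive check for 4-bit operands: bit-level logic agrees with A + B.
for a in range(16):
    for b in range(16):
        total, carry = ripple_add(a, b, 4)
        assert total + (carry << 4) == a + b
print(ripple_add(9, 5, 4))  # (14, 0)
```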

Programming/Configuring FPGAs: LUT Mapping

Programming/Configuring FPGAs: Placement

• Maximize locality
  – minimize number of wires in each channel
  – minimize length of wires
  – (but, cannot put everything close)

• Often start by partitioning/clustering

• State-of-the-art finish via simulated annealing
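
To give a feel for placement by simulated annealing, here is a deliberately small sketch. It is not the algorithm of any particular tool: the cost function (total Manhattan length of two-pin nets), the swap move, the geometric cooling schedule, and all parameter values are assumptions chosen for brevity.

```python
import math
import random


def wirelength(placement, nets):
    """Total Manhattan distance over two-pin nets; placement: block -> (x, y)."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def anneal_placement(blocks, nets, width, height,
                     steps=5000, t_start=5.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    n_sites = width * height
    if len(blocks) > n_sites:
        raise ValueError("not enough sites for all blocks")
    # One slot per grid site; None marks an empty site.
    slots = list(blocks) + [None] * (n_sites - len(blocks))
    rng.shuffle(slots)

    def to_placement(s):
        return {b: (i % width, i // width) for i, b in enumerate(s) if b is not None}

    cost = wirelength(to_placement(slots), nets)
    best_slots, best_cost = list(slots), cost
    for step in range(steps):
        # Geometric cooling from t_start toward t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        i, j = rng.sample(range(n_sites), 2)
        slots[i], slots[j] = slots[j], slots[i]  # propose: swap two sites
        delta = wirelength(to_placement(slots), nets) - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost += delta  # accept (uphill moves accepted with some probability)
            if cost < best_cost:
                best_slots, best_cost = list(slots), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]  # reject: undo the swap
    return to_placement(best_slots), best_cost


blocks = ["a", "b", "c", "d"]
nets = [("a", "b"), ("b", "c"), ("c", "d")]  # a simple chain of blocks
placement, cost = anneal_placement(blocks, nets, width=2, height=2)
print(placement, cost)
```

Production placers use incremental cost updates and richer cost terms (channel congestion, timing); the full recomputation here keeps the sketch short.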

Programming/Configuring FPGAs: Placement

Programming/Configuring FPGAs: Routing

• Often done in two passes
  – Global to determine channel
  – Detailed to determine actual wires and switches

• Difficulty is
  – limited channels
  – switchbox connectivity restrictions

Programming/Configuring FPGAs: Routing

Overall Configurable Hardware Approach

• Select critical portions of an application where hardware customizations will offer an advantage

• Map those application phases to FPGA hardware
  – hand-design
  – VHDL => synthesis

• If it doesn't fit in FPGA, re-select application phase (smaller) and try again.

• Perform timing analysis to determine rate at which configurable design can be clocked.

• Write interface software for communication between main processor and configurable hardware
  – Determine where input / output data communicated between software and configurable hardware will be stored
  – Write code to manage its transfer (like a procedure call interface in standard software)
  – Write code to invoke configurable hardware (e.g. memory-mapped I/O)

• Compile software (including interface code)

• Send configuration bits to the configurable hardware

• Run program.
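
The interface-software step (store the inputs, invoke the hardware through memory-mapped I/O, collect the output behind a procedure-call-like wrapper) might look roughly like the sketch below. The device here is a pure software stand-in: the register offsets, the start/done protocol, and the multiply function it "implements" are all hypothetical choices for illustration, not any real board's memory map.

```python
class SimulatedAccelerator:
    """Software stand-in for configurable hardware behind memory-mapped registers."""

    # Hypothetical register offsets; a real design defines its own memory map.
    OPERAND_A = 0x00
    OPERAND_B = 0x04
    CONTROL = 0x08
    STATUS = 0x0C
    RESULT = 0x10

    def __init__(self):
        self.regs = {offset: 0 for offset in (self.OPERAND_A, self.OPERAND_B,
                                              self.CONTROL, self.STATUS, self.RESULT)}

    def write(self, offset, value):
        self.regs[offset] = value & 0xFFFFFFFF
        if offset == self.CONTROL and value & 1:
            # Start bit written: the "hardware" computes and raises the done flag.
            product = self.regs[self.OPERAND_A] * self.regs[self.OPERAND_B]
            self.regs[self.RESULT] = product & 0xFFFFFFFF
            self.regs[self.STATUS] = 1

    def read(self, offset):
        return self.regs[offset]


def hw_multiply(dev, a, b):
    """Procedure-call style wrapper around the register protocol."""
    dev.write(dev.STATUS, 0)      # clear any stale done flag
    dev.write(dev.OPERAND_A, a)   # transfer inputs
    dev.write(dev.OPERAND_B, b)
    dev.write(dev.CONTROL, 1)     # invoke the configurable hardware
    while dev.read(dev.STATUS) == 0:
        pass                      # poll until the done flag is set
    return dev.read(dev.RESULT)   # transfer output back


print(hw_multiply(SimulatedAccelerator(), 6, 7))  # 42
```

The caller sees an ordinary function; only the wrapper knows where operands and results live and how the hardware is started.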

Configurable Hardware Application Challenges

• This process turns applications programmers into:
  – Part-time hardware designers.
• Performance analysis problems => what should we put in hardware?
• Hardware-Software Co-design problem
• Choice and granularity of computational elements.
• Choice and granularity of interconnect network.
• Synthesis problems
• Testing/reliability problems.

The Levels of the Reconfigurable Computational Elements

[Figure: four levels of reconfigurable computational elements]

• Reconfigurable Logic: bit-level operations, e.g. encoding
  (figure labels: CLB, CLB, CLB, CLB)
• Reconfigurable Datapaths: dedicated data paths, e.g. filters, AGU
  (figure labels: adder, buffer, reg0, reg1, mux)
• Reconfigurable Arithmetic: arithmetic kernels, e.g. convolution
  (figure labels: MAC, In, AddrGen, Memory, AddrGen, Memory)
• Reconfigurable Control: configurable processors; Real-Time Operating Systems (RTOS): process management
  (figure labels: Data Memory, Instruction Decoder & Controller, Data Memory, Program Memory, Datapath)

Issues in Using FPGAs for Reconfigurable Computing

• Hardware-Software Partitioning (co-design)
• Run-time reconfiguration overhead
  – time to load configuration bitstream -- several seconds
• I/O bandwidth limitations
• Speed, power, cost, density (improving)
• High-level language support (improving)
• Performance, space estimators
• Design verification
• Partitioning and mapping across several FPGAs
• Partial reconfiguration
• Configuration caching.

Example Reconfigurable Computing Research

• PRISM (Brown)

• PRISC (Harvard) RC-1

• DPGA-coupled uP (MIT)

• GARP (RC-3), Pleiades, … (UCB)

• OneChip (Toronto) RC-2

• RAW (MIT) RC-4

• REMARC (Stanford) RC-5

• CHIMAERA RC-6 (Northwestern)

• DEC PAM

• Splash 2

• NAPA (NSC)

• E5 etc. (Triscend)

Hybrid-Architecture RC Compute Models

• Unaffected by array logic: Interfacing
  – Triscend E5

• Dedicated IO Processor
  – NAPA 1000

• Instruction Augmentation: (Tight Coupling)
  – Special Instructions / Coprocessor Ops
    - PRISM (Brown, 1991)
    - PRISC (Harvard, 1994)
    - Chimaera (Northwestern, 1997)
    - GARP (Berkeley, 1997)
  – VLIW/microcoded extension to processor
    - REMARC (Stanford, 1998)

• Autonomous co/stream processor
  – OneChip (Toronto, 1998)

Hybrid-Architecture RC Compute Models: Interfacing

• Logic used in place of

– ASIC environment customization

– External FPGA/PLD devices

• Example

– bus protocols

– peripherals

– sensors, actuators

• Case for:

– Always have some system adaptation to do

– Modern chips have capacity to hold processor + glue logic

– reduce part count

– Glue logic varies

– Value added must now be accommodated on chip (formerly board level)

Example: Interface/Peripherals

• Triscend E5

Hybrid-Architecture RC Compute Models: IO Processor

• Array dedicated to servicing IO channel

– sensor, LAN, WAN, peripheral

• Provides

– flexible protocol handling

– flexible stream computation
  • compression, encryption

• Looks like IO peripheral to processor

• Case for:

– many protocols, services

– only need few at a time

– dedicate attention, offload processor

NAPA 1000 Block Diagram

[Figure: block diagram. Blocks shown:]

• RPC: Reconfigurable Pipeline Cntr
• ALP: Adaptive Logic Processor
• System Port
• TBT: ToggleBus(TM) Transceiver
• PMA: Pipeline Memory Array
• CR32: CompactRISC(TM) 32 Bit Processor
• BIU: Bus Interface Unit
• CR32 Peripheral Devices
• External Memory Interface
• SMA: Scratchpad Memory Array
• CIO: Configurable I/O

NAPA 1000 as IO Processor

[Figure: NAPA 1000 shown with a System Host (via System Port), ROM & DRAM (via Memory Interface), and application-specific sensors, actuators, or other circuits (via CIO)]

Hybrid-Architecture RC Compute Models: Instruction Augmentation

• Observation: Instruction Bandwidth
  – Processor can only describe a small number of basic computations in a cycle
    • $I$ bits $\rightarrow 2^{I}$ operations
  – This is a small fraction of the operations one could do even in terms of $w \times w \rightarrow w$ ops
    • $w\,2^{2^{(2w)}}$ operations
  – Processor could have to issue $w\,2^{(2^{(2w)} - I)}$ operations (instructions) just to describe some computations
  – An a priori selected base set of functions (via ISA instructions) could be very bad for some applications

• Motivation for application-specific processors/ISAs

$I$: opcode size; $w$: operand word size
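
The instruction-bandwidth observation can be checked by brute force in the smallest case, w = 1: two 1-bit operands and one 1-bit result. The sketch below enumerates every such function and compares the count with what a given opcode size can name. The 2-bit opcode is an arbitrary illustrative choice, and the enumeration is only intended for this w = 1 case.

```python
from itertools import product

# All input combinations for two 1-bit operands: (0,0), (0,1), (1,0), (1,1).
input_pairs = list(product((0, 1), repeat=2))

# A function is one output bit per input combination, so each distinct
# truth table is a distinct operation.
truth_tables = list(product((0, 1), repeat=len(input_pairs)))


def nameable_operations(opcode_bits):
    """Number of distinct operations an opcode field of this size can select."""
    return 2 ** opcode_bits


opcode_bits = 2  # hypothetical opcode size I
print(len(truth_tables))                 # 16 possible operations for w = 1
print(nameable_operations(opcode_bits))  # 4 selectable with I = 2
```

Even at one bit of operand width, a 2-bit opcode reaches only a quarter of the possible operations; the gap widens very quickly as the word size grows.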

EECC722 - Shaaban #47 lec # 9 Fall 2004 10-13-2004

Hybrid-Architecture RC Compute Models: Instruction Augmentation

• Idea:
– Provide a way to augment the processor’s instruction set with operations needed by a particular application
– Close semantic gap / avoid mismatch
• What’s required:
– Some way to fit augmented instructions into stream
– Execution engine for augmented instructions
  • If programmable, has own instructions
– Interconnect to augmented instructions.

EECC722 - Shaaban #48 lec # 9 Fall 2004 10-13-2004

First Effort In Instruction Augmentation: PRISM (Brown, 1991)

• Processor Reconfiguration through Instruction Set Metamorphosis (PRISM)
• FPGA on bus
• Access as memory mapped peripheral
• Explicit context management
• PRISM-1
– 68010 (10MHz) + XC3090
– can reconfigure FPGA in one second
– 50-75 clocks for operations
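The PRISM-1 figures above imply a large amortization cost: one second of reconfiguration at 10 MHz is ten million processor clocks. The following sketch works through that arithmetic. The software cost per operation (`sw_cycles_per_op`) is a hypothetical input, since the slide does not give one, and the 75-clock default is simply the upper end of the quoted 50-75 range.

```python
import math

CLOCK_HZ = 10_000_000    # 68010 at 10 MHz (from the slide)
RECONFIG_SECONDS = 1.0   # FPGA reconfiguration time (from the slide)
HW_CYCLES_PER_OP = 75    # upper end of the 50-75 clock range (from the slide)


def reconfig_cycles() -> int:
    """Processor clocks that elapse during one FPGA reconfiguration."""
    return int(CLOCK_HZ * RECONFIG_SECONDS)


def break_even_calls(sw_cycles_per_op: int,
                     hw_cycles_per_op: int = HW_CYCLES_PER_OP):
    """Calls needed before the FPGA version repays one reconfiguration.

    sw_cycles_per_op is an assumed software cost, not a measured value.
    Returns None when the FPGA version is not faster per call.
    """
    saved = sw_cycles_per_op - hw_cycles_per_op
    if saved <= 0:
        return None
    return math.ceil(reconfig_cycles() / saved)


if __name__ == "__main__":
    print(reconfig_cycles())
    print(break_even_calls(1075))  # hypothetical 1075-clock software routine
```

Under these assumptions the operation must be called many thousands of times per configuration before reconfiguration pays off, which is why explicit context management matters in this model.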

EECC722 - Shaaban #49 lec # 9 Fall 2004 10-13-2004

PRISM-1 Results

Raw kernel speedups (IO configuration time not included?)

EECC722 - Shaaban #50 lec # 9 Fall 2004 10-13-2004

PRISC (Harvard, 1994)

• Takes next step
– What if we put it on chip?
– How to integrate into processor ISA?
• Architecture:
– Couple into register file as “superscalar” functional unit
– Flow-through array (no state)

Instruction Augmentation (paper RC-1)

EECC722 - Shaaban #51 lec # 9 Fall 2004 10-13-2004

PRISC ISA Integration

– Add expfu instruction (execute programmable functional unit) to MIPS ISA
– 11 bit address space for user defined expfu instructions
– Fault on pfu instruction mismatch
  • trap code to service instruction miss
– All operations occur in one clock cycle
– Easily works with processor context switch
  • no state + fault on mismatch pfu instruction
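The fault-on-mismatch behaviour can be pictured with a small software model. This is a toy sketch, not the PRISC hardware: the class name `PFUModel`, the `library` dictionary standing in for stored configurations, and the 32-bit result mask are all assumptions made for illustration. Only the 11-bit identifier, the statelessness, and the trap-on-mismatch come from the slide.

```python
class PFUModel:
    """Toy model of one programmable functional unit holding one function."""

    ID_BITS = 11  # size of the expfu function address space (from the slide)

    def __init__(self, library):
        # library: function id -> pure two-input function (hypothetical store)
        self.library = library
        self.loaded_id = None
        self.func = None
        self.misses = 0

    def expfu(self, func_id, rs, rt):
        """Execute the user-defined function on two register values."""
        if not 0 <= func_id < (1 << self.ID_BITS):
            raise ValueError("function id does not fit in 11 bits")
        if func_id != self.loaded_id:
            self._trap_and_reload(func_id)  # fault on mismatch
        # The PFU keeps no state, so the result depends only on rs and rt.
        return self.func(rs, rt) & 0xFFFFFFFF

    def _trap_and_reload(self, func_id):
        """Stand-in for the trap code that services an instruction miss."""
        self.misses += 1
        self.func = self.library[func_id]
        self.loaded_id = func_id


if __name__ == "__main__":
    pfu = PFUModel({3: lambda a, b: (a & b) ^ (a >> 1),
                    7: lambda a, b: a + b})
    print(pfu.expfu(7, 1, 2), pfu.misses)
    print(pfu.expfu(7, 5, 5), pfu.misses)  # same function: no new miss
    print(pfu.expfu(3, 6, 3), pfu.misses)  # different function: reload
```

Because the model keeps no state between calls, a context switch needs no save or restore: the next process simply faults on its first mismatching expfu and reloads its own function.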

EECC722 - Shaaban #52 lec # 9 Fall 2004 10-13-2004

PRISC Results

• All compiled

• working from MIPS binary

• <200 4LUTs ?

– 64x3

• 200MHz MIPS base

SPEC92

EECC722 - Shaaban #53 lec # 9 Fall 2004 10-13-2004

Chimaera (Northwestern, 1997)

• Start from PRISC idea
– Integrate as functional unit
– No state
– RFUOPs (like expfu)
– Stall processor on instruction miss, reload
• Add
– Manage multiple instructions loaded
– More than 2 inputs possible

Instruction Augmentation (paper RC-6)

EECC722 - Shaaban #54 lec # 9 Fall 2004 10-13-2004

Chimaera Architecture

• “Live” copy of register file values feed into array
• Each row of array may compute from register values or intermediates (other rows)
• Tag on array to indicate RFUOP

EECC722 - Shaaban #55 lec # 9 Fall 2004 10-13-2004

Chimaera Architecture

• Array can compute on values as soon as placed in register file
• Logic is combinational
• When RFUOP matches
– stall until result ready
  • critical path only from late inputs
– Drive result from matching row
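A rough software picture of the tag match described above, under simplifying assumptions: each RFUOP occupies exactly one row, replacement is round-robin, and the shadow register file has 32 entries. None of these details are given on the slides, and the class and method names are hypothetical.

```python
class ChimaeraArrayModel:
    """Toy model: tagged rows computing combinationally from shadow registers."""

    def __init__(self, num_rows, library, num_regs=32):
        self.rows = [None] * num_rows  # each entry is (tag, function) or None
        self.library = library         # RFUOP id -> function of the register list
        self.shadow = [0] * num_regs   # "live" copy of register values
        self.next_victim = 0           # round-robin replacement (assumption)
        self.stalls = 0

    def write_reg(self, index, value):
        """Register writes are visible to the array immediately."""
        self.shadow[index] = value

    def rfuop(self, tag):
        """Drive the result from the row whose tag matches; reload on a miss."""
        for entry in self.rows:
            if entry is not None and entry[0] == tag:
                return entry[1](self.shadow)
        # Miss: stall the processor, load the configuration, then compute.
        self.stalls += 1
        self.rows[self.next_victim] = (tag, self.library[tag])
        self.next_victim = (self.next_victim + 1) % len(self.rows)
        return self.library[tag](self.shadow)


if __name__ == "__main__":
    array = ChimaeraArrayModel(2, {
        1: lambda r: r[1] + r[2] + r[3],  # three inputs, more than a 2-input FU
        2: lambda r: r[1] ^ r[2],
    })
    for reg, value in ((1, 1), (2, 2), (3, 3)):
        array.write_reg(reg, value)
    print(array.rfuop(1), array.rfuop(2), array.rfuop(1), array.stalls)
```

With two rows both RFUOPs stay resident, so the third call hits without a further stall; this is the "manage multiple instructions loaded" addition over PRISC.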

EECC722 - Shaaban #56 lec # 9 Fall 2004 10-13-2004

GARP (Berkeley, 1997)

• Integrate as coprocessor
– Similar bandwidth to processor as FU
– Own access to memory
• Support multi-cycle operation
– Allow state
– Cycle counter to track operation
• Fast operation selection
– Cache for configurations
– Dense encodings, wide path to memory

Instruction Augmentation (paper RC-3)

EECC722 - Shaaban #57 lec # 9 Fall 2004 10-13-2004

GARP

• Augmented MIPS ISA: coprocessor operations
– Issue gaconfig to make a particular configuration resident (may be active or cached)
– Explicitly move data to/from array
  • 2 writes, 1 read (like FU, but not 2W+1R)
– Processor suspends during co-processor operation
  • Cycle count tracks operation
– Array may directly access memory
  • Processor and array share memory space
    – cache/mmu keeps memory consistent between them
  • Can exploit streaming data operations
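The coprocessor flow (configure, move data in, suspend while a cycle counter runs down, move the result out) can be sketched as below. This is a toy model and does not reproduce the real GARP instruction semantics: the class, the 4-entry configuration cache, the named registers, and the `run` method are assumptions, and `gaconfig`, `mtga`, `mfga` are borrowed only as method names.

```python
class GarpArrayModel:
    """Toy model of a stateful, multi-cycle reconfigurable coprocessor."""

    def __init__(self, configs, cache_slots=4):
        self.configs = configs          # name -> (step function, cycle count)
        self.cache_slots = cache_slots  # configuration cache size (assumption)
        self.cached = []                # resident configurations, oldest first
        self.active = None
        self.regs = {}                  # array state, kept across cycles
        self.config_loads = 0
        self.suspended_cycles = 0

    def gaconfig(self, name):
        """Make a configuration resident and active; load it only if not cached."""
        if name not in self.cached:
            self.config_loads += 1
            if len(self.cached) == self.cache_slots:
                self.cached.pop(0)
            self.cached.append(name)
        self.active = name
        self.regs = {}

    def mtga(self, reg, value):
        """Move a processor value to the array."""
        self.regs[reg] = value

    def run(self):
        """Processor is suspended while the cycle counter counts down."""
        step, counter = self.configs[self.active]
        while counter > 0:
            step(self.regs)
            counter -= 1
            self.suspended_cycles += 1

    def mfga(self, reg):
        """Move a result from the array back to the processor."""
        return self.regs.get(reg, 0)


def shift_add_step(regs):
    """One cycle of a shift-and-add multiplier; 'acc' is state held in the array."""
    if regs["b"] & 1:
        regs["acc"] = regs.get("acc", 0) + regs["a"]
    regs["a"] <<= 1
    regs["b"] >>= 1


if __name__ == "__main__":
    garp = GarpArrayModel({"mul8": (shift_add_step, 8)})
    garp.gaconfig("mul8")
    garp.mtga("a", 5)   # two writes in ...
    garp.mtga("b", 6)
    garp.run()
    print(garp.mfga("acc"), garp.suspended_cycles)  # ... one read out
```

The multiplier needs state across cycles, which a PRISC-style single-cycle, stateless unit cannot hold.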

EECC722 - Shaaban #58 lec # 9 Fall 2004 10-13-2004

GARP Processor Instructions

Augmented to MIPS ISA

EECC722 - Shaaban #59 lec # 9 Fall 2004 10-13-2004

GARP Array

• Row oriented logic
– Denser for datapath operations
• Dedicated path for processor/memory data
• Processor does not have to be involved in array memory path.

EECC722 - Shaaban #60 lec # 9 Fall 2004 10-13-2004

GARP Results

• General results

– 10-20x on stream, feed-forward operation

– 2-3x when data-dependencies limit pipelining

EECC722 - Shaaban #61 lec # 9 Fall 2004 10-13-2004

PRISC/Chimaera vs. GARP

• PRISC/Chimaera
– Basic op is single cycle: expfu (rfuop)
– No state
– could conceivably have multiple PFUs?
– Discover parallelism => run in parallel?
– Can’t run deep pipelines
• GARP
– Basic op is multicycle
  • gaconfig
  • mtga
  • mfga
– Can have state/deep pipelining
– Multiple arrays viable?
– Identify mtga/mfga with corresponding gaconfig?

EECC722 - Shaaban #62 lec # 9 Fall 2004 10-13-2004

Common Instruction Augmentation Features

• To get around instruction expression limits:
– Define new instruction in array
  • Many bits of config … broad expressibility
  • many parallel operators
– Give array configuration short “name” which processor can call out (augmented instructions)
  • …effectively the address of the operation

EECC722 - Shaaban #63 lec # 9 Fall 2004 10-13-2004

Hybrid-Architecture RC Compute Models: VLIW/microcoded Model

• Similar to instruction augmentation
• Single tag (address, instruction)
– controls a number of more basic operations
• Some difference in expectation
– can sequence a number of different tags/operations together

EECC722 - Shaaban #64 lec # 9 Fall 2004 10-13-2004

REMARC (Stanford, 1998)

• Array of “nano-processors”
– 16b, 32 instructions each
– VLIW like execution, global sequencer
• Coprocessor interface (similar to GARP)
– No direct array memory

VLIW/microcoded Model (paper RC-5)

EECC722 - Shaaban #65 lec # 9 Fall 2004 10-13-2004

REMARC Architecture

• Issue coprocessor rex
– global controller sequences nanoprocessors
– multiple cycles (microcode)
• Each nanoprocessor has own I-store (VLIW)
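One way to read "global sequencer plus per-nanoprocessor I-store" is that the controller steps a shared program counter and each nanoprocessor executes whatever its own store holds at that index. The sketch below models that reading. The shared-counter interpretation, the single accumulator per nanoprocessor, and all names are assumptions; only the 16-bit width and the 32-entry instruction store come from the slides.

```python
class NanoProcessorModel:
    """Toy 16-bit nanoprocessor with its own small instruction store."""

    MAX_INSTRUCTIONS = 32  # from the slide: 32 instructions each

    def __init__(self, istore):
        if len(istore) > self.MAX_INSTRUCTIONS:
            raise ValueError("instruction store holds at most 32 entries")
        self.istore = istore  # list of functions acc -> acc (hypothetical encoding)
        self.acc = 0

    def step(self, pc):
        """Execute this nanoprocessor's own instruction at the global counter."""
        self.acc = self.istore[pc](self.acc) & 0xFFFF  # 16-bit datapath


def rex(nanoprocessors, start_pc, num_cycles):
    """Global controller: sequence every nanoprocessor for several cycles."""
    for pc in range(start_pc, start_pc + num_cycles):
        for nano in nanoprocessors:
            nano.step(pc)
    return [nano.acc for nano in nanoprocessors]


if __name__ == "__main__":
    nanos = [
        NanoProcessorModel([lambda a: a + 1, lambda a: a * 2]),
        NanoProcessorModel([lambda a: a + 3, lambda a: a + 3]),
    ]
    print(rex(nanos, 0, 2))
```

Each global counter value therefore selects one wide "instruction" made of the per-nanoprocessor entries, which is what makes the execution VLIW-like.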

EECC722 - Shaaban #66 lec # 9 Fall 2004 10-13-2004

REMARC Results

MPEG2

DES

EECC722 - Shaaban #67 lec # 9 Fall 2004 10-13-2004

Hybrid-Architecture RC Compute Models: Observation

• All single threaded
– Limited to parallelism
  • instruction level (VLIW, bit-level)
  • data level (vector/stream/SIMD)
– No task/thread level parallelism
  • Except for IO dedicated task parallel with processor task

EECC722 - Shaaban #68 lec # 9 Fall 2004 10-13-2004

Hybrid-Architecture RC Compute Models: Autonomous Coroutine

• Array task is decoupled from processor
– Fork operation / join upon completion
• Array has own
– Internal state
– Access to shared state (memory)
• NAPA supports to some extent
– Task level, at least, with multiple devices
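The fork/join pattern is the same one used for software threads, so a thread can stand in for the array as an analogy. In the sketch below the "array task" is an ordinary Python function and the shared dictionary plays the role of shared memory; all names are hypothetical.

```python
import threading


def array_task(shared, start, end):
    """Stand-in for the array: private loop state, result written to shared memory."""
    total = 0  # internal state of the array task
    for value in shared["data"][start:end]:
        total += value
    shared["sum"] = total


def run_fork_join():
    shared = {"data": list(range(10)), "sum": None}
    worker = threading.Thread(target=array_task, args=(shared, 0, 10))
    worker.start()                          # fork: array task runs decoupled
    processor_result = max(shared["data"])  # processor continues its own work
    worker.join()                           # join upon completion
    return processor_result, shared["sum"]


if __name__ == "__main__":
    print(run_fork_join())
```

The processor only reads the array's result after the join, which is the synchronization point the coroutine model relies on.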

EECC722 - Shaaban #69 lec # 9 Fall 2004 10-13-2004

OneChip (Toronto, 1998)

• Want array to have direct memory→memory operations
• Want to fit into programming model/ISA
– without forcing exclusive processor/FPGA operation
– allowing decoupled processor/array execution
• Key Idea:
– FPGA operates on memory→memory regions
– Make regions explicit to processor issue
– scoreboard memory blocks

(paper RC-2)

EECC722 - Shaaban #70 lec # 9 Fall 2004 10-13-2004

OneChip Pipeline

EECC722 - Shaaban #71 lec # 9 Fall 2004 10-13-2004

OneChip Coherency

EECC722 - Shaaban #72 lec # 9 Fall 2004 10-13-2004

OneChip Instructions

• Basic Operation is:
– FPGA: MEM[Rsource] → MEM[Rdst]
  • block sizes powers of 2
• Supports 14 “loaded” functions
– DPGA/contexts so 4 can be cached

EECC722 - Shaaban #73 lec # 9 Fall 2004 10-13-2004

OneChip

• Basic op is: FPGA: MEM → MEM
• No state between these ops
• coherence is that ops appear sequential
• could have multiple/parallel FPGA Compute units
– scoreboard with processor and each other
• Can’t chain FPGA operations?
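A minimal sketch of what scoreboarding memory blocks could look like, assuming the usual hazard rules: a processor access to a pending destination block must wait, and a processor store to a pending source block must wait. The exact OneChip coherence rules are on the coherency slide, whose figure did not survive extraction, so the rules and names below are assumptions rather than the paper's table.

```python
class BlockScoreboard:
    """Toy scoreboard tracking memory blocks used by in-flight FPGA operations."""

    def __init__(self):
        self.pending = []  # list of (src_base, dst_base, size)

    def issue_fpga(self, src_base, dst_base, size):
        """Record an FPGA MEM[src] -> MEM[dst] operation; size is a power of 2."""
        if size <= 0 or size & (size - 1):
            raise ValueError("block size must be a power of 2")
        self.pending.append((src_base, dst_base, size))

    def complete_oldest(self):
        """Retire the oldest FPGA operation, releasing its blocks."""
        self.pending.pop(0)

    def must_stall(self, addr, is_store):
        """Would a processor access to addr violate sequential appearance?"""
        for src_base, dst_base, size in self.pending:
            if dst_base <= addr < dst_base + size:
                return True  # destination not yet written by the FPGA
            if is_store and src_base <= addr < src_base + size:
                return True  # FPGA may not have read the source yet
        return False


if __name__ == "__main__":
    board = BlockScoreboard()
    board.issue_fpga(0x1000, 0x2000, 256)
    print(board.must_stall(0x2010, is_store=False))  # load from destination block
    print(board.must_stall(0x1010, is_store=False))  # load from source block
    print(board.must_stall(0x1010, is_store=True))   # store to source block
    board.complete_oldest()
    print(board.must_stall(0x2010, is_store=False))
```

Accesses outside the tracked blocks never stall, which is what lets the processor and the FPGA run decoupled while operations still appear sequential.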

EECC722 - Shaaban #74 lec # 9 Fall 2004 10-13-2004

Summary

• Several different models and uses for a “Reconfigurable Processor”:
– On computational kernels
  • seen the benefits of coarse-grain interaction
    – GARP, REMARC, OneChip
– Missing:
  • More full application (multi-application) benefits of these architectures...
• Exploit density and expressiveness of fine-grained, spatial operations
• Number of ways to integrate cleanly into processor architecture…and their limitations