ENG3050 Embedded Reconfigurable Computing Systems
Review & Final Exam
Course Objectives

This course achieves the following goals:
1. Gives an overview of the traditional Von Neumann computer architecture: its specifications, design, implementations, and main drawbacks, along with techniques to improve its performance.
2. Teaches the internal structure of programmable logic in general and Field Programmable Gate Arrays (FPGAs) in particular.
3. Teaches how digital circuits are designed today using advanced CAD tools, HDLs, and high-level languages.
4. Teaches the basic concepts of Reconfigurable Computing systems (hardware/software co-design).
5. Teaches when and how to apply Reconfigurable Computing concepts to design efficient, reliable, robust systems (e.g., DSP).
6. Develops an understanding of the concept of Run-Time Reconfiguration.
Final Exam

Final Exam (30%)
• Questions that test your understanding of the concepts introduced
• 2 hours long, 6 questions
• Only a few formulas are introduced in the course (memorize them)

Project (20%)
• Demo
• Report (can be submitted on Saturday, April 4th)
Reconfigurable Computing: Definition

Reconfigurable Computing (RC) is a computing paradigm:
• in which programmable logic devices are used to accelerate computations or applications by exploiting parallelism at different levels (bit, instruction, architectural);
• in which algorithms are implemented as a temporally and spatially ordered set of very complex tasks.

What is meant by temporal and spatial implementations?
Spatial vs. Temporal Computing

Evaluating Ax^2 + Bx + C:
• Spatial (ASIC): all operations are laid out in hardware and execute concurrently
• Temporal (Processor): the expression is rewritten as (Ax + B)x + C and executed as a sequence of instructions on shared functional units
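The temporal form can be sketched in software (an illustrative sketch, not from the slides; the function name is hypothetical):

```python
# Temporal evaluation of Ax^2 + Bx + C in Horner form (Ax + B)x + C:
# one multiply-accumulate at a time, as a processor would execute it.
def horner_eval(a, b, c, x):
    acc = a            # load A
    acc = acc * x + b  # first multiply-accumulate: Ax + B
    acc = acc * x + c  # second multiply-accumulate: (Ax + B)x + C
    return acc

# A spatial (ASIC-like) implementation would instead compute
# a*x*x, b*x and c concurrently and sum them in an adder tree.
```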
Methods for Executing Algorithms

Hardware (Application-Specific Integrated Circuits)
Advantages:
• very high performance and efficiency
Disadvantages:
• not flexible (cannot be altered after fabrication)
• expensive

Software-programmed processors
Advantages:
• software is very flexible to change
Disadvantages:
• performance can suffer if the clock is not fast
• the instruction set is fixed by the hardware

Reconfigurable computing
Advantages:
• fills the gap between hardware and software
• much higher performance than software
• higher level of flexibility than hardware
The Von Neumann Computer

Advantage:
• Flexibility: any well-coded program can be executed

Drawbacks:
• Speed efficiency: not efficient, due to sequential program execution (temporal resource sharing)
• Resource efficiency: only one part of the hardware resources is required for the execution of an instruction; the rest remains idle
• Memory access: memories are about 10 times slower than the processor

How can we compensate for these deficiencies?
Improving Performance of VN (GPPs)

1. Technology scaling: improves performance (increases clock frequency)
2. Improving the instruction set of the processor
3. Application-specific processors (DSPs)
4. Use of a hierarchical memory system: cache can enhance speed
5. Multiplicity of functional units (H/W): adders/multipliers/dividers (CDC-6600)
6. Pipelining within the CPU (H/W): a four-stage pipeline (IF/ID/OF/EX)
7. Overlapping CPU and I/O operations (H/W): DMA (Direct Memory Access) can be used to enhance performance
8. Time sharing (S/W): multi-tasking assigns fixed or variable time slices to multiple programs
9. Parallelism and multithreading (S/W and H/W): compilers / multi-core systems
Exploiting Parallelism

• Bit-level parallelism: 1970 to ~1985
  – 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction-level parallelism (ILP): ~1985 through today
  – Pipelining
  – Superscalar
  – Limits to the benefits of ILP?
• Process-level or thread-level parallelism: mainstream for general-purpose computing?
  – Servers are parallel
  – High-end desktop dual-processor PCs
Pipelining

Exploits parallelism at the instruction level. Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast. Pipelining is used not only in general-purpose processors but also in hardware accelerators.
Speed-Up

If the stages are perfectly balanced, then the time between instructions on the pipelined processor, assuming ideal conditions, is equal to:

    Time between instructions (pipelined) =
        Time between instructions (nonpipelined) / Number of pipe stages

Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages, i.e., a five-stage pipeline is nearly five times faster.

Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction.
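The relation above can be checked with a short sketch (illustrative code; the function names are my own). It models the classic fill-then-stream behavior of an ideal pipeline:

```python
# Ideal pipelining speedup. With perfectly balanced stages, the time
# between pipelined instructions equals the non-pipelined instruction
# time divided by the number of stages.
def pipelined_issue_time(nonpipelined_time, num_stages):
    return nonpipelined_time / num_stages

def speedup(nonpipelined_time, num_stages, num_instructions):
    # Without pipelining: every instruction takes the full time.
    serial = num_instructions * nonpipelined_time
    stage_time = pipelined_issue_time(nonpipelined_time, num_stages)
    # Pipelined: fill the pipe once, then finish one instruction
    # per stage time.
    pipelined = (num_stages - 1) * stage_time + num_instructions * stage_time
    return serial / pipelined

# For a large instruction count, the speedup approaches the number of
# stages, matching the slide's claim for a five-stage pipeline.
```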
Parallel Processing

Using more than one processor to solve a problem. The idea is that n processors operating simultaneously can achieve the result n times faster.

Motives:
• Diminishing returns from ILP: limited ILP in programs, and ILP is increasingly expensive to exploit
• Fault tolerance
• Large amount of memory available
Flynn's Taxonomy

Classified by instruction stream (Single/Multiple) and data stream (Single/Multiple):
• SISD (Single Instruction, Single Data): single-threaded process
• MISD (Multiple Instruction, Single Data): pipeline architecture
• SIMD (Single Instruction, Multiple Data): vector processing
• MIMD (Multiple Instruction, Multiple Data): multi-threaded programming
Speedup Factor

S(n) = (Execution time on a single processor) / (Execution time on a multiprocessor with n processors)

S(n) gives the increase in speed obtained by using a multiprocessor.

The speedup factor can also be cast in terms of computational steps:

S(n) = (Number of steps using one processor) / (Number of parallel steps using n processors)

The maximum speedup is n with n processors (linear speedup). This theoretical limit is not always achieved. WHY?
Amdahl's Law

• Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using n processors.
• The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.
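The slides do not write the formula out, but standard Amdahl's law with sequential fraction f on n processors is S(n) = 1 / (f + (1 - f)/n). A minimal sketch, which also answers why linear speedup is rarely achieved:

```python
# Amdahl's law: speedup on n processors when a fraction f of the
# program is strictly sequential (cannot be parallelized).
def amdahl_speedup(f, n):
    return 1.0 / (f + (1.0 - f) / n)

# Even with only 5% sequential code, 100 processors give far less
# than a 100x speedup; as n grows, the limit is 1/f = 20x.
```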
Issues in Configurable Design

1. Reconfigurable hardware architecture (FPGA): choice and granularity of computational elements; issues related to performance, area, and power consumption
2. Design entry techniques: low level (VHDL); high level (ESL, e.g., Handel-C)
3. Support of efficient CAD tools: high-level synthesis, logic optimization, mapping, place & route
4. Coupling approaches: tightly coupled vs. loosely coupled
5. Area versus performance: serial, semi-parallel, parallel; floating point vs. fixed point
6. Reconfiguration time and rate: static versus dynamic reconfiguration (area, performance, approaches)
Benefits of Reconfigurable Systems

• A trade-off between traditional hardware (performance) and software (flexibility): hardware-like performance with software-like flexibility
• Hardware can be modified on the fly: it can accommodate any architecture (resource management), and changes can be made in the field
• Orders-of-magnitude performance improvements over traditional software systems
• Programming can be achieved at different levels of abstraction: HDL (VHDL/Verilog) or C/C++/Matlab
Remember!

[Figure: a fine-grain FPGA fabric of programmable lookup tables (LUTs) embedded in a programmable routing structure]

The main bottleneck with state-of-the-art fine-grain FPGAs is the routing enabled by pass transistors!
Remember!

[Figure: the same fabric, with one LUT expanded: an SRAM-based truth table addressed by inputs x, y, z produces the output f]

Look-up tables are flexible, but they require lots of configuration and suffer from power dissipation!
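What a LUT does can be sketched as follows (an illustrative model, not vendor code): a k-input LUT is just 2^k configuration bits addressed by the inputs, so loading a truth table programs any k-input Boolean function.

```python
# A k-input lookup table (LUT): 2^k SRAM configuration bits,
# addressed by the input signals. Programming the bits selects
# which k-input Boolean function the LUT implements.
class LUT:
    def __init__(self, truth_table):
        # truth_table[i] is the output when the inputs, read as a
        # binary number, equal i.
        self.bits = list(truth_table)

    def eval(self, *inputs):
        addr = 0
        for bit in inputs:  # the inputs form the SRAM address
            addr = (addr << 1) | bit
        return self.bits[addr]

# Configure a 3-input LUT as f = x AND (y OR z):
lut = LUT([0, 0, 0, 0, 0, 1, 1, 1])
```

Reprogramming the same hardware is just loading a different truth table, which is the flexibility (and the configuration cost) the slide points at.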
Evolution of the FPGA

• Early FPGAs were used mainly for "glue logic" between other components: simple CLBs with a small number of inputs; the focus was on implementing "random" logic efficiently
• As capacities grew, other applications emerged: FPGAs as an alternative to custom ICs for entire applications, emulation of ASICs, computing with FPGAs
• FPGAs have changed to meet new application demands:
  – Carry chains and better support for multi-bit operations
  – Integrated memories, such as Block RAMs
  – Specialized units, such as multipliers, to implement functions that are slow/inefficient in CLBs
  – Clock managers to control the frequency of operation
  – Newer devices incorporate entire CPUs: the Xilinx Virtex-II Pro has 1-4 PowerPC CPUs
CAD for FPGAs

Design entry → synthesis → logic optimization → mapping to k-LUTs → packing LUTs into CLBs → placement → routing → FPGA configuration, with simulation used for verification along the way.
FPGA Placement Problem

• Input: a technology-mapped netlist of Configurable Logic Blocks (CLBs) realizing a given circuit
• Output: the CLB netlist placed in a two-dimensional array of slots such that total wirelength is minimized

[Figure: a CLB netlist with inputs i1-i4, outputs f1 and f2, and blocks 1-10, placed onto a two-dimensional FPGA slot array]
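The placement objective can be sketched with the common half-perimeter wirelength (HPWL) estimate (an assumption for illustration; the slides do not name the metric, and the example netlist below is made up):

```python
# Half-perimeter wirelength (HPWL): a standard estimate of the wire
# needed for a net, given (x, y) slot coordinates for each block.
def hpwl(net, placement):
    xs = [placement[b][0] for b in net]
    ys = [placement[b][1] for b in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_wirelength(nets, placement):
    return sum(hpwl(net, placement) for net in nets)

# Tiny example netlist: one net connecting blocks "a" and "b".
nets = [("a", "b")]
close_placement = {"a": (0, 0), "b": (1, 0)}
far_placement = {"a": (0, 0), "b": (3, 2)}
# A placer prefers whichever placement has the smaller total wirelength.
```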
Global vs. Detailed Routing

[Figure: two views of an FPGA fabric of logic blocks (LB) and switch boxes (SB), contrasting global routing with detailed routing]
VHDL for Synthesis

VHDL for simulation vs. VHDL for synthesis:
• Only a subset of the VHDL language is synthesizable
• The synthesizable VHDL subset is tool-specific: do not expect your VHDL description to be synthesizable with another tool
• Understand what is meant for simulation versus what is meant for producing hardware (i.e., synthesis)
VHDL Hardware Correspondence
Abstraction: Advantages

FPGA Tool Flow with ESL

[Figure: tool flow from C/C++, Java, etc. through high-level synthesis to HDL, then RT synthesis to a netlist, then physical design (technology mapping, placement, routing) to a bitfile targeting a processor + FPGA platform]
High-Level Synthesis (HLS)

• High-level synthesis:
  – Creates an RTL implementation from C-level source code
  – Extracts control and dataflow from the source code
  – Implements the design based on defaults and user-applied directives
• Many implementations are possible from the same source description:
  – Smaller designs, faster designs, optimal designs
  – Enables manual design exploration

RCS - Winter 2015
Vivado HLS

[Figure: the Vivado HLS (AutoESL) flow: C/C++/SystemC sources, a test bench, and constraints/directives enter the tool, which produces VHDL/Verilog/SystemC RTL plus an RTL wrapper and scripts with constraints, feeding RTL simulation and RTL synthesis]
Different Levels of Coupling

[Figure: a workstation (CPU, memory, caches, I/O interface) showing where reconfigurable logic can attach: as a functional unit (FU) inside the CPU or as a coprocessor (tightly coupled), or as an attached processing unit or standalone processing unit (loosely coupled)]
HW/SW Co-design

[Figure: a profiler identifies the critical regions of a software application; those regions are moved from the processor to an ASIC/FPGA running alongside it, reducing both time and energy compared with software-only execution]

Benefits:
• Speedups of 2x to 10x are typical, with far more potential than dynamic SW optimizations (~1.2x)
• Energy reductions of 25% to 95% are typical
Processor Specialization (ASIP)

[Figure: gains achieved by specializing the processor]

Exploiting Parallelism

[Figure: exploiting parallelism]
Reconfigurability

Reconfiguration is either static (execution is interrupted), partial, or partial dynamic (in parallel with execution):
• Static configuration: involves hardware changes at the slow rate of days or weeks
• Partial reconfiguration: only a subset of the configuration data is altered, but all computation halts while the modification is in progress
• Partial dynamic reconfiguration: the hardware reconfigures itself on the fly as it executes a task, refining its own programming for improved performance
Partial Dynamic Reconfiguration

Main task: dividing the algorithm into time-exclusive segments or spatial segments.

Types:
• Time-exclusive segments (temporal partitioning): complete partitions are swapped in and out
• Spatial segments: partial segments can be swapped in and out
Partial Reconfiguration: Technology and Benefits

Partial reconfiguration enables:
• System flexibility: perform more functions while maintaining communication links
• Size and cost reduction: time-multiplex the hardware to require a smaller FPGA
• Power reduction: shut down power-hungry tasks when not needed
• Fault-tolerant / self-repairing and adaptive systems
Power Reduction Techniques with PR

Many techniques can be employed to reduce power:
• Swap out high-power functions for low-power functions when maximum performance is not required
• Swap out high-power I/O standards for lower-power I/O when specific characteristics are not needed
• Swap out black boxes for inactive regions
• Time-multiplexing functions reduces power by reducing the amount of configured logic
[Figure: a self-reconfiguring platform in which a controller (MicroBlaze) drives the ICAP and a flash controller to load partial bitstreams]
Current PR Design Flow

• Steps:
  – Partition the system into modules
  – Define static modules and reconfigurable modules
  – Decide the number of PR regions (PRRs)
  – Decide PRR sizes, shapes, and locations
  – Map modules to PRRs
  – Define PRR interfaces and instantiate slice macros for the PRR interfaces
• Optimization problems:
  – Design partitioning
  – Number of PRRs
  – PRR sizes, shapes, and locations
  – Mapping PRMs to PRRs
  – Type and placement of PRR interfaces

[Figure: design partitioning, floorplanning, and budgeting: modules A, B, and C are split into static modules and reconfigurable modules (PRMs), then mapped onto a static region and PR regions (PRR 1, PRR 2) on the FPGA]
Reconfigurable Operating Systems

Modern FPGAs can be partially and dynamically reconfigured. This feature adds tremendous flexibility to the Reconfigurable Computing (RC) field, but it also introduces challenges.

Reconfigurable operating systems tend to:
• Ease application development (a higher level of abstraction)
• Ease application verification and maintenance
UoG OS for Reconfigurable Computing

Phase I: RTR platform, scheduling algorithms, DFG generator
Phase II: single GA for efficient task mapping; island-based GA (a further enhancement)
Phase III: resource prediction with machine learning
GA Task Allocation: Single GA

GA Task Allocation: Platforms

Several types of platforms can be utilized for different applications. Each platform can be more appropriate for a particular application in terms of resources used, power consumed, and performance achieved.

Running a single GA is based on a single platform. What happens if we change the platform and rerun the GA for task allocation?
GA Task Allocation: Island-Based

An island-based GA should overcome the limitations of a single-GA implementation:
• Each island consists of a GA engine and a unique platform
• The islands run in parallel to optimize performance and power consumption for a specific platform
• Every island produces a separate Pareto front based on the platform it uses for its population
• An aggregation module combines all the Pareto fronts into an aggregated solution
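The aggregation step can be sketched as follows (illustrative; the slides do not give the implementation). Treating each island's Pareto front as a set of (execution time, power) points, both to be minimized:

```python
# Merge per-island Pareto fronts (lists of (time, power) points,
# both minimized) and keep only the non-dominated points.
def dominates(a, b):
    # a dominates b if it is no worse in both objectives and
    # strictly better in at least one (here: a != b).
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def aggregate_fronts(fronts):
    candidates = [p for front in fronts for p in front]
    return sorted(
        p for p in candidates
        if not any(dominates(q, p) for q in candidates)
    )

# Hypothetical fronts from two islands, for illustration only.
island1 = [(10, 5.0), (8, 7.0)]
island2 = [(9, 4.0), (12, 3.0)]
# aggregate_fronts([island1, island2]) drops (10, 5.0), which
# (9, 4.0) dominates, and keeps the rest.
```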
Prediction of Resources

1. Data preparation module:
   • Benchmarks are generated and evaluated in terms of power consumed and execution time using a simulator
   • All necessary features that will be used to train the machine learning algorithms are extracted
2. Training module:
   • The framework utilizes the features extracted in the previous stage to train and create a model that learns from historic information
   • This module extracts useful hidden knowledge from the data and estimates/predicts resources of the reconfigurable computing system
3. Prediction module:
   • Based on the model developed in the training phase, the module attempts to predict resources for unseen task graphs
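The train/predict split above can be sketched with a toy model (purely illustrative: the slides do not name the learning algorithm, and the benchmark numbers below are made up):

```python
# Minimal one-feature linear regression by least squares, standing in
# for the (unspecified) machine-learning model of the framework.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    slope = num / den
    return slope, my - slope * mx

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

# Training data: (number of tasks in the graph) -> (resources used),
# hypothetical benchmark measurements.
tasks = [4, 8, 12, 16]
resources = [10.0, 18.0, 26.0, 34.0]
model = fit_linear(tasks, resources)
# predict(model, 10) then estimates resources for an unseen
# 10-task graph.
```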
Design Space Exploration

• What? Rapid exploration of various architectural solutions to be implemented on reconfigurable architectures, in order to select the most efficient architecture for one or several applications
• When? Takes place before architectural synthesis (algorithmic specification in a high-level abstraction language)
• How? Estimations are based on a functional architecture model (generic, technology-independent); an iterative exploration flow progressively refines the architecture definition from a coarse model to a dedicated model
Framework Overview

[Figure: the exploration framework: a designer models the user application with a modeling tool and supplies user constraints; mathematical and meta-heuristic methods search the design space over a core library of embedded processors, dedicated hardware accelerators, and communication sub-systems (point-to-point, common bus); each generated architecture is evaluated, and the proposed architecture targets an implementation platform (ASIC, FPGA, or CGRA)]