ENG3050 Embedded Reconfigurable Computing Systems
Review & Final Exam
Course Objectives

This course achieves the following goals:
1. Gives an overview of the traditional Von Neumann computer architecture: its specifications, design, implementations, and main drawbacks, along with techniques to improve its performance.
2. Teaches the internal structure of programmable logic in general and Field Programmable Gate Arrays (FPGAs) in particular.
3. Teaches how digital circuits are designed today using advanced CAD tools, HDLs, and high-level languages.
4. Teaches the basic concepts of Reconfigurable Computing systems (hardware/software co-design).
5. Teaches when and how to apply Reconfigurable Computing concepts to design efficient, reliable, robust systems (e.g., DSP).
6. Develops an understanding of the concept of Run-Time Reconfiguration.
Final Exam

Final Exam (30%)
• Questions that test your understanding of the concepts introduced
• 2 hours long, 6 questions
• Only a few formulas are introduced in the course (memorize them)

Project (20%)
• Demo
• Report (can be submitted on Saturday, April 4th)
Reconfigurable Computing: Definition

Reconfigurable Computing (RC) is a computing paradigm:
• in which programmable logic devices are used to accelerate computations or applications by exploiting parallelism at different levels (bit, instruction, architectural);
• in which algorithms are implemented as a temporally and spatially ordered set of very complex tasks.

What is meant by temporal and spatial implementations?
Spatial vs. Temporal Computing

Evaluating Ax^2 + Bx + C:
• Spatial (ASIC): all operations are laid out in hardware and execute concurrently
• Temporal (Processor): the expression is rewritten as (Ax + B)x + C and executed as a sequence of instructions on shared functional units
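The temporal form can be sketched in software (an illustrative sketch, not from the slides; the function name is hypothetical):

```python
# Temporal evaluation of Ax^2 + Bx + C in Horner form (Ax + B)x + C:
# one multiply-accumulate at a time, as a processor would execute it.
def horner_eval(a, b, c, x):
    acc = a            # load A
    acc = acc * x + b  # first multiply-accumulate: Ax + B
    acc = acc * x + c  # second multiply-accumulate: (Ax + B)x + C
    return acc

# A spatial (ASIC-like) implementation would instead compute
# a*x*x, b*x and c concurrently and sum them in an adder tree.
```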
Methods for Executing Algorithms

Hardware (Application-Specific Integrated Circuits)
Advantages:
• very high performance and efficiency
Disadvantages:
• not flexible (cannot be altered after fabrication)
• expensive

Software-programmed processors
Advantages:
• software is very flexible to change
Disadvantages:
• performance can suffer if the clock is not fast
• the instruction set is fixed by the hardware

Reconfigurable computing
Advantages:
• fills the gap between hardware and software
• much higher performance than software
• higher level of flexibility than hardware
The Von Neumann Computer

Advantage:
• Flexibility: any well-coded program can be executed

Drawbacks:
• Speed efficiency: not efficient, due to sequential program execution (temporal resource sharing)
• Resource efficiency: only one part of the hardware resources is required for the execution of an instruction; the rest remains idle
• Memory access: memories are about 10 times slower than the processor

How can we compensate for these deficiencies?
Improving Performance of VN (GPPs)

1. Technology scaling: improves performance (increases clock frequency)
2. Improving the instruction set of the processor
3. Application-specific processors (DSPs)
4. Use of a hierarchical memory system: cache can enhance speed
5. Multiplicity of functional units (H/W): adders/multipliers/dividers (CDC-6600)
6. Pipelining within the CPU (H/W): a four-stage pipeline (IF/ID/OF/EX)
7. Overlapping CPU and I/O operations (H/W): DMA (Direct Memory Access) can be used to enhance performance
8. Time sharing (S/W): multi-tasking assigns fixed or variable time slices to multiple programs
9. Parallelism and multithreading (S/W and H/W): compilers / multi-core systems
Exploiting Parallelism

• Bit-level parallelism: 1970 to ~1985
  – 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction-level parallelism (ILP): ~1985 through today
  – Pipelining
  – Superscalar
  – Limits to the benefits of ILP?
• Process-level or thread-level parallelism: mainstream for general-purpose computing?
  – Servers are parallel
  – High-end desktop dual-processor PCs
Pipelining

Exploits parallelism at the instruction level. Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast. Pipelining is used not only in general-purpose processors but also in hardware accelerators.
Speed-Up

If the stages are perfectly balanced, then the time between instructions on the pipelined processor, assuming ideal conditions, is equal to:

    Time between instructions (pipelined) =
        Time between instructions (nonpipelined) / Number of pipe stages

Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages, i.e., a five-stage pipeline is nearly five times faster.

Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction.
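The relation above can be checked with a short sketch (illustrative code; the function names are my own). It models the classic fill-then-stream behavior of an ideal pipeline:

```python
# Ideal pipelining speedup. With perfectly balanced stages, the time
# between pipelined instructions equals the non-pipelined instruction
# time divided by the number of stages.
def pipelined_issue_time(nonpipelined_time, num_stages):
    return nonpipelined_time / num_stages

def speedup(nonpipelined_time, num_stages, num_instructions):
    # Without pipelining: every instruction takes the full time.
    serial = num_instructions * nonpipelined_time
    stage_time = pipelined_issue_time(nonpipelined_time, num_stages)
    # Pipelined: fill the pipe once, then finish one instruction
    # per stage time.
    pipelined = (num_stages - 1) * stage_time + num_instructions * stage_time
    return serial / pipelined

# For a large instruction count, the speedup approaches the number of
# stages, matching the slide's claim for a five-stage pipeline.
```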
Parallel Processing

Using more than one processor to solve a problem. The idea is that n processors operating simultaneously can achieve the result n times faster.

Motives:
• Diminishing returns from ILP: limited ILP in programs, and ILP is increasingly expensive to exploit
• Fault tolerance
• Large amount of memory available
Flynn's Taxonomy

Classified by instruction stream (Single/Multiple) and data stream (Single/Multiple):
• SISD (Single Instruction, Single Data): single-threaded process
• MISD (Multiple Instruction, Single Data): pipeline architecture
• SIMD (Single Instruction, Multiple Data): vector processing
• MIMD (Multiple Instruction, Multiple Data): multi-threaded programming
Speedup Factor

S(n) = (Execution time on a single processor) / (Execution time on a multiprocessor with n processors)

S(n) gives the increase in speed obtained by using a multiprocessor.

The speedup factor can also be cast in terms of computational steps:

S(n) = (Number of steps using one processor) / (Number of parallel steps using n processors)

The maximum speedup is n with n processors (linear speedup). This theoretical limit is not always achieved. WHY?
Amdahl's Law

• Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using n processors.
• The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.
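The slides do not write the formula out, but standard Amdahl's law with sequential fraction f on n processors is S(n) = 1 / (f + (1 - f)/n). A minimal sketch, which also answers why linear speedup is rarely achieved:

```python
# Amdahl's law: speedup on n processors when a fraction f of the
# program is strictly sequential (cannot be parallelized).
def amdahl_speedup(f, n):
    return 1.0 / (f + (1.0 - f) / n)

# Even with only 5% sequential code, 100 processors give far less
# than a 100x speedup; as n grows, the limit is 1/f = 20x.
```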
Issues in Configurable Design

1. Reconfigurable hardware architecture (FPGA): choice and granularity of computational elements; issues related to performance, area, and power consumption
2. Design entry techniques: low level (VHDL); high level (ESL, e.g., Handel-C)
3. Support of efficient CAD tools: high-level synthesis, logic optimization, mapping, place & route
4. Coupling approaches: tightly coupled vs. loosely coupled
5. Area versus performance: serial, semi-parallel, parallel; floating point vs. fixed point
6. Reconfiguration time and rate: static versus dynamic reconfiguration (area, performance, approaches)
Benefits of Reconfigurable Systems

• A trade-off between traditional hardware (performance) and software (flexibility): hardware-like performance with software-like flexibility
• Hardware can be modified on the fly: it can accommodate any architecture (resource management), and changes can be made in the field
• Orders-of-magnitude performance improvements over traditional software systems
• Programming can be achieved at different levels of abstraction: HDL (VHDL/Verilog) or C/C++/Matlab
Remember!

[Figure: a fine-grain FPGA fabric of programmable lookup tables (LUTs) embedded in a programmable routing structure]

The main bottleneck with state-of-the-art fine-grain FPGAs is the routing enabled by pass transistors!
Remember!

[Figure: the same fabric, with one LUT expanded: an SRAM-based truth table addressed by inputs x, y, z produces the output f]

Look-up tables are flexible, but they require lots of configuration and suffer from power dissipation!
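What a LUT does can be sketched as follows (an illustrative model, not vendor code): a k-input LUT is just 2^k configuration bits addressed by the inputs, so loading a truth table programs any k-input Boolean function.

```python
# A k-input lookup table (LUT): 2^k SRAM configuration bits,
# addressed by the input signals. Programming the bits selects
# which k-input Boolean function the LUT implements.
class LUT:
    def __init__(self, truth_table):
        # truth_table[i] is the output when the inputs, read as a
        # binary number, equal i.
        self.bits = list(truth_table)

    def eval(self, *inputs):
        addr = 0
        for bit in inputs:  # the inputs form the SRAM address
            addr = (addr << 1) | bit
        return self.bits[addr]

# Configure a 3-input LUT as f = x AND (y OR z):
lut = LUT([0, 0, 0, 0, 0, 1, 1, 1])
```

Reprogramming the same hardware is just loading a different truth table, which is the flexibility (and the configuration cost) the slide points at.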
Evolution of the FPGA

• Early FPGAs were used mainly for "glue logic" between other components: simple CLBs with a small number of inputs; the focus was on implementing "random" logic efficiently
• As capacities grew, other applications emerged: FPGAs as an alternative to custom ICs for entire applications, emulation of ASICs, computing with FPGAs
• FPGAs have changed to meet new application demands:
  – Carry chains and better support for multi-bit operations
  – Integrated memories, such as Block RAMs
  – Specialized units, such as multipliers, to implement functions that are slow/inefficient in CLBs
  – Clock managers to control the frequency of operation
  – Newer devices incorporate entire CPUs: the Xilinx Virtex-II Pro has 1-4 PowerPC CPUs
CAD for FPGAs

Design entry → synthesis → logic optimization → mapping to k-LUTs → packing LUTs into CLBs → placement → routing → FPGA configuration, with simulation used for verification along the way.
FPGA Placement Problem

• Input: a technology-mapped netlist of Configurable Logic Blocks (CLBs) realizing a given circuit
• Output: the CLB netlist placed in a two-dimensional array of slots such that total wirelength is minimized

[Figure: a CLB netlist with inputs i1-i4, outputs f1 and f2, and blocks 1-10, placed onto a two-dimensional FPGA slot array]
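The placement objective can be sketched with the common half-perimeter wirelength (HPWL) estimate (an assumption for illustration; the slides do not name the metric, and the example netlist below is made up):

```python
# Half-perimeter wirelength (HPWL): a standard estimate of the wire
# needed for a net, given (x, y) slot coordinates for each block.
def hpwl(net, placement):
    xs = [placement[b][0] for b in net]
    ys = [placement[b][1] for b in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_wirelength(nets, placement):
    return sum(hpwl(net, placement) for net in nets)

# Tiny example netlist: one net connecting blocks "a" and "b".
nets = [("a", "b")]
close_placement = {"a": (0, 0), "b": (1, 0)}
far_placement = {"a": (0, 0), "b": (3, 2)}
# A placer prefers whichever placement has the smaller total wirelength.
```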
Global vs. Detailed Routing

[Figure: two views of an FPGA fabric of logic blocks (LB) and switch boxes (SB), contrasting global routing with detailed routing]
VHDL for Synthesis

VHDL for simulation vs. VHDL for synthesis:
• Only a subset of the VHDL language is synthesizable
• The synthesizable VHDL subset is tool-specific: do not expect your VHDL description to be synthesizable with another tool
• Understand what is meant for simulation versus what is meant for producing hardware (i.e., synthesis)
VHDL Hardware Correspondence
Abstraction: Advantages

FPGA Tool Flow with ESL

[Figure: tool flow from C/C++, Java, etc. through high-level synthesis to HDL, then RT synthesis to a netlist, then physical design (technology mapping, placement, routing) to a bitfile targeting a processor + FPGA platform]
High-Level Synthesis (HLS)

• High-level synthesis:
  – Creates an RTL implementation from C-level source code
  – Extracts control and dataflow from the source code
  – Implements the design based on defaults and user-applied directives
• Many implementations are possible from the same source description:
  – Smaller designs, faster designs, optimal designs
  – Enables manual design exploration

RCS - Winter 2015
Vivado HLS

[Figure: the Vivado HLS (AutoESL) flow: C/C++/SystemC sources, a test bench, and constraints/directives enter the tool, which produces VHDL/Verilog/SystemC RTL plus an RTL wrapper and scripts with constraints, feeding RTL simulation and RTL synthesis]
Different Levels of Coupling

[Figure: a workstation (CPU, memory, caches, I/O interface) showing where reconfigurable logic can attach: as a functional unit (FU) inside the CPU or as a coprocessor (tightly coupled), or as an attached processing unit or standalone processing unit (loosely coupled)]
HW/SW Co-design

[Figure: a profiler identifies the critical regions of a software application; those regions are moved from the processor to an ASIC/FPGA running alongside it, reducing both time and energy compared with software-only execution]

Benefits:
• Speedups of 2x to 10x are typical, with far more potential than dynamic SW optimizations (~1.2x)
• Energy reductions of 25% to 95% are typical
Processor Specialization (ASIP)

[Figure: gains achieved by specializing the processor]

Exploiting Parallelism

[Figure: exploiting parallelism]
Reconfigurability

Reconfiguration is either static (execution is interrupted), partial, or partial dynamic (in parallel with execution):
• Static configuration: involves hardware changes at the slow rate of days or weeks
• Partial reconfiguration: only a subset of the configuration data is altered, but all computation halts while the modification is in progress
• Partial dynamic reconfiguration: the hardware reconfigures itself on the fly as it executes a task, refining its own programming for improved performance
Partial Dynamic Reconfiguration

Main task: dividing the algorithm into time-exclusive segments or spatial segments.

Types:
• Time-exclusive segments (temporal partitioning): complete partitions are swapped in and out
• Spatial segments: partial segments can be swapped in and out
Partial Reconfiguration: Technology and Benefits

Partial reconfiguration enables:
• System flexibility: perform more functions while maintaining communication links
• Size and cost reduction: time-multiplex the hardware to require a smaller FPGA
• Power reduction: shut down power-hungry tasks when not needed
• Fault-tolerant / self-repairing and adaptive systems
Power Reduction Techniques with PR

Many techniques can be employed to reduce power:
• Swap out high-power functions for low-power functions when maximum performance is not required
• Swap out high-power I/O standards for lower-power I/O when specific characteristics are not needed
• Swap out black boxes for inactive regions
• Time-multiplexing functions reduces power by reducing the amount of configured logic
[Figure: a self-reconfiguring platform in which a controller (MicroBlaze) drives the ICAP and a flash controller to load partial bitstreams]
Current PR Design Flow

• Steps:
  – Partition the system into modules
  – Define static modules and reconfigurable modules
  – Decide the number of PR regions (PRRs)
  – Decide PRR sizes, shapes, and locations
  – Map modules to PRRs
  – Define PRR interfaces and instantiate slice macros for the PRR interfaces
• Optimization problems:
  – Design partitioning
  – Number of PRRs
  – PRR sizes, shapes, and locations
  – Mapping PRMs to PRRs
  – Type and placement of PRR interfaces

[Figure: design partitioning, floorplanning, and budgeting: modules A, B, and C are split into static modules and reconfigurable modules (PRMs), then mapped onto a static region and PR regions (PRR 1, PRR 2) on the FPGA]
Reconfigurable Operating Systems

Modern FPGAs can be partially and dynamically reconfigured. This feature adds tremendous flexibility to the Reconfigurable Computing (RC) field, but it also introduces challenges.

Reconfigurable operating systems tend to:
• Ease application development (a higher level of abstraction)
• Ease application verification and maintenance
UoG OS for Reconfigurable Computing

Phase I: RTR platform, scheduling algorithms, DFG generator
Phase II: single GA for efficient task mapping; island-based GA (a further enhancement)
Phase III: resource prediction with machine learning
GA Task Allocation: Single GA

GA Task Allocation: Platforms

Several types of platforms can be utilized for different applications. Each platform can be more appropriate for a particular application in terms of resources used, power consumed, and performance achieved.

Running a single GA is based on a single platform. What happens if we change the platform and rerun the GA for task allocation?
GA Task Allocation: Island-Based

An island-based GA should overcome the limitations of a single-GA implementation:
• Each island consists of a GA engine and a unique platform
• The islands run in parallel to optimize performance and power consumption for a specific platform
• Every island produces a separate Pareto front based on the platform it uses for its population
• An aggregation module combines all the Pareto fronts into an aggregated solution
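The aggregation step can be sketched as follows (illustrative; the slides do not give the implementation). Treating each island's Pareto front as a set of (execution time, power) points, both to be minimized:

```python
# Merge per-island Pareto fronts (lists of (time, power) points,
# both minimized) and keep only the non-dominated points.
def dominates(a, b):
    # a dominates b if it is no worse in both objectives and
    # strictly better in at least one (here: a != b).
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def aggregate_fronts(fronts):
    candidates = [p for front in fronts for p in front]
    return sorted(
        p for p in candidates
        if not any(dominates(q, p) for q in candidates)
    )

# Hypothetical fronts from two islands, for illustration only.
island1 = [(10, 5.0), (8, 7.0)]
island2 = [(9, 4.0), (12, 3.0)]
# aggregate_fronts([island1, island2]) drops (10, 5.0), which
# (9, 4.0) dominates, and keeps the rest.
```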
Prediction of Resources

1. Data preparation module:
   • Benchmarks are generated and evaluated in terms of power consumed and execution time using a simulator
   • All necessary features that will be used to train the machine learning algorithms are extracted
2. Training module:
   • The framework utilizes the features extracted in the previous stage to train and create a model that learns from historic information
   • This module extracts useful hidden knowledge from the data and estimates/predicts resources of the reconfigurable computing system
3. Prediction module:
   • Based on the model developed in the training phase, the module attempts to predict resources for unseen task graphs
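The train/predict split above can be sketched with a toy model (purely illustrative: the slides do not name the learning algorithm, and the benchmark numbers below are made up):

```python
# Minimal one-feature linear regression by least squares, standing in
# for the (unspecified) machine-learning model of the framework.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    slope = num / den
    return slope, my - slope * mx

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

# Training data: (number of tasks in the graph) -> (resources used),
# hypothetical benchmark measurements.
tasks = [4, 8, 12, 16]
resources = [10.0, 18.0, 26.0, 34.0]
model = fit_linear(tasks, resources)
# predict(model, 10) then estimates resources for an unseen
# 10-task graph.
```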
Design Space Exploration

• What? Rapid exploration of various architectural solutions to be implemented on reconfigurable architectures, in order to select the most efficient architecture for one or several applications
• When? Takes place before architectural synthesis (algorithmic specification in a high-level abstraction language)
• How? Estimations are based on a functional architecture model (generic, technology-independent); an iterative exploration flow progressively refines the architecture definition from a coarse model to a dedicated model
Framework Overview

[Figure: the exploration framework: a designer models the user application with a modeling tool and supplies user constraints; mathematical and meta-heuristic methods search the design space over a core library of embedded processors, dedicated hardware accelerators, and communication sub-systems (point-to-point, common bus); each generated architecture is evaluated, and the proposed architecture targets an implementation platform (ASIC, FPGA, or CGRA)]