ENG3050 Embedded Reconfigurable
Computing Systems
Hardware/Software Co-design
Embedded System Design
ENG3050 ERCS 2
Topics
• Embedded Systems
• H/S Co-Design Definition
• Motivation
• Design Steps: Profiling, Partitioning, Allocation
• Xilinx: Soft and Hard Cores, EDK/SDK, Vivado vs. ISE
References
“Embedded System Design: A Unified Hardware/Software Introduction”, Frank Vahid, Wiley, 2002.
“Hardware/Software Codesign: A systematic approach targeting data-intensive applications”, Wayne Luk, IEEE Signal Processing Magazine, May 2005.
“Hardware-Software Co-synthesis for Digital Systems”, R. Gupta, G. De Micheli, IEEE Design & Test of Computers, September 1993, pp. 29-41.
“Hardware/Software Design Space Exploration for a Reconfigurable Processor”, A. Rosa, 2003.
“A Framework for Hardware/Software Co-design”, S. Kumar, Q. Wulf, IEEE 1993.
What is an embedded system?
Embedded Systems
• Single-functioned, e.g. pager, mobile phone
• Tightly constrained
– cost, size, performance, power, etc.
• Reactive & real-time
– e.g. car’s cruise controller
– delay in computation => failure of system
Embedded Systems?
Another definition for embedded systems:
• Embedded = dedicated
• Fixed functionality – not general purpose
• Fixed in hardware
Universität Dortmund
Characteristics of Embedded Systems
• Must be efficient
– Energy efficient
– Code-size efficient (especially for systems on a chip)
– Run-time efficient
– Weight efficient
– Cost efficient
• Dedicated towards a certain application
– Knowledge about behavior at design time can be used to minimize resources and to maximize robustness
• Dedicated user interface (no mouse, keyboard and screen)
Characteristics of Embedded Systems
• Many ES must meet real-time constraints
– A real-time system must react to stimuli from the controlled object (or the operator) within the time interval dictated by the environment.
– For real-time systems, right answers arriving too late are wrong.
– “A real-time constraint is called hard, if not meeting that constraint could result in a catastrophe” [Kopetz, 1997].
– All other time constraints are called soft.
– A guaranteed system response has to be explained without statistical arguments.
A “short list” of embedded systems
Anti-lock brakes, Auto-focus cameras, Automatic teller machines, Automatic toll systems, Automatic transmission, Avionic systems,
Battery chargers, Camcorders, Cell phones, Cell-phone base stations, Cordless phones, Cruise control,
Curbside check-in systems, Digital cameras, Disk drives, Electronic card readers, Electronic instruments, Electronic toys/games,
Factory control, Fax machines, Fingerprint identifiers, Home security systems, Life-support systems, Medical testing systems,
Modems, MPEG decoders, Network cards, Network switches/routers, On-board navigation, Pagers,
Photocopiers, Point-of-sale systems, Portable video games, Printers, Satellite phones, Scanners,
Smart ovens/dishwashers, Speech recognizers, Stereo systems, Teleconferencing systems, Televisions, Temperature controllers,
Theft tracking systems, TV set-top boxes, VCRs, DVD players, Video game consoles, Video phones, Washers and dryers
And the list goes on and on
Hardware is not the whole System!
An Embedded System is the result of a projection of …
– Architecture
– Hardware
– Software
… distinguished by its gross Functional Behaviour!
• Software is an important part of the Product and must be part of the Design Process
… or we are only designing a Component of the system.
Definition – Hardware/Software Co-Design
• The design of computer systems that incorporates both standardized off-the-shelf processors, or software, as well as specialized hardware.
• The cooperative design of hardware and software components.
• The unification of currently separate hardware and software paths.
• The movement of functionality between hardware and software.
H/S Co-design: Example
• Optical wheel speed sensor.
• System constraints: Area – 40 units, Time – 100 cycles
• This could be implemented using either standardized processors, specialized hardware, or a combination of both.
[Figure: block diagram with Input Decoding, FIR Filter, Tick to Speed Inversion, Output Encoding]
H/S Co-design: Software
• Design implemented in software
• System constraints
– Area – 48 units > 40 units
– Time – 132 cycles > 100 cycles
• Design Time – 2 months
[Figure: Processor #1, Processor #2]
H/S Co-design: Hardware
• Design implemented in custom RTL hardware
• System constraints
– Area – 24 units < 40 units
– Time – 52 cycles << 100 cycles
• Beats both the area and timing constraints by at least 40%
• Design Time – 9 months
• Such a delay in design is unacceptable in a competitive world.
H/S Co-design
• Design implemented in hardware & software
• System constraints
– Area – 37 units < 40 units
– Time – 95 cycles < 100 cycles
• Design Time – 3.5 months
• Not as efficient as the custom hardware design; however, it establishes a balance between the two extremes.
[Figure: Processor #1]
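The comparison across the three implementations can be sketched in a few lines of Python. The area, cycle, and design-time figures are the ones quoted on the slides; the `Design` record and the "shortest design time among feasible designs" rule are illustrative assumptions, not a method stated in the lecture.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Design:
    name: str
    area: int       # area units
    cycles: int     # execution time in cycles
    months: float   # design time

AREA_LIMIT = 40     # system constraint from the example
CYCLE_LIMIT = 100   # system constraint from the example

DESIGNS = [
    Design("software only", 48, 132, 2.0),
    Design("custom hardware", 24, 52, 9.0),
    Design("hardware/software co-design", 37, 95, 3.5),
]

def meets_constraints(d: Design) -> bool:
    """A design is feasible when it satisfies both system constraints."""
    return d.area <= AREA_LIMIT and d.cycles <= CYCLE_LIMIT

# Keep only feasible designs, then pick the one that reaches market soonest
# (assumed selection rule for illustration).
feasible = [d for d in DESIGNS if meets_constraints(d)]
fastest_to_market = min(feasible, key=lambda d: d.months)

for d in DESIGNS:
    print(f"{d.name}: feasible={meets_constraints(d)}, design time={d.months} months")
print("chosen:", fastest_to_market.name)
```

Under these assumptions the software-only design is rejected for violating both constraints, and the co-design wins over custom hardware on design time.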
Motivations
• Achieve performance by moving software bottlenecks to hardware.
• Use hardware to meet time & area constraints which cannot be met using general purpose processors alone.
• Not possible to put everything in hardware due to limited resources.
• Some code is more appropriate for sequential implementation (i.e. achieve flexibility).
• Today’s designs are focusing on Embedded Systems, which require both hardware and software modules.
Motivations … cont
• The complexity and functionality of computer systems are increasing at a dramatic rate: System-on-Chip (SoC).
• It is difficult for custom systems to be designed, built, and verified within an acceptable time period, even with advanced CAD tools, unless standardized parts are used. (Solution?)
• Take advantage of previously designed and tested processors and IP cores (IPs) to reduce time and improve reliability.
Trade-offs/Decisions
• Given a set of specified goals, implementation technology, constraints, … designers consider trade-offs in how hardware and software components work together.
• Decisions, Constraints and Evaluations?
– Performance
– Area
– Power
– Flexibility (Programmability)
– Development & Manufacturing costs
– Reliability
– Robustness
– Maintenance
– Design evolution
Hw/Sw Co-Design: Research
Research in hardware-software co-design encompasses many interesting areas, such as:
I. System specification and modeling
II. Design Exploration
– System co-verification and co-simulation
– Code generation for hardware/software
– Hardware/Software interfacing
III. Partitioning
IV. Scheduling
However, the most important objective is to develop a unified design methodology/tool for creating systems containing both hardware and software.
A Simple Approach
[Figure: flow diagram with blocks Application, Profiling, Partitioning (S/W, H/W), Schedule tasks, Evaluation, Decision]
Profiling and Partitioning
[Figure: a profiler identifies the critical regions of software running on a processor; those regions are moved to hardware (ASIC/FPGA). Bar charts compare time and energy for SW only vs. HW/SW.]
Benefits
• Speedups of 2X to 10X typical
• Far more potential than dynamic SW optimizations (1.2x)
• Energy reductions of 25% to 95% typical
Profiling
• Profiling allows you to learn where your program spent its time and which functions called which other functions while it was executing.
• The profiler uses information collected during the actual execution of your program; therefore, it can be used on programs that are too large or too complex to analyze by reading the source.
• This information can show you which pieces of your program are slower than you expected. These might be candidates for either:
– Rewriting code to make your program execute faster.
– Moving these functions to hardware.
Profiling
• Profiling is an analysis of software performance
– Where time is being spent, per routine
– How many times functions are being called
– Included tool in SDK
– Which algorithms to consider moving to hardware
• Results in two useful formats
– Samples per function: how much time is spent in each routine
– Function call graph: which routine calls which function, and how many times
How Does Profiling Work?
• Hardware/software intrusive
– Requires a hardware timer
– Requires a dedicated area in memory
– Executable is modified with profiler routines
• A dedicated hardware timer interrupts the processor at a fixed interval
– The interrupt routine keeps track of the program counter at each interrupt
– A histogram of PC locations is kept in profile RAM
– Interrupt interval time is programmable
• Every function call in the software application is annotated by the compiler to track which functions are being called
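The timer-driven sampling described here can be mimicked in software. The sketch below is a simulation only: the PC trace, the address ranges 0x100 and 0x200, the sampling interval, and the bin size are all made-up values chosen to show how a histogram of PC locations approximates where time is spent.

```python
from collections import Counter

def sample_histogram(pc_trace, interval, bin_size):
    """Simulate a timer interrupt every `interval` cycles.

    pc_trace[i] is the program counter at cycle i. Each sample
    increments the histogram bin that the sampled PC falls into.
    """
    hist = Counter()
    for cycle in range(interval - 1, len(pc_trace), interval):
        hist[pc_trace[cycle] // bin_size] += 1
    return hist

# Hypothetical program: 900 cycles in a loop around 0x100,
# then 100 cycles in a routine around 0x200.
trace = [0x100 + (i % 16) for i in range(900)] + \
        [0x200 + (i % 16) for i in range(100)]

hist = sample_histogram(trace, interval=10, bin_size=16)
total = sum(hist.values())
for bin_index, count in sorted(hist.items()):
    print(f"PC bin starting at {bin_index * 16:#x}: {count} samples "
          f"({100 * count / total:.0f}% of time)")
```

A shorter interval gives more samples and finer resolution at the cost of more interruptions; a smaller bin size separates neighbouring routines better but needs more profile memory.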
Profiling: Steps
• You must compile and link your program with profiling enabled.
    cc -o myprog.exe myprog.c utils.c -g -pg
• You must then execute your program to generate a profile data file.
– Your program will write the profile data into a file called ‘gmon.out’ just before exiting.
• You must run gprof to analyze the profile data.
    gprof options myprog.exe gmon.out > outfile
– The gprof program prints a flat profile and a call graph.
Profiling: Useful Hints
Options:
• -e function_name: tells gprof NOT to print information about the function function_name (and its children …) in the call graph.
• -f function_name: causes gprof to limit the call graph to the function function_name and its children.
• -b: gprof doesn’t print the verbose blurbs that try to explain the meaning of all of the fields in the tables.
Profiling: Flat Profile
• % time: the percentage of the total execution time your program spent in this function.
• cumulative seconds: the cumulative total number of seconds the computer spent executing this function, plus the time spent in all the functions listed above it in the table.
• self seconds: the number of seconds accounted for by this function alone.
• calls: the total number of times the function was called.
• self ms/call: the average number of milliseconds spent in this function per call.
• total ms/call: the average number of milliseconds spent in this function and its descendants per call.
• name: the name of the function.
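To make the column definitions concrete, the following sketch derives most of them from raw sample counts. It is not gprof's implementation; the function names, sample counts, call counts, and the 10 ms sampling period are hypothetical. The total ms/call column is left out because it needs call-graph information that this sketch does not model.

```python
def flat_profile(samples, calls, sample_period_s):
    """Build flat-profile rows from per-function sample and call counts.

    Rows are ordered by decreasing self time, so "cumulative seconds"
    is a running total down the table.
    """
    total_samples = sum(samples.values())
    rows, cumulative = [], 0.0
    for name in sorted(samples, key=samples.get, reverse=True):
        self_s = samples[name] * sample_period_s
        cumulative += self_s
        rows.append({
            "pct_time": 100.0 * samples[name] / total_samples,
            "cumulative_s": cumulative,
            "self_s": self_s,
            "calls": calls[name],
            "self_ms_per_call": 1000.0 * self_s / calls[name],
            "name": name,
        })
    return rows

# Hypothetical measurement: one sample every 10 ms.
samples = {"decode": 20, "fir_filter": 70, "encode": 10}
calls = {"decode": 100, "fir_filter": 1000, "encode": 100}
rows = flat_profile(samples, calls, sample_period_s=0.01)

for r in rows:
    print(f"{r['pct_time']:6.2f} {r['cumulative_s']:8.2f} {r['self_s']:8.2f} "
          f"{r['calls']:6d} {r['self_ms_per_call']:8.2f}  {r['name']}")
```

In this made-up run `fir_filter` accounts for 70% of the time, which would make it the first candidate for rewriting or for moving to hardware.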
Simple Approach: Drawbacks
I. Some functions might not be easily mapped onto hardware.
II. Decisions taken very early at profiling phase might not be optimal.
III. No consideration for interfacing and communication.
IV. If the application changes slightly then we need to re-profile and re-partition.
Applications Not Suitable for RCS
Not all applications are suitable for Reconfigurable Computing:
• Applications that involve extensive recursion, for example, are a poor match because the synthesized “hardware” must be of fixed size.
• Applications that have only a small percentage of parallelism (1-5%) will not take advantage of RCS.
• Applications that are I/O bound will also suffer due to memory I/O transfer.
• Applications that require floating point arithmetic.
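One way to see why a small parallel fraction gains little is Amdahl's law. The slides do not name it; it is brought in here as a standard estimate, and the acceleration factors below are arbitrary illustrative values.

```python
def amdahl_speedup(parallel_fraction: float, accel: float) -> float:
    """Overall speedup when only `parallel_fraction` of the run time
    is accelerated by a factor `accel` (Amdahl's law)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / accel)

# 5% of the work moved to hardware that is 100x faster:
small = amdahl_speedup(0.05, 100.0)
# 90% of the work moved to hardware that is only 10x faster:
large = amdahl_speedup(0.90, 10.0)

print(f"5% parallel, 100x hardware: {small:.3f}x overall")
print(f"90% parallel, 10x hardware: {large:.3f}x overall")
```

Even with arbitrarily fast hardware, a 5% parallel fraction caps the overall speedup at 1/0.95, roughly 1.05x, which rarely justifies the cost of a reconfigurable implementation.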
Design Space Exploration
[Figure: two candidate architectures (Architecture #1, Architecture #2) assembled from computation templates (RISC, DSP, FPGA, SDRAM, Cipher, LookUp), communication templates, and scheduling/arbitration policies (static, dynamic, fixed priority, EDF, TDMA, FCFS, proportional share, WFQ).]
Which architecture is better suited for our application?
H/S Codesign: A Framework
[Figure: co-design framework with the following stages]
• System Representation
• Decomposition (break down system functions into a collection of sub-functions)
• H/S Partitioning (determine which of the sub-functions should be implemented in hardware and which in software)
• Refinement (produce a hardware/software alternative via evaluation)
• System Evaluation
• System Integration
Co-Synthesis/Co-Design
Partitioning & Scheduling
• Task partitioning and task scheduling are required in many applications, for instance co-design systems, multiprocessing systems, and high-level synthesis.
• Sub-tasks extracted from the input description should be implemented in:
– Where? The right place (using the partitioner/placer)
– When? The right time (using the scheduler)
• It is well known that such scheduling and partitioning problems are NP-complete.
• Optimization techniques based on heuristic methods are generally employed to explore the search space so that feasible and near-optimal solutions can be found.
System Partitioning
Good partitioning mechanism:
1) Minimizes communication across the bus
2) Allows parallelism: both hardware (FPGA) and processor operating concurrently
3) Load balancing: near-peak processor utilization at all times (performing useful work)
[Figure: a specification, process (a, b, c) in port a, b; out port c; { read(a); … write(c); }, is captured as a model and partitioned between the Processor, e.g. Line () { a = … … detach }, and the FPGA; the interface between them is synthesized.]
Terminology: Hypergraphs
• A netlist is a hypergraph.
• Hypergraphs can be approximated as graphs by breaking each hyperedge into a clique of edges.
• A hypergraph $H = \langle V, E_h \rangle$
– $V$ is a set of vertices
– each hyperedge $h \in E_h$ is a subset of vertices, i.e. $E_h \subseteq 2^V$
• A graph $G = \langle V, E \rangle$
– $V$ is a set of vertices
– each edge $e \in E$ is a pair of vertices $(u, v)$
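The clique approximation mentioned above can be sketched as follows. The representation of a netlist as a list of vertex sets and the example nets are assumptions made for illustration.

```python
from itertools import combinations

def clique_expand(hyperedges):
    """Approximate a hypergraph by a graph: every hyperedge (net) with
    k vertices becomes a clique of k*(k-1)/2 two-vertex edges."""
    edges = set()
    for net in hyperedges:
        for u, v in combinations(sorted(net), 2):
            edges.add((u, v))
    return edges

# Hypothetical netlist: one 3-pin net and one 2-pin net.
netlist = [{"a", "b", "c"}, {"c", "d"}]
graph_edges = clique_expand(netlist)
print(sorted(graph_edges))
```

The expansion is only an approximation: cutting a 3-pin net once can cut two of its clique edges, so graph cut counts overstate hypergraph cut counts.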
Bi-partitioning Problem
• Given a (hyper)graph $G$, find a partition $P$ of $V$ into $V_1, V_2$ such that $V_1 \cap V_2 = \emptyset$ and $V_1 \cup V_2 = V$,
• minimizing the number of edges that cross the cut:
  $\min c(P) = \sum_{h} w(h)$, summed over all edges $h$ for which $u \in V_1$ and $v \in V_2$, where $u$ and $v$ are connected by edge $h$,
• subject to a capacity constraint that bounds the ratio $|V_1| / |V_2|$ from above and below.
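The objective and the capacity constraint can be written down directly. In this sketch the nets, weights, partition, and the ratio bounds `low` and `high` are hypothetical; the slide's own bound values are not recoverable from the text.

```python
def cut_cost(nets, weights, part):
    """c(P): total weight of nets with vertices on both sides of the cut.

    part maps each vertex to 0 (V1) or 1 (V2)."""
    return sum(w for net, w in zip(nets, weights)
               if len({part[v] for v in net}) > 1)

def is_balanced(part, low, high):
    """Capacity constraint: low <= |V1| / |V2| <= high."""
    n1 = sum(1 for side in part.values() if side == 0)
    n2 = len(part) - n1
    return n2 > 0 and low <= n1 / n2 <= high

# Hypothetical weighted netlist and partition.
nets = [("a", "b"), ("b", "c", "d"), ("d", "e")]
weights = [1, 2, 1]
part = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}

cost = cut_cost(nets, weights, part)
balanced = is_balanced(part, low=0.5, high=2.0)
print("cut cost:", cost, "balanced:", balanced)
```

Only the 3-pin net spans the cut here, so the cost equals its weight.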
Bipartitioning Approaches
• Exact Methods:
– Mixed Integer Programming (using Branch and Bound)
– min-cut / max-flow (Ford-Fulkerson 1962)
  · maximum flow through graph = minimum cut
  · useful for establishing unconstrained bound
• Heuristics (Local Search)
– Kernighan-Lin (1970)
  · operates on graphs
  · swap all nodes once, in pairs that yield max. gain
  · choose greatest gain over pass, repeat until no improvement
  · O(n^2 log n)
– Fiduccia-Mattheyses (1982)
  · operates on hypergraphs
  · O(p), linear time!
• Meta Heuristics (avoid getting stuck in local minima)
– Simulated annealing
  · select some random moves based on “temperature”
  · design hopefully “cools” into optimal solution
  · computationally intensive
– Tabu Search
– Genetic Algorithms
– Particle Swarm Optimization
Fiduccia-Mattheyses
- generate initial partition
- calculate gain g(c) of moving each cell
while improvement {
    clear cells being locked;
    while max g(c) > 0 | c not locked {        (one pass: O(p))
        select cell with max g(c) | c not locked;
        move c across the cut;
        c → locked;
        update g(c) for all of c’s neighbors;
    }
}
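A minimal Python sketch of the loop above follows. It mirrors the pseudocode's structure (move the unlocked cell with the largest positive gain, lock it, repeat), but it recomputes gains naively instead of using the bucket array, so it does not achieve the O(p) bound. The example graph, the initial partition, and the `max_imbalance` tolerance are assumptions chosen for illustration.

```python
def cut_size(nets, side):
    """Number of nets with cells on both sides of the cut."""
    return sum(1 for net in nets if len({side[v] for v in net}) > 1)

def gain(nets, side, cell):
    """Reduction in cut size if `cell` moves to the other side."""
    before = cut_size(nets, side)
    side[cell] ^= 1              # tentative move
    after = cut_size(nets, side)
    side[cell] ^= 1              # undo
    return before - after

def balanced_after_move(side, cell, max_imbalance):
    n_from = sum(1 for s in side.values() if s == side[cell])
    n_to = len(side) - n_from
    return abs((n_from - 1) - (n_to + 1)) <= max_imbalance

def fm_pass(nets, side, max_imbalance=2):
    """One pass: each cell moves at most once (it is locked after moving)."""
    side, locked = dict(side), set()
    while True:
        movable = [c for c in side if c not in locked
                   and balanced_after_move(side, c, max_imbalance)]
        if not movable:
            break
        cell = max(movable, key=lambda c: gain(nets, side, c))
        if gain(nets, side, cell) <= 0:
            break
        side[cell] ^= 1          # move the cell across the cut
        locked.add(cell)
    return side

def fm(nets, side, max_imbalance=2):
    """Repeat passes while the cut keeps improving."""
    while True:
        new_side = fm_pass(nets, side, max_imbalance)
        if cut_size(nets, new_side) >= cut_size(nets, side):
            return side
        side = new_side

# Hypothetical graph: triangles a-b-c and d-e-f joined by edge c-d.
nets = [("a", "b"), ("b", "c"), ("a", "c"),
        ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
start = {"a": 0, "b": 0, "c": 1, "d": 0, "e": 1, "f": 1}
result = fm(nets, start)
print("cut before:", cut_size(nets, start), "cut after:", cut_size(nets, result))
```

On this input the first pass moves c and then d, after which no unlocked cell has positive gain and the second pass finds nothing further to improve.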
Example
[Figure: graph with vertices a, b, c, d, e, f]
• all edges have unit weight
• given balance criteria: |V1| − 1 ≤ |V2| ≤ |V1| + 1
• goal: partition graph into two disjoint halves so as to minimize the number of hyperedges that span the cut
Example (cont’d)
Step 1.
• random partition assigned to keep balance
• number of cuts = 5
[Figure: graph with vertices a to f split into two halves]
Example (cont’d)
Step 2.
• initial gains are calculated for each cell
• results are placed into bucket array
• number of cuts = 5
[Figure: cells annotated with their gains: +1, +2, +2, +1, −1, +2]
Example (cont’d)
Step 3.
• cell is selected
• gains of critical nets are updated
• cell is locked from further movement
• number of cuts = 3
[Figure: cells annotated with updated gains: +1, 0, 0, +1, −1, 0]
Example (cont’d)
Step 3.
• another cell is selected
• gains of critical nets are updated
• cell is locked from further movement
• number of cuts = 2
[Figure: cells annotated with updated gains: 0, 0, 0, −1, −1, 0]
Interfacing
• Interfacing between software and hardware modules is crucial for successful co-design:
I. How data is passed between sub-modules efficiently.
II. The rate of exchange of information between modules.
[Figure: flow with System Description, Hw/Sw Partitioning, Co-Synthesis (Software, Interface, Hardware), System Integration, Co-Simulation]
Interface Models: FIFO
• Synchronization through a FIFO
• FIFO can be implemented either in hardware or in software
• Effectively reconfigure hardware (FPGA) to allocate buffer space as needed
• Interrupts used for software version of FIFO
[Figure: processes p1, p2, p3 exchanging data d1, d2, d3 (and r2, r3) through an FPGA Control/Data FIFO]
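A software model of FIFO-based synchronization can look like the sketch below. The `BoundedFifo` class, its depth, and the item names are illustrative assumptions; a real design would use a hardware FIFO or an interrupt-driven software queue as the slide notes.

```python
from collections import deque

class BoundedFifo:
    """Fixed-depth FIFO: the producer must wait when it is full,
    the consumer must wait when it is empty."""

    def __init__(self, depth: int):
        self.depth = depth
        self._items = deque()

    def full(self) -> bool:
        return len(self._items) >= self.depth

    def empty(self) -> bool:
        return not self._items

    def push(self, item) -> bool:
        """Returns False (producer stalls) if the FIFO is full."""
        if self.full():
            return False
        self._items.append(item)
        return True

    def pop(self):
        """Returns None (consumer stalls) if the FIFO is empty."""
        return self._items.popleft() if self._items else None

# Producer (e.g. the hardware side) tries to write three items
# into a depth-2 FIFO; the third write stalls until the consumer reads.
fifo = BoundedFifo(depth=2)
accepted = [fifo.push(d) for d in ("d1", "d2", "d3")]
first = fifo.pop()
retry = fifo.push("d3")
drained = [fifo.pop(), fifo.pop(), fifo.pop()]
print(accepted, first, retry, drained)
```

The depth is the design knob: a deeper FIFO absorbs longer bursts from the faster side at the cost of more buffer space.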
Warp Processors
[Figure: MIPS/ARM processor with I$ and D$, a Profiler, a Dynamic Partitioning Module (DPM), and Configurable Logic]
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic & update software binary
5. Partitioned application executes faster with lower energy consumption
Summary
• Hardware/Software co-design is becoming the common design style for building systems.
• H/S co-design allows the majority of a system to be designed quickly with standardized parts, while special purpose hardware is used for time-critical portions of the system.
• Xilinx and Altera provide a complete flow for H/S co-design.
• Issues:
I. How to partition the system?
II. Communication overhead!
III. Platforms to be used
IV. Languages that support this paradigm
Xilinx Tools
Embedded CPUs
• PowerPC 405 (hard core)
– 32-bit embedded PowerPC RISC architecture
– Up to 450 MHz
– 2 x 16 kB instruction and data caches
– Memory management unit (MMU)
– Embedded in Virtex-II Pro and Virtex-4/5/6
• ARM Cortex-A9 (hard core)
– 32-bit multicore processor
– Up to 900 MHz
– Xilinx Zynq-7000 processing platform
– Device is processor based, attached to FPGA
– High level of performance
– Reduces power, cost, size
• MicroBlaze (soft core)
– 32-bit RISC architecture
– 2 to 64 kB instruction and data caches
– Hardware multiply and divide
– OPB and LMB bus interfaces...
Embedded Processors
Virtex-4 Processors:
Embedded Processor     Core Type  Max Clock Frequency  Slices  PLBs  Block RAMs
PowerPC                Hard       222 MHz              1000    250   9
MicroBlaze             Soft       180 MHz              940     235   9
PicoBlaze              Soft       221 MHz              333     84    3
PicoBlaze (optimized)  Soft       233 MHz              274     69    3
• Hard core: faster, fixed position, few devices
• Soft core: slower, can be placed anywhere, applicable to many devices
PowerPC-based Embedded Design
[Figure: PowerPC 405 core (dedicated hard IP) with ISOCM and DSOCM BRAM, connected through the Processor Local Bus (PLB, with arbiter; instruction and data sides) to hi-speed peripherals (e.g. memory controller, Gb E-Net, RocketIO) and off-chip memory (ZBT SRAM, DDR SDRAM, SDRAM); a bus bridge links to the On-Chip Peripheral Bus (OPB, with arbiter) carrying flexible soft IP such as UART, GPIO, and on-chip peripherals; a DCR bus is also shown.]
• Full system customization to meet performance, functionality, and cost goals
• IBM CoreConnect™ on-chip bus standard: PLB, OPB, and DCR
MicroBlaze-based Embedded Design
[Figure: MicroBlaze 32-bit RISC core (flexible soft IP) with a BRAM local memory bus, I-cache and D-cache BRAM of configurable sizes, LocalLink™ FIFO channels (0, 1 … 32) to custom functions, and an On-Chip Peripheral Bus (OPB, with arbiter) carrying UART, 10/100 E-Net, on-chip peripherals, and off-chip FLASH/SRAM memory. Possible in Virtex-II Pro: a bus bridge to the Processor Local Bus (PLB) of a PowerPC 405 core (dedicated hard IP) with hi-speed peripherals, Gb E-Net, and memory controller.]
MicroBlaze: Architecture & Features
• RISC
• Thirty-two 32-bit general purpose registers
• 32-bit instruction word with three operands and two addressing modes
• Separate 32-bit instruction and data buses: OPB (On-chip Peripheral Bus)
• Separate 32-bit instruction and data buses: LMB (Local Memory Bus)
[Figure: MicroBlaze architecture and features, showing the OPB and LMB interfaces]
Embedded Development Tool Flow Overview
[Figure: two parallel flows. Standard embedded SW development flow: C Code, Compiler/Linker, Object Code, (Simulator), Debugger, with CPU code in on-chip memory or in off-chip memory, then Download to Board & FPGA. Standard FPGA HW development flow: VHDL/Verilog, Simulator, Synthesizer, Place & Route, then Download to FPGA.]
EDK
• The Embedded Development Kit (EDK) consists of the following:
– Xilinx Platform Studio – XPS
– Base System Builder – BSB
– Create and Import Peripheral Wizard
– Hardware generation tool – PlatGen
– Library generation tool – LibGen
– Simulation generation tool – SimGen
– GNU software development tools
– System verification tool – XMD
– Virtual Platform generation tool – VPgen
– Software Development Kit (Eclipse)
– Processor IP
– Drivers for IP
– Documentation
• Use the GUI or the shell command tool to run EDK
EDK Files
• MHS = Microprocessor Hardware Specification
• MSS = Microprocessor Software Specification
• MPD = Microprocessor Peripheral Description
• PAO = Peripheral Analyze Order
• BBD = Black-Box Definition
• MDD = Microprocessor Driver Description
• BMM = BRAM Memory Map
Design Flow: Hardware I
[Figure: in EDK / Xilinx Platform Studio, the platform definition (peripherals, configuration, connectivity, address space) is captured in *.mhs and used to generate the hardware netlist.]
Design Flow: Hardware II, ISE Env
[Figure: hardware flow. In XPS, the platform definition (peripherals, configuration, connectivity, address space) in *.mhs feeds Generate Netlist. In ISE (Platform Ext., Proj. Nav. / VHDL), the netlist and the *.ucf constraints feed Generate Bitstream, which produces *.bit.]
EDK: Embedded Development Kit
XPS: Xilinx Platform Studio
ISE: Integrated Software Environment
MHS: Microprocessor Hardware Specification
Design Flow: Software
[Figure: the hardware flow (XPS platform definition in *.mhs, Generate Netlist, ISE with *.ucf, Generate Bitstream, *.bit) shown alongside the software flow: Gen. Libs produces *.h; the *.c and *.asm sources go through Compile & Link to produce *.elf.]
Design Flow: Combine HW + SW
[Figure: the hardware flow (*.mhs, Generate Netlist, ISE with *.ucf, Generate Bitstream, *.bit and *.bmm) and the software flow (Gen. Libs, *.h, *.c and *.asm, Compile & Link, *.elf) are combined by an Update Bitstream step that produces the final *.bit.]
The PS and the PL
• The Zynq-7000 AP SoC architecture consists of two major sections
– PS: Processing system
  · Dual ARM Cortex-A9 processor based
  · Multiple peripherals
  · Hard silicon core
– PL: Programmable logic
  · Uses the same 7 series programmable logic
  · Artix™-based devices: Z-7010, Z-7015, Z-7020 (high-range I/O banks only)
  · Kintex™-based devices: Z-7030, Z-7035, Z-7045, Z-7100 (mix of high-range and high-performance I/O banks)
© Copyright 2014 Xilinx
Zynq-7000 AP SoC Block Diagram
Zynq Architecture Built-in Peripherals
• Two USB 2.0 OTG/Device/Host
• Two Tri-Mode GigE (10/100/1000)
• Two SD/SDIO interfaces
– Memory, I/O and combo cards
• Two each of CAN 2.0B, SPI, I2C, UART
• Four 32-bit GPIO blocks
– 54 available through MIO; others available through EMIO
• Multiplexed Input/Output (MIO)
– Multiplexed pinout of peripheral and static memories
• Extended MIO (EMIO)
– Maps PS peripheral ports to the PL
General-Purpose I/O
• GPIO blocks
– Four separate banks of 32 GPIO bits each
  · Two banks connect to the 54 MIO pins (32 bits and 22 bits, respectively)
  · Two banks connect to EMIO (64 bits)
– Each GPIO bit can be dynamically programmed as input or output
– Reset values independently configurable for each bit
– Programmable interrupt generation for each bit
  · One interrupt generated per GPIO bank
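The bank layout above implies a simple mapping from a flat GPIO pin number to a bank and a bit position. The sketch below assumes a flat numbering in which pins 0 to 53 are the MIO pins and the EMIO bits follow; that numbering scheme and the function name are assumptions for illustration, not taken from the slides.

```python
# Usable bits per bank: two MIO banks (32 + 22 = 54 pins),
# then two EMIO banks (32 + 32 = 64 bits).
BANK_WIDTHS = [32, 22, 32, 32]

def pin_to_bank(pin: int):
    """Map a flat pin number to (bank index, bit within bank)."""
    if pin < 0:
        raise ValueError("pin number must be non-negative")
    base = 0
    for bank, width in enumerate(BANK_WIDTHS):
        if pin < base + width:
            return bank, pin - base
        base += width
    raise ValueError(f"pin {pin} is beyond the last GPIO bank")

for pin in (0, 31, 32, 53, 54, 117):
    print(pin, "->", pin_to_bank(pin))
```

Because one interrupt is generated per bank, an interrupt handler built on this mapping would first identify the bank and then inspect the per-bit status within it.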
Embedded System Design Flow using Vivado & SDK
• Create a new project, or open an existing project
• Add/create a new embedded source in Vivado
• Use the IP integrator, block automation, and connection automation features of Vivado to construct (modify) the hardware portion of the embedded design
• Create (update) the top-level HDL model
• Add additional logic at the top level
• Synthesize, implement, and generate the bitstream for the design in Vivado
• Export the bitstream and the processor hardware description, and launch SDK
Embedded System Design Flow using Vivado & SDK (cont.)
• Create a new:
– Software board support package (BSP) and
– Application projects in the SDK
• Compile the software with the GNU cross-compiler in SDK
• [optional] Download the programmable logic’s completed bitstream using SDK or through a hardware session in Vivado
• Use SDK to download the program (the ELF file)
Embedded System Design using Vivado and SDK
Vivado
1. Launch Vivado
2. Create Block Design
3. Add PS7
4. Configure PS settings
5. Run Block Automation
6. Add and configure IPs
7. Run Connection Automation
8. Create Top-Level HDL
9. Add Constraints (file)
10. Generate Bitstream => .bit
11. Export hardware to SDK
SDK
12. Create Board Support Package
13. Create or add Software Project
14. Build application => .elf
15. Program bitstream & .elf into Zynq (ZedBoard)
Zynq Device Processing System Configuration
Run Block Automation
• Creates a default configuration for the platform
• Based on the board specified in project settings
– E.g. ZedBoard: DDR, GPIO, UART, USB, QSPI
Extending Hardware in IP Integrator
• Add IPs
• Configure IPs
• Run Connection Automation
Extending Hardware in Vivado
• Create a top-level HDL model
• Optionally, add other HDL files to the design
• Add user constraint files to connect PL pins
– PS/MIO handled automatically
– If any pin constraints are missing (the I/O standard must be explicitly specified), the tools will error out during bitstream generation
• Generate bitstream for PL
Embedded System Tools: Software
• Eclipse IDE-based Software Development Kit (SDK)
– Board support package creation: LibGen
– GNU software development tools
– C/C++ compiler for the ARM Cortex-A9 processor (gcc)
– Debugger for the ARM Cortex-A9 processor (gdb)
• Board support packages (BSPs)
– Stand-alone BSP
  · Free basic device drivers and utilities from Xilinx
  · NOT an RTOS
SDK Workbench Views
1. C/C++ project outline displays the elements of a project with file decorators (icons) for easy identification
2. C/C++ editor for integrated software creation
3. Code outline displays elements of the software file under development with file decorators (icons) for easy identification
4. Problems, Console, Properties views list output information associated with the software development flow
Export Hardware Design to SDK
• Software development is performed with the Xilinx Software Development Kit tool (SDK)
• An XML description (.hdf Hardware Description file) of the hardware is imported in the SDK tool
– The hardware platform is built on this description
– Only one hardware platform for an SDK project
• SDK will then associate user software projects to hardware
Build Software Application in SDK
• Create software platform
– System software, board support package
– LibGen program
• Create software application
• Optionally, create linker script
• Build project
– compile, assemble, link; output file <app_project>.elf
Software Management Settings
• Software is managed in three major areas
– Compiler/Linker Options
  · Application program
– Software Platform Settings
  · Board support package
– Linker Script Generation
  · Assigning software to memory resources
GNU Tools: GCC
• GCC translates C source code into assembly language
• GCC also functions as the user interface, passing options to the GNU assembler and to the GNU linker, calling the assembler and the linker with the appropriate parameters
• Supported cross-compilers
• ARM processor compiler
– GNU GCC (arm-xilinx-eabi-gcc)
– GNU Linux GCC (arm-xilinx-linux-eabi-gcc)
GNU Tools: AS
• Input: assembly language files
– File extension: .s
• Output: object code
– File extension: .o
• Contains
– Assembled piece of code
– Constant data
– External references
– Debugging information
• Typically, the compiler automatically calls the assembler
• Use the -Wa switch to pass options through to the assembler when assembly-only source files are built with gcc
GNU Tools: Linker (LD)
• Inputs
– Several object files
– Archived object files (library)
– Linker script (*.ld)
• Outputs
– Executable image (ELF)
– Map file
Integrated Xilinx Tools
• Xilinx additions to the Eclipse IDE
– Software Repositories
– BSP Settings
– Generate Linker Script
– Program the programmable logic
  · Bitstream must be available
– Program Flash Memory
– Launch XMD Console
– Launch Shell
– Create Zynq Boot Image
– SysGen Co-Debug Settings
– Configure JTAG Settings
Configuring FPGA and Downloading Application
Download the bitstream
– Only required if PL is used
– Input file <top_name>.bit
The Xilinx hardware session allows downloading the bitstream into the target
The hardware session can be created from
– SDK
– Vivado
Requires that the download cable is connected
Profiling Procedure
Set profiling for the BSP
– Enable software intrusive profiling
– Enable the -pg option
Set profiling for the application
– Enable the compiler for profiling with the -pg option
– Configure the profiler memory
– Set the interrupt frequency and bin value
Compile, link, and generate the ELF executable
Download the executable into a hardware or software simulator
Run the software application until completion or for a set amount of time
Execute the GNU gprof tool to view the generated profile report
Configuring the Software Platform Settings
Select Xilinx Tools > Board Support Package Settings
Select standalone
Enable software profiling
Select the profiling timer
Select CPU_cortexa9
– Add -pg to the Value column for the extra_compiler_flags option
Profile Configuration: Create a Run Configuration
If any of the embedded design resides in programmable logic, download the bitstream to the programmable logic
– Select Xilinx Tools > Program FPGA
Select Run > Run Configurations and create a new configuration
– Give it an appropriate name
– Select the ELF file that was compiled with -pg
Set Profile Option in Run Configuration
In the Profile Options tab
– Enable profiling
– Set the sampling frequency at which the timer will interrupt
• A higher frequency requires more memory but gives a finer resolution
– Set the histogram bin size
– Set the location of RAM that the profiler can use
• Make sure that the software application is not using this memory
Click Run to download the program and begin execution
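To make the sampling frequency, bin size, and profiler memory settings more concrete, here is a minimal sketch of the kind of program-counter histogram a timer-driven sampling profiler fills. It is not the Xilinx profiler: the addresses, the text-segment size, and the 2-byte counter width are illustrative assumptions.

```python
"""Sketch of a PC-sampling histogram, the data structure behind a flat profile.

All addresses, sizes, and the 2-byte counter width are illustrative assumptions.
"""


def histogram_bytes(text_size, bin_size, counter_bytes=2):
    """Profiler RAM needed for one counter per bin (rounding up)."""
    n_bins = -(-text_size // bin_size)  # ceiling division
    return n_bins * counter_bytes


class SamplingProfiler:
    def __init__(self, low_pc, high_pc, bin_size):
        self.low_pc, self.high_pc, self.bin_size = low_pc, high_pc, bin_size
        n_bins = -(-(high_pc - low_pc) // bin_size)
        self.bins = [0] * n_bins

    def on_timer_interrupt(self, pc):
        """Called at the sampling frequency with the interrupted program counter."""
        if self.low_pc <= pc < self.high_pc:
            self.bins[(pc - self.low_pc) // self.bin_size] += 1

    def samples_in(self, start, end):
        """Samples attributed to the address range [start, end)."""
        first = (start - self.low_pc) // self.bin_size
        last = (end - 1 - self.low_pc) // self.bin_size
        return sum(self.bins[first:last + 1])


if __name__ == "__main__":
    prof = SamplingProfiler(low_pc=0x1000, high_pc=0x2000, bin_size=4)
    # Pretend the timer caught the CPU three times in one function, once in another.
    for pc in (0x1010, 0x1014, 0x1018, 0x1800):
        prof.on_timer_interrupt(pc)
    print(prof.samples_in(0x1000, 0x1100))   # samples in the first function
    print(histogram_bytes(0x1000, 4), histogram_bytes(0x1000, 16))
```

In this model, smaller bins separate neighboring code more precisely but need more profiler RAM, while larger bins save RAM at the cost of blurring adjacent functions together.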
Viewing Profiling Reports: Launching gprof
Double-click gmon.out to launch gprof
Point to the executable ELF; it is usually selected by default
The gprof report launches
The report toolbar controls report options and view capabilities
– Sort samples per file
– Sort samples per function
– Sort samples per line
– Display function call graph
– Switch sample/time
Profiled Output in SDK
[Figure: gprof views in SDK. 1: Sort Samples per File, 2: Sort Samples per Function, 3: Sort Samples per Line, 4: Display Function Call Graph]
Profiling Report Options
The gprof report options allow flexible report views and export
1. Show/hide columns
2. Export to CSV
3. Sorting
4. Switch between time and samples
Task Implementation Decision
Keep it in software
– Not in critical path
– Enough "free" cycles
– Easier to code in software than in hardware
• Uses math library functions
– NEON co-processor
• Supports integer vector operations
• Single-precision floating-point operations
Move to hardware
– Programmable logic co-processor
• Customized to user's needs
• Excellent for iterative and pipelined processing
– Add soft core processor in PL
• Both Cortex-A9 and MicroBlaze processors can co-exist in the AP SoC
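One way to weigh "keep it in software" against "move to hardware" is an Amdahl's-law estimate fed by the profiling results. The sketch below is a rough illustration: the run-time fractions, the hardware speedup, and the overhead term are made-up inputs, not measurements from the course material.

```python
"""Amdahl's-law estimate for moving one profiled task into programmable logic.

The fractions and speedups below are made-up inputs, not measurements.
"""


def overall_speedup(fraction, task_speedup, overhead_fraction=0.0):
    """Speedup of the whole application.

    fraction          -- share of run time spent in the task (from the profile)
    task_speedup      -- how much faster the task runs in hardware
    overhead_fraction -- extra time for data transfer and synchronization,
                         as a share of the original run time
    """
    new_time = (1.0 - fraction) + fraction / task_speedup + overhead_fraction
    return 1.0 / new_time


if __name__ == "__main__":
    # A filter that takes 80% of the run time and is 10x faster in hardware.
    print(round(overall_speedup(0.80, 10.0), 2))
    # The same accelerator applied to a task that takes only 5% of the run time.
    print(round(overall_speedup(0.05, 10.0), 2))
```

With these inputs the first case speeds the application up by roughly 3.6x, while the second barely changes it, which matches the slide's advice to leave tasks that are not in the critical path in software.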
Task Implementation Decision: FIR Filters
[Figures: FIR filter example shown over three slides; the last one is captioned "Profiling the application with the hardware IP"]
AXI is Part of ARM’s AMBA
AMBA: Advanced Microcontroller Bus Architecture
AXI: Advanced Extensible Interface
[Figure: AMBA generations ordered from older to newer by performance, with AMBA 3.0 (2003) marked]
AXI is Part of AMBA
[Figure: AMBA 3.0 (2003) and AMBA 4.0 (2010); same spec, with enhancements for FPGAs]

Interface                  Features                                                           Similar to
Memory Map / Full (AXI4)   Traditional address/data burst (single address, multiple data)     PLBv46, PCI
Streaming (AXI4-Stream)    Data-only, burst                                                   Local Link / DSP interfaces / FIFO / FSL
Lite (AXI4-Lite)           Traditional address/data, no burst (single address, single data)   PLBv46-single, OPB
AXI Interconnect
AXI is an interconnect system used to tie processors to peripherals
– AXI Full memory map: full-performance bursting interconnect
– AXI Lite: lower-performance non-bursting interconnect (saves programmable logic resources)
– AXI Streaming: non-addressed packet-based or raw interface
Basic AXI Signaling – 5 Channels
1. Read Address Channel
2. Read Data Channel
3. Write Address Channel
4. Write Data Channel
5. Write Response Channel
All AXI Channels Use a Basic “VALID/READY” Handshake
SOURCE asserts and holds VALID when DATA is available
DESTINATION asserts READY if able to accept DATA
DATA is transferred when VALID and READY are both 1
SOURCE sends next DATA (if an actual data channel) or deasserts VALID
DESTINATION deasserts READY if no longer able to accept DATA
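The handshake rules above can be checked with a small cycle-by-cycle model of a single channel. This is a behavioral sketch, not RTL and not part of any Xilinx tool; the READY pattern and the data values are arbitrary test inputs.

```python
"""Cycle-by-cycle sketch of a VALID/READY handshake on one channel.

The READY pattern and the data values are arbitrary test inputs.
"""


def simulate(data, ready_pattern):
    """Return (cycle, value) for every beat the destination accepts.

    data          -- values the source wants to send, in order
    ready_pattern -- READY as driven by the destination on each clock cycle
    """
    transfers = []
    index = 0
    for cycle, ready in enumerate(ready_pattern):
        valid = index < len(data)      # source holds VALID while it has data
        if valid and ready:            # a transfer needs both signals at 1
            transfers.append((cycle, data[index]))
            index += 1                 # source presents the next value
        # If READY is 0, the source keeps VALID and DATA unchanged.
    return transfers


if __name__ == "__main__":
    ready = [False, True, True, False, True, True]
    for cycle, value in simulate(["A", "B", "C"], ready):
        print(cycle, value)
```

In this run the destination stalls on cycles 0 and 3, so the three values are accepted on cycles 1, 2, and 4, and nothing is transferred on cycle 5 because the source has deasserted VALID.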
The AXI Interface: AXI4-Lite
No burst
Data width 32 or 64 only
– Xilinx IP only supports 32 bits
Very small footprint
Bridging to AXI4 handled automatically by AXI_Interconnect (if needed)
[Figures: AXI4-Lite Read, AXI4-Lite Write]
The AXI Interface: AXI4
Sometimes called “Full AXI” or “AXI Memory Mapped”
– Not ARM-sanctioned names
Single address, multiple data
– Burst up to 256 data beats
Data width parameterizable
– Up to 1024 bits
[Figures: AXI4 Read, AXI4 Write]
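The burst limit and the data width together determine how many address phases a transfer needs. The following sketch counts bursts for an aligned transfer; it deliberately ignores address alignment and the AXI rule that a burst must not cross a 4 KB boundary, both of which can add bursts in a real master. The transfer sizes are arbitrary examples.

```python
"""Sketch: how many AXI4 bursts a transfer needs, given the 256-beat limit.

Ignores address alignment and the rule that a burst must not cross a 4 KB
boundary; both can add bursts in a real master.
"""

MAX_BEATS_AXI4 = 256  # maximum data beats per burst, as stated on the slide


def plan_bursts(total_bytes, data_width_bits, max_beats=MAX_BEATS_AXI4):
    """Return the list of burst lengths (in beats) for an aligned transfer."""
    bytes_per_beat = data_width_bits // 8
    beats = -(-total_bytes // bytes_per_beat)  # ceiling division
    bursts = []
    while beats > 0:
        length = min(beats, max_beats)
        bursts.append(length)
        beats -= length
    return bursts


if __name__ == "__main__":
    # 4096 bytes over a 32-bit bus: 1024 beats, so four full bursts.
    print(plan_bursts(4096, 32))
    # The same payload over a 64-bit bus needs half as many beats.
    print(plan_bursts(4096, 64))
    # AXI4-Lite behaves like max_beats=1: one address per data word.
    print(len(plan_bursts(64, 32, max_beats=1)))
```

Each entry in the returned list corresponds to one address handshake followed by that many data beats, which is why the non-bursting AXI4-Lite case pays one address phase for every word.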
The AXI Interface: AXI4-Stream
No address channel, no read and write, always just master to slave
– Effectively an AXI4 “write data” channel
Unlimited burst length
– AXI4 max 256
– AXI4-Lite does not burst
Virtually the same signaling as AXI data channels
– Protocol allows merging, packing, width conversion
– Supports sparse, continuous, aligned, unaligned streams
[Figure: AXI4-Stream Transfer]
Summary
Xilinx provides CAD tools in the form of EDK/ISE/Vivado to implement soft cores and hard cores and to manage the whole hardware/software development process.
The soft cores, in the form of MicroBlaze, and the hard cores, in the form of ARM processors, enable hardware/software co-design: sequential code runs on the processor, and bottlenecks run on a dedicated hardware accelerator attached to the soft or hard core.
Co-design: Tools
Co-design tools should provide an almost automatic framework for producing a balanced and optimized design from an initial high-level specification.
However, the goal of co-design tools and platforms is not to push towards total automation.
Designer interaction and continuous feedback are considered essential.
The main goal is to build into the black box of co-design tools support for shifting functionality and implementation between HW and SW, with effective and efficient evaluation.
H/S Co-Design: Approaches
Opposite strategies
Vulcan (“primal” approach)
– Functionality all in HW (HardwareC) initially
– Move some to CPU to reduce architecture cost
Cosyma (“dual” approach)
– Functionality all in SW (Cx) initially
– Move some to ASIC to meet performance goals
Lycos
– Convert all functionality to neutral form
Partitioning Algorithms
Assume everything is initially in software
Select a task for swapping
Migrate it to hardware and evaluate the cost
– Timing, hardware resources, program and data storage, synchronization overhead
Cost evaluation and move evaluation are similar to what we have seen for the min-cut FM algorithm.
[Figure: a task moving between the software list of tasks and the hardware list of tasks]
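The select, migrate, and evaluate loop can be sketched as a simple greedy partitioner. This is a deliberately simplified illustration, not the FM algorithm and not any of the tools named in these slides: the task names, times, and areas are invented, and the cost model (sum of task times, a fixed synchronization overhead per hardware task, and a single area budget) is an assumption made for the example.

```python
"""Greedy hardware/software partitioning sketch.

Task names, times, and areas are invented. The cost model (sum of task times,
a fixed per-task synchronization overhead, and an area budget) is a
simplification chosen for illustration.
"""
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    sw_time: float   # execution time in software
    hw_time: float   # execution time in hardware
    hw_area: int     # hardware resources needed


def total_time(tasks, in_hw, sync_overhead):
    """Run time of a partition: each hardware task also pays sync_overhead."""
    return sum((t.hw_time + sync_overhead) if t.name in in_hw else t.sw_time
               for t in tasks)


def partition(tasks, area_budget, sync_overhead=1.0):
    """Start all-software; repeatedly move the task with the best gain."""
    in_hw, used_area = set(), 0
    while True:
        best, best_gain = None, 0.0
        for t in tasks:
            if t.name in in_hw or used_area + t.hw_area > area_budget:
                continue
            gain = t.sw_time - (t.hw_time + sync_overhead)
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:          # no remaining move improves the cost: stop
            return in_hw
        in_hw.add(best.name)
        used_area += best.hw_area


if __name__ == "__main__":
    tasks = [Task("fir", 50.0, 5.0, 400),
             Task("fft", 30.0, 6.0, 700),
             Task("ui", 4.0, 3.5, 100)]
    hw = partition(tasks, area_budget=1000)
    print(sorted(hw), total_time(tasks, hw, 1.0))
```

Unlike this greedy loop, an FM-style pass also locks each task after it moves and may accept temporarily worse moves before keeping the best intermediate partition, which helps it escape local minima.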
Automation
Compiler profiler determines dependence and rough performance estimates
Result of compilation is synthesizable HDL and assembly code for the processor
Soft and Hard cores in current FPGAs
[Figure: system block diagram with power supply, clocks (CLK), custom IF logic, SDRAM, SRAM, memory controller, UART, LC, display controller, interrupt controller, timer, audio codec, CPU (uP / DSP), co-processor, GP I/O, address decode unit, and Ethernet MAC]
Next Step...
[Figure: the system block diagram with an FPGA; labeled blocks are power supply, clocks (CLK), custom IF logic, SDRAM, SRAM, memory controller, UART, LC, display controller, interrupt controller, timer, audio codec, CPU (uP / DSP), co-processor, GP I/O, address decode unit, and Ethernet MAC]
Configurable System on a Chip (CSoC)
[Figure: block diagram labeled power supply, SDRAM, SRAM, LC, audio codec, and EPROM]
Soft CPU Core: “MicroBlaze” (Xilinx Inc.)
MicroBlaze: Bus Configurations
[Figure: six bus configurations of the MicroBlaze core, numbered 1 to 6]
• LMB: Memory Controller (BRAMs)
• OPB: Ext. Memory Ctrl., Interrupt Ctrl., UART, Timer, Watchdog, SPI, JTAG-UART, etc.
Recommended