Distribution Statement “A” (Approved for Public Release, Distribution Unlimited)
Electronics Resurgence Initiative: Architectures
Tom Rondeau
Microsystems Technology Office
DSSoC Proposer’s Day
09/18/2017
Who am I?
• Previous project lead for GNU Radio
• Researcher then Adjunct with IDA’s Center for Communication Research
• Researcher at UPenn
Can we have both programmability and specialization?
Programmability
• Productivity has come at the cost of compute efficiency
• Abstraction tends to ignore the underlying hardware
Specialization
• Performance has come at the cost of usability
• Difficulty in programming and system integration
[Figure: Matrix Multiply (ISAT 2012 study). Energy efficiency (MOP/mW) for not programmable, less programmable, and more programmable devices, with the goal marked.]
ERI Page 3: Architectures
Build new processors that solve the significant computing needs of today’s and tomorrow’s applications.
1: Domain Specific System on Chip (DSSoC)
Streaming Data is latency sensitive, small but many workloads
2: Software Defined Hardware (SDH)
Big Data is efficiency sensitive, large and repeatable workloads
Streaming Data
Latency sensitive, small but many workloads
• Economic challenges specific to the DoD
  • Market size
  • Uniqueness of problems
  • Breadth of important but small problems
Difficult to support cost-effective solutions
• Technical challenges
  • Programmability
  • Integration of single-application products
  • Integration/test
Needs evolve faster than we can develop solutions
National security issues and impact
The DoD cannot afford the limited time and high cost of the programmer heroes.
APG-77 AESA Software Summary
http://washingtoniceaa.com/files/presentations/SOFTWARE_MAINTENANCE_O&M_COST.pdf
Converged Collaborative Elements for RF Task Operations (CONCERTO) [Ted Woodward]
UNCLASSIFIED//FOUO
CONCERTO Vision: RF Convergence
Today’s Constrained Systems (one separate stack per function):
• Manager (Radar): RF Front End; Antennas / Apertures / Airframe; Modes on Digital Processor
• Manager (EW): RF Front End; Antennas / Apertures / Airframe; Modes on Digital Processor
• Manager (Comms): RF Front End; Antennas / Apertures / Airframe; Modes on Digital Processors
CONCERTO converged system:
• System and Sensor Resource Manager
• RF Modes
• Heterogeneous Processor
• Unified RF Front End
• Adaptable Aperture integrated w/ Airframe
Abstract
• OBJECTIVE: Develop converged RF system with radar, electronic warfare, and communications modes to enable new approaches to tactical RF missions
• Three Phases
  • Phase 1 (current phase): study missions, establish subsystem technical readiness, create new RF systems architecture
  • Phase 2: design prototype RF system
  • Phase 3: build and demonstrate the prototype RF system in a flight test
UAS: Unmanned Aircraft System
End of program outputs
• More capability on smaller UAS hosts
• RF virtual machine supports portable RF modes
• Intelligent System and Sensor Resource Manager
• Unified, scalable design showing new behavior: maneuver dynamically in spectrum, time, and space
CONCERTO Phase 1 Challenges
TA-1 – Converged RF front end and aperture
Challenges: (1) achieve useful multi-function RF performance on Group 3 UAS; (2) integration of front end + aperture + UAS
TA-2 – RF virtual machine
Challenges: (1) achieve computationally efficient hardware-agnostic mode implementation; (2) meet agility, flexibility, and adaptability goals
TA-3 – System and sensor resource manager (SSRM)
Challenge: resolve competition for common resource to achieve mission success across diverse objectives
TA-4 – System architecture and integration
Challenges: (1) mission analysis to quantify performance needs and mission impact; (2) create viable architecture for converged payload
Domain Specific System on Chip (DSSoC)
For streaming data close to the sensor
DSSoC program will…
• Create a development ecosystem that takes advantage of the specialized hardware with no added burden to the programmer
• Design an intelligent scheduler for efficient data movement between DSSoC processor elements
• Build a DSSoC of advanced, heterogeneous processors and accelerators for software radio
[Figure: Examples of Processor Elements (PE): graphics processors, neuromorphic, accelerator, digital signal processor, memory, general purpose.]
DSSoC will enable rapid development of multi-application systems through a single programmable device
DSSoC will rethink the software/hardware development stack
[Figure: two software/hardware stacks side by side.]
Today’s Programming Environment (decoupled performance analysis; layers split between Computer Science and EE):
• Application
• Compiler
• Development Environment and Programming Languages
• Libraries
• Linker and Assembler
• Operating System
• Memory Management, Interconnects
• Computer system architectures & component tech.
DSSoC’s Full-Stack Integration (integrated performance analysis):
• Application
• Development Environment and Programming Languages
• Libraries
• Operating System
• Compiler, linker, assembler
• Intelligent scheduling
• Heterogeneous architecture composed of Processor Elements
• Medium Access Control
• Domain Ontology
Example PEs:
• CPUs
• Graphics processing units
• Tensor product units
• Neuromorphic units
• Accelerators (e.g., FFT)
• DSPs
• Programmable logic
• Math accelerators
At the core of DSSoC is intelligent resource allocation
• Design-time resource management
  • Type of PEs
  • Number of each type of PE
  • Distribution of PEs across the SoC
• Run-time
  • Online updates to PE utilization
  • Support multiple, simultaneously running applications
• Compile-time
  • Static optimization
Program goals
Design-time Optimization
On computable numbers
[Figure: Computable Numbers shown as a subset of All numbers; Domain0 and Domain1 each group several applications (App0, App1, App3, App4, App5, App6).]
How do we scope a domain?
• Is the domain large enough to justify a market?
• Do the problems in the domain share enough similarity?
• Does the domain adequately group enough unique problems?
• 7 Dwarfs of High Performance Computing
  • Dense linear algebra
  • Sparse linear algebra
  • Spectral methods
  • N-Body methods
  • Structured grids
  • Unstructured grids
  • Monte Carlo
• Then there were 13
  • MapReduce (replaced Monte Carlo)
  • Combinatorial Logic
  • Graph Traversal
  • Dynamic Programming
  • Back-Track, Branch and Bound
  • Graphical Models
  • Finite State Machine
Mathematical genetics – motifs
• 7 Dwarfs of Symbolic Computing
  • Exact linear algebra, integer lattices
  • Exact polynomial and differential algebra, Gröbner bases
  • Inverse symbolic problems
  • Tarski’s algebraic theory of real geometry
  • Hybrid symbolic-numeric computation
  • Computation of closed form solutions
  • Rewrite rule systems and computational group theory
https://cdn.quizzclub.com/questions/2016-09/what-did-the-7-dwarfs-do-for-a-job-in-snow-white-and-the-seven-dwarfs-film.jpg
Can we be better about mapping these ideas to be more applicable to real compute problems?
Examples of DSSoC Processor Elements
New accelerators enable more efficient computing at the cost of added complexity to developers
Today’s SoCs are already complicated and difficult to program
Explosive growth of processor types and accelerators must be made useable.
[Figure: an SoC containing programmable logic, Mali400, ARM Cortex A53, and ARM Cortex R5.]
This will get worse
[Figure: CRAFT SoC. Elements shown: FFT accel., neuromorphic, graphics processors, accelerator, digital signal processor, memory, general purpose, and "???".]
• DSSoC will determine an ontology of co-processors for a domain
• How will they be programmed?
• Fourier Transforms are a big part of software radio
• But, is this truly a representative algorithm?
• How many do we need?
• Specialization vs. Hyper-specialization
• Location on chip – near or far? Distributed?
Example of an ontology member
Ontology tells us not just what kinds of processor elements (PE), but also how many and where they should be placed relative to other PEs.
• 1 x 1024: effectively free compute; costs no power; runs in zero time
• 8 x 2^N: cheap but not free; small power; runs quickly
• 1 x M: costs more; needed rarely; possibly just use a CPU
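The FFT ontology members above can be read as a size-based routing rule. The following is a minimal sketch of that reading, not a program deliverable: the PE names (`fft1024_accel`, `fft_pow2_accel`, `cpu`) are hypothetical, and it assumes the middle member covers power-of-two lengths.

```python
def is_power_of_two(n: int) -> bool:
    """Return True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def select_fft_pe(length: int) -> str:
    """Pick a (hypothetical) processor element for an FFT of this length."""
    if length <= 0:
        raise ValueError("FFT length must be positive")
    if length == 1024:
        # The most common size gets the dedicated, nearly free accelerator.
        return "fft1024_accel"
    if is_power_of_two(length):
        # Other power-of-two sizes share the configurable accelerators.
        return "fft_pow2_accel"
    # Rare, arbitrary lengths fall back to a general purpose core.
    return "cpu"
```

A real ontology would also carry counts and placement; this sketch only covers the "what kind of PE" question.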
Software radio is a representative and complex domain
Many applications in the software radio domain
• Spectrum Management
• Dynamic Spectrum Access
• Wireless Internet
• Satellite communications
• Internet of Things
• Radar, etc.
A common set of algorithms address many applications
• Fourier transforms
• Matrix operations
• Control loops
• Digital filters
• Transcendental functions
• Complex math
• Error correcting codes
Domains are a mathematical representation of a set of applications.
Run-time Optimization
Today we map streaming data algorithms to different processors by hand through large engineering efforts
[Figure: an upper-layer stack mapped onto CPU 0 and CPU 1 (cores 0 to 3 each), GPU 0 and GPU 1 with their memory, an FPGA, and an embedded GPP/DSP.]
• Mapping done by hand engineering
• Moving between processors is overhead
Decouple programmers from need to optimize for the underlying hardware.
Why did the TI Keystone II SoC fail for software radio?
http://www.ti.com/ds_dgm/images/fbd_sprs893e.gif
2-month effort to use the FFT accelerator
3 more months to use the Turbo Decoder
• Lots of accelerators, but nearly impossible to use
• Basically an LTE basestation ASIC
• Not useful for software radio
A GNU Radio application using these is a 30-minute exercise.
Software abstraction leads to inefficient uses of hardware
• All of these blocks perform Fourier Transforms
• Transforms are executed multiple times on the same data
Can we build hardware-level intelligence to recognize and optimize operations?
• Develop models of binaries and algorithms
• Predict next optimal processor element given conditions like:
  • Optimality for math/algorithm
  • Distance/latency to move data
  • Current utilization of element
  • Thermal, power, environmentals
  • Dynamics of multiple applications
• Example: infer next element based on:
  • Line 1: Minimizes distance
  • Line 2: Optimal accelerator
Develop intelligent scheduling to move data between processor elements.
[Figure: candidate paths 1 and 2 across graphics processors, neuromorphic, accelerator, digital signal processors, memory, and general purpose elements.]
The intelligent scheduler will enable efficient use of advanced sets of processor elements while maintaining program abstractions
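One way to read the scheduling conditions on this slide is as terms in a single cost estimate per candidate PE. The sketch below is only an illustration of that idea under simplifying assumptions: the `ProcessorElement` fields, the `ns_per_hop` constant, and the load model (execution time stretched by current utilization) are all hypothetical and not taken from the program.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProcessorElement:
    name: str
    exec_time_ns: float   # estimated kernel run time on an idle PE
    hops_from_data: int   # distance from where the data currently sits
    utilization: float    # fraction of the PE already busy, in [0, 1)


def estimated_cost_ns(pe: ProcessorElement, ns_per_hop: float = 5.0) -> float:
    """Estimate completion time: data movement plus load-stretched execution."""
    if not 0.0 <= pe.utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    transfer_ns = pe.hops_from_data * ns_per_hop
    loaded_exec_ns = pe.exec_time_ns / (1.0 - pe.utilization)
    return transfer_ns + loaded_exec_ns


def pick_next_pe(candidates: Sequence[ProcessorElement]) -> ProcessorElement:
    """Choose the candidate with the lowest estimated completion time."""
    if not candidates:
        raise ValueError("no candidate processor elements")
    return min(candidates, key=estimated_cost_ns)
```

With these made-up numbers, a distant but idle accelerator (20 ns, 8 hops) beats a nearby half-busy DSP (100 ns, 1 hop); if the accelerator becomes 90% utilized, the nearby DSP wins instead. That is the trade-off between "minimizes distance" and "optimal accelerator" in the slide's example.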
We tried this before, so why did the IBM Cell Broadband Engine fail?
GNU Radio and the IBM Cell Broadband Engine
• Nov 2005: Cell released
• Jan 2007: GNU Radio work on Cell began (two years to integrate)
• Dec 2008: GNU Radio Cell Scheduler and FFT ready (only 1 algorithm developed)
• Sept 2009: Intel Core i7 released (faster and easier to use)
• Nov 2009: Cell declared end of life (industry also noticed)
http://www.spiral.net/graphics/cell-be.gif
PPE: PowerPC Processing Element
SPE: Synergistic Processing Element
• With the Intel i7, we continued to benefit from scaling and worked with existing tools
• Easy to build, debug, analyze, use
Map characteristic microcode activity to DSSoC’s elements
• Develop models of binaries and algorithms
  • Image of a binary representation from Cyber Grand Challenge
  • LLVM Intermediate Representation
Programming Model
Binary Representation
System of Processing Elements
Understand the shape/projection of algorithms to map to processor.
Compile-time Optimization
To intelligently schedule at runtime, the algorithm must be compiled to any possible PE that may be tasked to run it
Develop workflows/tools to build the algorithm for many different PEs.
[Figure: Algorithm → Compiler → Object Code for PE0, Object Code for PE1, . . ., Object Code for PEN, with candidate paths 1 and 2 across graphics processors, neuromorphic, accelerator, digital signal processors, memory, and general purpose elements.]
• Sometimes efficiency is in how you write the algorithm
  • Implementation details can matter
  • Use the right approach for the job
  • E.g., Convolution vs. Fast Convolution
  • E.g., Cooley-Tukey, Winograd, Good-Thomas, etc.
• Domain-specific libraries
  • Support autotuning, hand-optimization
• Example Libraries:
  • VOLK (Vector-Optimized Library of Kernels)
    • Signal processing / 2-D vector math
    • Includes profiler to optimize to a processor
  • FFTW (Fastest Fourier Transform in the West)
    • Fourier transforms, optimized for general purpose processors
    • Wisdom file to test and store optimal implementations
  • BLAS (Basic Linear Algebra Subprograms)
    • Building blocks for linear algebra
    • Used in LINPACK benchmarks
    • Bundled into optimized libraries like ATLAS (Automatically Tuned Linear Algebra Software)
Software libraries provide access to pre-vetted and optimized code
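The "Convolution vs. Fast Convolution" point can be made concrete with a small standard-library sketch. It compares a direct time-domain convolution with an FFT-based one built on a textbook recursive radix-2 Cooley-Tukey transform. This is illustrative only; a production system would call an optimized library such as FFTW or VOLK rather than this code.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; length must be a power of two."""
    n = len(values)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return list(values)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def direct_convolution(x, h):
    """Time-domain linear convolution: O(N*M) multiply-adds."""
    if not x or not h:
        raise ValueError("inputs must not be empty")
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def fast_convolution(x, h):
    """FFT-based linear convolution: O(L log L) for padded length L."""
    if not x or not h:
        raise ValueError("inputs must not be empty")
    out_len = len(x) + len(h) - 1
    fft_len = 1 << (out_len - 1).bit_length()  # next power of two >= out_len
    xf = fft(list(x) + [0.0] * (fft_len - len(x)))
    hf = fft(list(h) + [0.0] * (fft_len - len(h)))
    product = [a * b for a, b in zip(xf, hf)]
    # The inverse transform above is unnormalized, so scale by 1/fft_len.
    y = fft(product, inverse=True)
    return [v.real / fft_len for v in y[:out_len]]
```

Both functions return the same result up to rounding; which one is cheaper depends on the filter and signal lengths, which is why the choice of approach is itself an optimization.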
[Figure: a VOLK kernel whose dispatcher selects among implementations: generic C code, SSE, SSE2, AVX, NEON (intrinsics), and NEON (assembly).]
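The dispatcher pattern in the VOLK figure can be sketched in a few lines. This is not VOLK's actual API: `KernelDispatcher`, `register`, `select`, and the feature strings are hypothetical names, and the "SIMD" entries reuse the generic body because Python has no intrinsics. The `profiled_best` argument stands in for the result a profiler might have stored.

```python
from typing import Callable, List, Optional, Set, Tuple

Kernel = Callable[[List[float], List[float]], List[float]]


def multiply_generic(a: List[float], b: List[float]) -> List[float]:
    """Portable fallback, standing in for the generic C implementation."""
    return [x * y for x, y in zip(a, b)]


class KernelDispatcher:
    """Chooses one implementation of a kernel for the current machine."""

    def __init__(self) -> None:
        # Entries stay in registration order: most specialized first.
        self._impls: List[Tuple[str, Optional[str], Kernel]] = []

    def register(self, name: str, kernel: Kernel,
                 requires: Optional[str] = None) -> None:
        """Add an implementation; `requires` names a CPU feature or None."""
        self._impls.append((name, requires, kernel))

    def select(self, features: Set[str],
               profiled_best: Optional[str] = None) -> Tuple[str, Kernel]:
        """Return the profiled winner if usable, else the first usable entry."""
        usable = [(name, kernel) for name, requires, kernel in self._impls
                  if requires is None or requires in features]
        if not usable:
            raise LookupError("no implementation runs on this machine")
        for name, kernel in usable:
            if name == profiled_best:
                return name, kernel
        return usable[0]


dispatcher = KernelDispatcher()
# All entries reuse the generic body here; real variants would differ.
dispatcher.register("avx", multiply_generic, requires="avx")
dispatcher.register("sse2", multiply_generic, requires="sse2")
dispatcher.register("generic", multiply_generic)
```

The caller only ever asks for "the multiply kernel"; the choice of implementation stays behind the dispatcher, which is the property DSSoC wants to extend from SIMD variants on one CPU to whole processor elements.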
• Performance Monitoring
  • Collect, measure, and perform statistics
  • Power and temperature
  • Cache hits and misses
  • Should include PE introspection
    • Internal performance counters
    • Size, resource, or other constraints on the type of computation or parameterization
• Debuggers
• Operating Systems
Other software tools
Tools and a developer ecosystem are required to successfully introduce new computing technology
Fix the disconnect between hardware and software through vertical integration
Building a development ecosystem
A key to programmability is the development ecosystem
But a chip that can’t be used, integrated, and programmed is called sand
Parallel Processors: Adapteva, Analog Devices-BlackFin, Altair, Altera, Ambric, AMD-APU, ARM-MP/Neon, ARM-Mali, Asocs, Aspex, AxisSemi, BOPS, Boston Circuits, Brightscale, Calxeda, Cavium, CEVA, Chameleon, Clearspeed, Cognimem, Cognivue, Cognovo, Coherent Logix, CoreSonic, CPUTech, Cradle, Cswitch, DesignArt, ElementCXI, EZChip, Freescale, Greenarrays, HP, IBM-Cell, IBM-Cyclopse, Icera-PowerVR, Imagination-PowerVR, Imec, Inmos-Transputer, Intel-TFLOPS, Intel-Larrabee, Intel-MIC, Intellasys, Intrinsity, IPFLex, Kalray, Mathstar, MobileEye, ModemArt, Morphics, Morpho, Movidius, NEC, Netlogic, Netronome, Nvidia, Octasic, PACT, Paneve, Picochip, Plurality, Quicksilver, Rapport, Raytheon-Monarch, Recore, Sandbridge, SiByte, SiCortex, Silicon Hive, Silicon Spice, Singular Computing, Sound Design, SpiralGateway, Stream Processors, Stretch, Tabula, Thinking Machines, TI, Tilera, TOPS, Venray, Xelerated, Xilinx, XMOS, Ziilabs
This list of processors suggests that solutions exist. So why are we here?
http://www.adapteva.com/andreas-blog/the-siren-song-of-parallel-computing/
Open source development allows community investment and improvement to the ecosystem for the most robust solution.
Open Source Initiative: https://opensource.org
Benefits of a rich development ecosystem
https://www.gitbook.com/book/tra38/essential-copying-and-pasting-from-stack-overflow/details
Program Details
Five program areas
1. Intelligent scheduling
  • Manage the set of domain resources
  • For multiple, simultaneously running applications
2. Software tools
  • Enables a development ecosystem
  • Exercise the full capability and make a highly programmable system
3. Domain representations
  • Build a domain ontology for PE selection
4. Medium access control (MAC)
  • Interconnect the PEs
  • Maximize data throughput, taking into consideration latency, power, and other domain constraints
5. Hardware integration
  • Fabricate a DSSoC with the right set of PEs on the MAC layer
  • Show applications and the software tools running with the intelligent scheduler
Program goals
It’s not just the processor: vector multiply on GPU over CPU
• GPUs do better at computing convolutions (dense matrix multiplies)
• Cost of data transfer means sometimes the CPU is more efficient
• Resource optimization for multiple applications
[Figure: Vector Multiply (Result computed from A and B), with regions annotated "Data transfer overhead" and "Saturation of parallelism".]
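The data-transfer point can be illustrated with a simple first-order cost model. This is a sketch under stated assumptions, not measured data: the function names and every rate in the usage note are hypothetical, and the model ignores the saturation effect shown in the figure.

```python
from typing import Optional


def cpu_time_s(n_elements: int, cpu_rate: float) -> float:
    """Time to process n elements in place on the CPU (elements per second)."""
    return n_elements / cpu_rate


def gpu_time_s(n_elements: int, gpu_rate: float, bytes_per_element: float,
               link_bytes_per_s: float, launch_overhead_s: float) -> float:
    """Fixed launch cost, plus moving the data, plus computing on the GPU."""
    transfer_s = n_elements * bytes_per_element / link_bytes_per_s
    return launch_overhead_s + transfer_s + n_elements / gpu_rate


def break_even_elements(cpu_rate: float, gpu_rate: float,
                        bytes_per_element: float, link_bytes_per_s: float,
                        launch_overhead_s: float) -> Optional[float]:
    """Vector length above which the GPU wins, or None if it never does."""
    cpu_per_element = 1.0 / cpu_rate
    gpu_per_element = bytes_per_element / link_bytes_per_s + 1.0 / gpu_rate
    if gpu_per_element >= cpu_per_element:
        return None  # transfer cost alone exceeds the CPU compute cost
    return launch_overhead_s / (cpu_per_element - gpu_per_element)
```

For example, with an illustrative CPU at 1e9 elements/s, a GPU at 1e10 elements/s, 12 bytes moved per element over a 1.6e10 B/s link, and a 10 µs launch cost, the break-even length is roughly 67,000 elements; shorter vectors are cheaper to leave on the CPU even though the GPU computes ten times faster.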
• CHIPS program is investigating physical interface standards
• DSSoC program is investigating programming interface standards
• DSSoC will need:
  • A well-defined, standardized interface
  • A medium access control (MAC) structure
    • Global bus?
    • Network on Chip (NoC)?
    • Globally Asynchronous, Locally Synchronous (GALS)?
    • Crossbar? Mesh?
  • Efficient
    • Low power, low latency
  • Extensible
    • Easily add new PEs to plug in
  • Programming interface to MAC
    • Address to any PE in SoC
    • Common data structure definition and handling
    • Scheduler hooks
    • Monitoring hooks
Medium Access Control – how to move data around the SoC
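To make the "programming interface to MAC" bullets more tangible, here is a hypothetical software model of such an interface. Nothing in it is defined by the program: `Frame`, `MacInterface`, and their methods are invented for illustration, and a real MAC would be hardware with a far richer contract. A scheduler hook could be attached in the same way as the monitoring hook shown here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Frame:
    """Common data structure carried between PEs (fields are illustrative)."""
    source: int
    destination: int
    payload: bytes


class MacInterface:
    """Toy model: address any attached PE and observe traffic through hooks."""

    def __init__(self) -> None:
        self._receivers: Dict[int, Callable[[Frame], None]] = {}
        self._monitor_hooks: List[Callable[[Frame], None]] = []

    def attach_pe(self, address: int, receiver: Callable[[Frame], None]) -> None:
        """Plug a new PE in at a unique address."""
        if address in self._receivers:
            raise ValueError(f"address {address} is already in use")
        self._receivers[address] = receiver

    def add_monitor_hook(self, hook: Callable[[Frame], None]) -> None:
        """Register a callback that sees every frame before delivery."""
        self._monitor_hooks.append(hook)

    def send(self, frame: Frame) -> None:
        """Deliver a frame to the PE at its destination address."""
        if frame.destination not in self._receivers:
            raise KeyError(f"no PE attached at address {frame.destination}")
        for hook in self._monitor_hooks:
            hook(frame)
        self._receivers[frame.destination](frame)
```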
CHIPS interface standard
CHIPS Program Interface Standard Metrics
• Data rate: 10 Gbps
• Energy efficiency: < 1 pJ/bit
• Latency: < 5 ns
• Bandwidth density: > 1000 Gbps/mm
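As a quick sanity check on what these targets imply, the link power follows from multiplying data rate by energy per bit. The calculation below assumes the 10 Gbps figure is a per-lane rate, which the slide does not state explicitly.

```latex
\begin{align*}
P_{\text{lane}} &< R \cdot E
  = \left(10 \times 10^{9}\ \tfrac{\text{bit}}{\text{s}}\right)
    \left(1 \times 10^{-12}\ \tfrac{\text{J}}{\text{bit}}\right)
  = 10\ \text{mW} \\
\text{lanes per mm} &> \frac{1000\ \text{Gbps/mm}}{10\ \text{Gbps}} = 100
\end{align*}
```

Under that assumption, meeting the bandwidth-density target takes more than 100 lanes per millimeter of interface edge, each dissipating under 10 mW.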
[Figure: Bandwidth / Energy per bit, in (Gbps/mm) / (pJ/bit), versus Interconnect Distance (mm), on logarithmic axes, with the CHIPS target marked. Plotted references: JSSC2016 - Dehlaghi, single-ended, Al on Si; JSSC2013 - Poulton, ground-ref. single-ended, organic PCB; JSSC2012 - Dickson, differential, Cu on Si; JSSC2013 - Mansuri, differential, Twinax ribbon cable; ECTC2016 - Mahajan, EMIB; 14nm SERDES, PCB; 14nm HBM. Source: Northrop Grumman.]
• Benchmarking to be done against versions of DSSoC developed throughout the program
• Phase 0: State-of-the-art commercial SoC
  • v0 of intelligent scheduler running on a commercial SoC
  • Will have limited number of “PEs”
  • Ex. http://www.hsafoundation.com
• Phase 1: Emulation of DSSoC on discrete hardware
  • v1 of intelligent scheduler running on DSSoC emulation
• Phase 2: DSSoC0
  • v2 of intelligent scheduler running on first spin of DSSoC hardware
  • Results will inform the second spin of DSSoC
  • Program schedule enforces a tight timeline for hardware updates
• Phase 3: DSSoC1
  • v3 of intelligent scheduler running on second spin of DSSoC hardware
  • 5 simultaneously running applications
DSSoC details
DSSoC program timeline
[Figure: program timeline over months 0 to 48, divided into phase 0 through phase 3. Tracks: Software Tools; DSSoC Hardware & Accelerators; Intelligent Scheduler; Application Development; MAC Interface Definition; Ontology. Scheduler versions v0, v1, v2, and v3 are tested on the current (available) SoC, the CDR emulation on discrete HW, DSSoC0, and DSSoC1. Annotations: continuous development and support; co-design and injection of improved software and schedulers; Hypothesis H0 and Test H0; Hypothesis H1 and Test H1; application counts of ≥1, ≥1, ≥2, and ≥5 RF applications.]
Program metrics
Metric | Phase 1 | Phase 2 | Phase 3
Chip & Scheduler
Number of simultaneous apps | ≥2 | ≥2 | ≥5
Integration time for new accelerators (1) | - | ≤3 months | ≤3 months
Power savings relative to previous phase | - | ≤80% (2) | ≤80% (3)
Utilization of PEs (4) | - | ≥80% | ≥90%
Max. time per scheduler decision | ≤500 ns | ≤50 ns | ≤5 ns
MAC
Latency (PE to PE) | ≤500 ns | ≤50 ns | ≤5 ns
Throughput (PE to PE) | ≥25 Gbps | ≥50 Gbps | ≥100 Gbps
Power | ≤50% of chip | ≤40% of chip | ≤20% of chip
1. Three months to integrate new accelerators into DSSoC; enforced by program timeline.
2. Compare the intelligent scheduler on DSSoC0 to the intelligent scheduler controlling the commercial SoC from phase 0.
3. Compare the intelligent scheduler on DSSoC1 to the intelligent scheduler on DSSoC0.
4. Ontology explains the required PEs and utilization; measure average utilization over developed apps.
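The metrics that have a target in every phase can be written down as data and checked mechanically. The sketch below is only an illustration of how a team might track them; the metric keys and the `unmet_targets` helper are hypothetical, while the thresholds are copied from the program metrics above.

```python
import operator

# (metric key, comparison, thresholds for phases 1, 2, 3)
TARGETS = [
    ("simultaneous_apps", operator.ge, (2, 2, 5)),
    ("scheduler_decision_ns", operator.le, (500, 50, 5)),
    ("mac_latency_ns", operator.le, (500, 50, 5)),
    ("mac_throughput_gbps", operator.ge, (25, 50, 100)),
    ("mac_power_fraction", operator.le, (0.50, 0.40, 0.20)),
]


def unmet_targets(phase: int, measured: dict) -> list:
    """Return the keys of metrics that miss the target for the given phase."""
    if phase not in (1, 2, 3):
        raise ValueError("phase must be 1, 2, or 3")
    failures = []
    for key, compare, thresholds in TARGETS:
        if not compare(measured[key], thresholds[phase - 1]):
            failures.append(key)
    return failures
```

A measurement set that passes phase 2 (for instance 2 apps, 40 ns decisions, 45 ns latency, 60 Gbps, 35% MAC power) would miss every one of these phase 3 targets, which shows how steep the phase-to-phase tightening is.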
• Tools and a developer ecosystem are required to successfully introduce new computing technology
• This is core to DSSoC
  • HW/SW Co-design
  • Teaming
  • Responsive to the full program – not split into TAs
    1. Intelligent scheduling
    2. Software
    3. Domain representations
    4. Medium access control (MAC)
    5. Hardware integration
• Looking for actual chip prototypes
Wrap-up
Optimizing at: design time, run time, compile time
www.darpa.mil