Electronics Resurgence Initiative: Architectures · 2017-10-02 · Distribution Statement “A” (Approved for Public Release, Distribution Unlimited)

Distribution Statement “A” (Approved for Public Release, Distribution Unlimited)

Electronics Resurgence Initiative: Architectures

Tom Rondeau
Microsystems Technology Office

DSSoC Proposer’s Day

09/18/2017


Who am I?

• Previous project lead for GNU Radio
• Researcher then Adjunct with IDA’s Center for Communication Research
• Researcher at UPenn


Can we have both programmability and specialization?

Matrix Multiply (ISAT 2012 study)

Programmability
• Productivity has come at the cost of compute efficiency
• Abstraction tends to ignore the underlying hardware

Specialization
• Performance has come at the cost of usability
• Difficulty in programming and system integration

[Figure: energy efficiency (MOP/mW) of not programmable, less programmable, and more programmable hardware, with the goal marked]


ERI Page 3: Architectures

Build new processors that solve the significant computing needs of today’s and tomorrow’s applications.

1: Domain Specific System on Chip (DSSoC)
Streaming data is latency sensitive, with small but many workloads

2: Software Defined Hardware (SDH)
Big data is efficiency sensitive, with large and repeatable workloads



Streaming Data
Latency sensitive, small but many workloads


• Economic challenges specific to the DoD
  • Market size
  • Uniqueness of problems
  • Breadth of important but small problems

Difficult to support cost-effective solutions

• Technical challenges
  • Programmability
  • Integration of single-application products
  • Integration/test

Needs evolve faster than we can develop solutions

National security issues and impact

The DoD cannot afford the limited time and high cost of the programmer heroes.

APG-77 AESA Software Summary

http://washingtoniceaa.com/files/presentations/SOFTWARE_MAINTENANCE_O&M_COST.pdf

Converged Collaborative Elements for RF Task Operations (CONCERTO) [Ted Woodward]


UNCLASSIFIED//FOUO

CONCERTO Vision: RF Convergence

[Figure: Today’s Constrained Systems. Radar, EW, and Comms each have their own manager, RF front end, antennas/apertures/airframe, and modes on digital processors. The converged alternative shows a System and Sensor Resource Manager, RF modes, a heterogeneous processor, a unified RF front end, and an adaptable aperture integrated with the airframe.]

Abstract

• OBJECTIVE: Develop converged RF system with radar, electronic warfare, and communications modes to enable new approaches to tactical RF missions

• Three Phases
  • Phase 1 (current phase): study missions, establish subsystem technical readiness, create new RF systems architecture
  • Phase 2: design prototype RF system
  • Phase 3: build and demonstrate the prototype RF system in a flight test

UAS: Unmanned Aircraft System

End of program outputs
• More capability on smaller UAS hosts
• RF virtual machine supports portable RF modes
• Intelligent System and Sensor Resource Manager
• Unified, scalable design showing new behavior: maneuver dynamically in spectrum, time, and space

CONCERTO Phase 1 Challenges

TA-1 – Converged RF front end and aperture
Challenges: (1) achieve useful multi-function RF performance on Group 3 UAS; (2) integration of front end + aperture + UAS

TA-2 – RF virtual machine
Challenges: (1) achieve computationally efficient hardware-agnostic mode implementation; (2) meet agility, flexibility, and adaptability goals

TA-3 – System and sensor resource manager (SSRM)
Challenge: resolve competition for common resource to achieve mission success across diverse objectives

TA-4 – System architecture and integration
Challenges: (1) mission analysis to quantify performance needs and mission impact; (2) create viable architecture for converged payload



Domain Specific System on Chip (DSSoC)
For streaming data close to the sensor

• Create a development ecosystem that takes advantage of the specialized hardware with no added burden to the programmer

• Design an intelligent scheduler for efficient data movement between DSSoC processor elements

• Build a DSSoC of advanced, heterogeneous processors and accelerators for software radio

DSSoC program will…

[Figure: processor elements on a single device: graphics processors, neuromorphic, accelerator, digital signal processor, memory, general purpose]


DSSoC will enable rapid development of multi-application systems through a single programmable device

Examples of Processor Elements (PE)


DSSoC will rethink the software/hardware development stack

[Figure: Today’s Programming Environment compared with DSSoC’s Full-Stack Integration]

Today’s Programming Environment (decoupled performance analysis, spanning Computer Science and EE)
• Application
• Compiler
• Development Environment and Programming Languages
• Libraries
• Linker and Assembler
• Operating System
• Memory Management, Interconnects
• Computer system architectures & component tech.

DSSoC’s Full-Stack Integration (integrated performance analysis)
• Application
• Libraries
• Operating System
• Compiler, linker, assembler
• Intelligent scheduling
• Heterogeneous architecture composed of Processor Elements
• Medium Access Control
• Domain Ontology

Example PEs:
• CPUs
• Graphics processing units
• Tensor product units
• Neuromorphic units
• Accelerators (e.g., FFT)
• DSPs
• Programmable logic
• Math accelerators


At the core of DSSoC is intelligent resource allocation
• Design-time resource management
  • Type of PEs
  • Number of each type of PE
  • Distribution of PEs across the SoC
• Run-time
  • Online updates to PE utilization
  • Support multiple, simultaneously running applications
• Compile-time
  • Static optimization

Program goals



Design-time Optimization


On computable numbers

[Figure: nested sets. Computable numbers within all numbers; Domain0 containing App0, App1, and App3; Domain1 containing App4, App5, and App6]

How do we scope a domain?
• Is the domain large enough to justify a market?
• Do the problems in the domain share enough similarity?
• Does the domain adequately group enough unique problems?


• 7 Dwarfs of High Performance Computing
  • Dense linear algebra
  • Sparse linear algebra
  • Spectral methods
  • N-Body methods
  • Structured grids
  • Unstructured grids
  • Monte Carlo

• Then there were 13
  • MapReduce (replaced Monte Carlo)
  • Combinatorial Logic
  • Graph Traversal
  • Dynamic Programming
  • Back-Track, Branch and Bound
  • Graphical Models
  • Finite State Machine

Mathematical genetics – motifs

• 7 Dwarfs of Symbolic Computing
  • Exact linear algebra, integer lattices
  • Exact polynomial and differential algebra, Gröbner bases
  • Inverse symbolic problems
  • Tarski’s algebraic theory of real geometry
  • Hybrid symbolic-numeric computation
  • Computation of closed form solutions
  • Rewrite rule systems and computational group theory

https://cdn.quizzclub.com/questions/2016-09/what-did-the-7-dwarfs-do-for-a-job-in-snow-white-and-the-seven-dwarfs-film.jpg

Can we be better about mapping these ideas to be more applicable to real compute problems?

Examples of DSSoC Processor Elements


New accelerators enable more efficient computing at the cost of added complexity to developers

Today’s SoCs are already complicated and difficult to program

The explosive growth of processor types and accelerators must be made usable.

[Figure labels: programmable logic, Mali400, ARM Cortex A53, ARM Cortex R5; CRAFT SoC]

This will get worse

[Figure labels: FFT accelerator, neuromorphic, graphics processors, accelerator, digital signal processor, memory, general purpose, ???]

• DSSoC will determine an ontology of co-processors for a domain
• How will they be programmed?


• Fourier Transforms are a big part of software radio
• But, is this truly a representative algorithm?
• How many do we need?
• Specialization vs. Hyper-specialization
• Location on chip – near or far? Distributed?

Example of an ontology member

Ontology tells us not just what kinds of processor elements (PE), but also how many and where they should be placed relative to other PEs.

[Figure: candidate FFT processor elements labeled 1 x 1024, 8 x 2^N, and 1 x M]

• Effectively free compute: costs no power, runs in zero time
• Cheap but not free: small power, runs quickly
• Costs more: needed rarely, possibly just use a CPU
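The idea that an ontology records what kind of PE exists, how many, and when each is worth using can be sketched as a small lookup. This is only an illustration: the class `FftElement`, the three entries, their counts, the cost units, and the assumption that the middle tier handles power-of-two sizes are all hypothetical and not taken from the program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FftElement:
    """One entry in a hypothetical FFT ontology."""
    name: str
    count: int          # number of instances assumed to be on the SoC
    relative_cost: int  # made-up cost units, lower is cheaper


# Hypothetical entries loosely mirroring the three tiers on the slide.
FIXED_1024 = FftElement("fixed_1024", count=1, relative_cost=0)
POWER_OF_TWO = FftElement("power_of_two", count=8, relative_cost=1)
ARBITRARY = FftElement("arbitrary_size", count=1, relative_cost=10)


def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def choose_fft_element(size: int) -> FftElement:
    """Pick the cheapest assumed element able to run an FFT of this size."""
    if size <= 0:
        raise ValueError("FFT size must be positive")
    if size == 1024:
        return FIXED_1024
    if is_power_of_two(size):
        return POWER_OF_TWO
    return ARBITRARY
```

For example, `choose_fft_element(256)` selects the power-of-two tier, while a size of 1000 falls through to the expensive general-purpose entry, which is the case the slide suggests might simply run on a CPU.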


Software radio is a representative and complex domain

Many applications in the software radio domain
• Spectrum Management
• Dynamic Spectrum Access
• Wireless Internet
• Satellite communications
• Internet of Things
• Radar, etc.

A common set of algorithms address many applications
• Fourier transforms
• Matrix operations
• Control loops
• Digital filters
• Transcendental functions
• Complex math
• Error correcting codes

Upper-layer stack

Domains are a mathematical representation of a set of applications.
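One way to make "a domain is a set of applications sharing algorithms" concrete is with plain sets. The sketch below is an assumption-laden illustration: the mapping `APP_KERNELS` from applications to algorithms is invented for the example and is not data from the program.

```python
from collections import Counter

# Hypothetical mapping of applications to the algorithms they rely on.
APP_KERNELS = {
    "spectrum_management": {"fourier_transform", "digital_filter", "matrix_ops"},
    "satellite_comms": {"digital_filter", "control_loop",
                        "error_correcting_code", "fourier_transform"},
    "radar": {"fourier_transform", "matrix_ops", "digital_filter", "complex_math"},
}


def kernel_usage(app_kernels):
    """Count how many applications need each algorithm."""
    counts = Counter()
    for kernels in app_kernels.values():
        counts.update(kernels)
    return counts


def shared_kernels(app_kernels):
    """Algorithms needed by every application in the candidate domain."""
    sets = list(app_kernels.values())
    return set.intersection(*sets) if sets else set()


usage = kernel_usage(APP_KERNELS)
common = shared_kernels(APP_KERNELS)
```

Algorithms that appear in `common` are natural candidates for dedicated processor elements, while algorithms with a low count in `usage` may not justify silicon.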


Run-time Optimization


Today we map streaming data algorithms to different processors by hand through large engineering efforts

[Figure: an upper-layer stack mapped by hand across CPU 0 and CPU 1 (cores 0 to 3 each), GPU 0 and GPU 1 with their memory, an FPGA, and an embedded GPP/DSP]

Decouple programmers from need to optimize for the underlying hardware.

Mapping done by hand engineering
Moving between processors is overhead


Why did the TI Keystone II SoC fail for software radio?

http://www.ti.com/ds_dgm/images/fbd_sprs893e.gif

2 month effort to use the FFT accelerator

3 more months to use Turbo Decoder

• Lots of accelerators, but nearly impossible to use

• Basically an LTE basestation ASIC
• Not useful for software radio

A GNU Radio application using these is a 30-minute exercise.


Software abstraction leads to inefficient uses of hardware

Upper-layer stack

• All of these blocks perform Fourier Transforms
• Transforms are executed multiple times on the same data

Can we build hardware-level intelligence to recognize and optimize operations?
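The redundancy described above, several blocks transforming the same data, can be illustrated in software with a cache keyed on the input buffer. This is a minimal sketch of the idea only, not a description of any hardware mechanism; `TransformCache` and `naive_dft` are hypothetical names, and a slow textbook DFT stands in for a real FFT.

```python
import cmath


def naive_dft(samples):
    """Textbook O(n^2) discrete Fourier transform, used here as a stand-in."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


class TransformCache:
    """Recognize repeated transforms of identical data and compute them once."""

    def __init__(self):
        self._cache = {}
        self.computed = 0  # number of transforms actually executed

    def dft(self, samples):
        key = tuple(samples)
        if key not in self._cache:
            self._cache[key] = naive_dft(key)
            self.computed += 1
        return self._cache[key]


cache = TransformCache()
buffer = [1.0, 0.0, 0.0, 0.0]
first = cache.dft(buffer)   # computed
second = cache.dft(buffer)  # served from the cache
```

Two blocks asking for the transform of `buffer` cost one computation; a new buffer triggers a second one.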


• Develop models of binaries and algorithms
• Predict next optimal processor element given conditions like:
  • Optimality for math/algorithm
  • Distance/latency to move data
  • Current utilization of element
  • Thermal, power, environmentals
  • Dynamics of multiple applications

• Example: infer next element based on:
  • Line 1: Minimizes distance
  • Line 2: Optimal accelerator

Develop intelligent scheduling to move data between processor elements.
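The trade-off between the nearest element and the best accelerator can be sketched as a cost function over candidate PEs. Everything here is an assumption made for illustration: the `ProcessorElement` fields, the queueing penalty, the per-hop latency, and the numbers in the example are not program specifications.

```python
from dataclasses import dataclass


@dataclass
class ProcessorElement:
    name: str
    compute_ns: dict    # kernel name -> estimated run time on this PE
    hops: int           # distance from where the data currently sits
    utilization: float  # fraction of the PE already in use, 0.0 to 1.0


def estimated_cost(pe, kernel, ns_per_hop=10.0):
    """Estimated time to move the data to a PE and run the kernel there."""
    if kernel not in pe.compute_ns:
        return float("inf")  # this PE cannot run the kernel at all
    busy = min(pe.utilization, 0.99)
    # Crude queueing penalty: a busier PE takes longer to get to the work.
    return pe.compute_ns[kernel] / (1.0 - busy) + pe.hops * ns_per_hop


def pick_next(pes, kernel, ns_per_hop=10.0):
    """Choose the PE with the lowest estimated cost for this kernel."""
    return min(pes, key=lambda pe: estimated_cost(pe, kernel, ns_per_hop))


nearby_dsp = ProcessorElement("dsp", {"fft": 400.0}, hops=1, utilization=0.2)
far_accel = ProcessorElement("fft_accel", {"fft": 50.0}, hops=6, utilization=0.5)
candidates = [nearby_dsp, far_accel]
```

With cheap data movement, `pick_next(candidates, "fft")` favors the distant accelerator; raising `ns_per_hop` to 100 flips the decision to the nearby DSP, which mirrors the two lines in the example.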

[Figure: two candidate data paths (lines 1 and 2) between graphics processors, neuromorphic, accelerator, and digital signal processor elements]



The intelligent scheduler will enable efficient use of advanced sets of processor elements while maintaining program abstractions


We tried this before, so why did the IBM Cell Broadband Engine fail?

GNU Radio and the IBM Cell Broadband Engine
• Nov 2005: Cell released
• Jan 2007: GNU Radio work on Cell began (two years to integrate)
• Dec 2008: GNU Radio Cell scheduler and FFT ready (only 1 algorithm developed)
• Sept 2009: Intel Core i7 released (faster and easier to use)
• Nov 2009: Cell declared end of life (industry also noticed)

http://www.spiral.net/graphics/cell-be.gif

PPE: PowerPC Processing Element
SPE: Synergistic Processing Element

• With the Intel i7, we continued to benefit from scaling and worked with existing tools

• Easy to build, debug, analyze, use

IBM Cell


Map characteristic microcode activity to DSSoC’s elements

• Develop models of binaries and algorithms
  • Image of a binary representation from Cyber Grand Challenge
  • LLVM Intermediate Representation

Programming Model

Binary Representation

System of Processing Elements

Understand the shape/projection of algorithms to map to processor.


Compile-time Optimization


To intelligently schedule at runtime, the algorithm must be compiled to any possible PE that may be tasked to run it

Develop workflows/tools to build the algorithm for many different PEs.

[Figure: an algorithm passes through the compiler to produce object code for PE0, PE1, ..., PEN, so that either candidate path (1 or 2) across graphics processors, neuromorphic, accelerator, and digital signal processor elements can run it]
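The requirement that an algorithm be built for every PE that might run it can be sketched as a registry of per-PE builds that a runtime consults. This is a hypothetical illustration: `KernelRegistry`, the PE type names, and the two Python functions standing in for compiled object code are all assumptions.

```python
class KernelRegistry:
    """Track which PE types have a build of each kernel."""

    def __init__(self):
        self._builds = {}  # kernel name -> {pe_type: callable}

    def register(self, kernel, pe_type):
        """Decorator that records a build of `kernel` for `pe_type`."""
        def decorator(fn):
            self._builds.setdefault(kernel, {})[pe_type] = fn
            return fn
        return decorator

    def targets(self, kernel):
        """PE types the scheduler is allowed to choose for this kernel."""
        return sorted(self._builds.get(kernel, {}))

    def run(self, kernel, pe_type, *args):
        builds = self._builds.get(kernel, {})
        if pe_type not in builds:
            raise LookupError(f"no build of {kernel!r} for {pe_type!r}")
        return builds[pe_type](*args)


registry = KernelRegistry()


@registry.register("dot", "cpu")
def dot_cpu(a, b):
    # Stand-in for object code targeting a general purpose core.
    return sum(x * y for x, y in zip(a, b))


@registry.register("dot", "dsp")
def dot_dsp(a, b):
    # Stand-in for object code targeting a DSP: explicit multiply-accumulate.
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total
```

A scheduler could only consider the PE types returned by `registry.targets("dot")`; asking for a PE type with no build raises `LookupError`, which is the failure the slide warns about.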


• Sometimes efficiency is in how you write the algorithm
  • Implementation details can matter
  • Use the right approach for the job
  • E.g., Convolution vs. Fast Convolution
  • E.g., Cooley-Tukey, Winograd, Good-Thomas, etc.

• Domain-specific libraries
  • Support autotuning, hand-optimization

• Example Libraries:
  • VOLK (Vector-Optimized Library of Kernels)
    • Signal processing / 2-D vector math
    • Includes profiler to optimize to a processor
  • FFTW (Fastest Fourier Transform in the West)
    • Fourier transforms, optimized for general purpose processors
    • Wisdom file to test and store optimal implementations
  • BLAS (Basic Linear Algebra Subprograms)
    • Building blocks for linear algebra
    • Used in LINPACK benchmarks
    • Bundled into optimized libraries like ATLAS (Automatically Tuned Linear Algebra Software)

Software libraries provide access to pre-vetted and optimized code
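The "Convolution vs. Fast Convolution" example from the list above can be shown side by side. The sketch below uses only the standard library, with a simple recursive radix-2 FFT written for clarity rather than speed; the function names are illustrative, and both inputs are assumed to be non-empty.

```python
import cmath


def direct_convolve(x, h):
    """Direct convolution: O(len(x) * len(h)) multiply-accumulates."""
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out


def fft(a, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def fast_convolve(x, h):
    """Fast convolution: zero-pad, multiply spectra, inverse transform."""
    m = len(x) + len(h) - 1
    n = 1
    while n < m:
        n *= 2
    spectrum_x = fft([complex(v) for v in x] + [0j] * (n - len(x)))
    spectrum_h = fft([complex(v) for v in h] + [0j] * (n - len(h)))
    product = [a * b for a, b in zip(spectrum_x, spectrum_h)]
    y = fft(product, inverse=True)
    return [v.real / n for v in y[:m]]  # the inverse FFT here is unscaled
```

Both functions return the same result up to rounding. The direct form tends to win for short filters and the FFT form for long ones, which is the kind of choice a domain library or autotuner can make on the programmer's behalf.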

[Figure: a VOLK kernel passes through a dispatcher to one of several implementations: generic C code, SSE, SSE2, AVX, NEON (intrinsics), NEON (assembly)]

• Performance Monitoring
  • Collect, measure, and perform statistics
  • Power and temperature
  • Cache hits and misses
  • Should include PE introspection
    • Internal performance counters
    • Size, resource, or other constraints on the type of computation or parameterization

• Debuggers
• Operating Systems

Other software tools
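The "collect, measure, and perform statistics" step of performance monitoring can be sketched as a small timing collector. This is an illustrative assumption, not a proposed tool: `KernelMonitor` is a hypothetical name, and it records wall-clock time only, whereas the slide also calls for power, temperature, cache, and PE-internal counters.

```python
import statistics
import time


class KernelMonitor:
    """Collect per-kernel timing samples and summarize them."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock   # injectable so the monitor can be tested
        self._samples = {}    # kernel name -> list of durations in seconds

    def record(self, kernel, seconds):
        self._samples.setdefault(kernel, []).append(seconds)

    def measure(self, kernel, fn, *args):
        """Run fn(*args), record how long it took, and return its result."""
        start = self._clock()
        result = fn(*args)
        self.record(kernel, self._clock() - start)
        return result

    def summary(self, kernel):
        samples = self._samples[kernel]
        return {
            "count": len(samples),
            "mean": statistics.mean(samples),
            "max": max(samples),
        }


monitor = KernelMonitor()
total = monitor.measure("sum", sum, range(1000))
```

A scheduler could read summaries like `monitor.summary("sum")` as one input when estimating how long a kernel takes on a given PE.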



Tools and a developer ecosystem are required to successfully introduce new computing technology

Fix the disconnect between hardware and software through vertical integration


Building a development ecosystem

A key to programmability is the development ecosystem

But a chip that can’t be used, integrated, and programmed is called sand

Adapteva, Analog Devices-BlackFin, Altair, Altera, Ambric, AMD-APU, ARM-MP/Neon, ARM-Mali, Asocs, Aspex, AxisSemi, BOPS, Boston Circuits, Brightscale, Calxeda, Cavium, CEVA, Chameleon, Clearspeed, Cognimem, Cognivue, Cognovo, Coherent Logix, CoreSonic, CPUTech, Cradle, Cswitch, DesignArt, ElementCXI, EZChip, Freescale, Greenarrays, HP, IBM-Cell, IBM-Cyclopse, Icera-PowerVR, Imagination-PowerVR, Imec, Inmos-Transputer, Intel-TFLOPS, Intel-Larrabee, Intel-MIC, Intellasys, Intrinsity, IPFLex, Kalray, Mathstar, MobileEye, ModemArt, Morphics, Morpho, Movidius, NEC, Netlogic, Netronome, Nvidia, Octasic, PACT, Paneve, Picochip, Plurality, Quicksilver, Rapport, Raytheon-Monarch, Recore, Sandbridge, SiByte, SiCortex, Silicon Hive, Silicon Spice, Singular Computing, Sound Design, SpiralGateway, Stream Processors, Stretch, Tabula, Thinking Machines, TI, Tilera, TOPS, Venray, Xelerated, Xilinx, XMOS, Ziilabs

Parallel Processors

This list of processors suggests that solutions exist. So why are we here?

http://www.adapteva.com/andreas-blog/the-siren-song-of-parallel-computing/

Open source development allows community investment and improvement to the ecosystem for the most robust solution.

Open Source Initiative
https://opensource.org



Benefits of a rich development ecosystem

https://www.gitbook.com/book/tra38/essential-copying-and-pasting-from-stack-overflow/details



Program Details


Five program areas
1. Intelligent scheduling
  • Manage the set of domain resources
  • For multiple, simultaneously running applications
2. Software tools
  • Enables a development ecosystem
  • Exercise the full capability and make a highly programmable system
3. Domain representations
  • Build a domain ontology for PE selection
4. Medium access control (MAC)
  • Interconnect the PEs
  • Maximize data throughput, taking into consideration latency, power, and other domain constraints
5. Hardware integration
  • Fabricate a DSSoC with the right set of PEs on the MAC layer
  • Show applications and the software tools running with the intelligent scheduler

Program goals



It’s not just the processor: vector multiply on GPU over CPU

Data transfer overhead

Saturation of parallelism

• GPUs do better at computing convolutions (dense matrix multiplies)
• Cost of data transfer means sometimes the CPU is more efficient
• Resource optimization for multiple applications
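The point that data transfer can make the CPU the better choice can be illustrated with a simple break-even model. All numbers below are invented for the example (per-element times, launch overhead), and the linear model itself is an assumption; real devices also show the saturation of parallelism noted on the slide, which this sketch ignores.

```python
def cpu_time_ns(n, cpu_ns_per_elem=10):
    """Assumed CPU cost: the data is already local, so only compute time counts."""
    return n * cpu_ns_per_elem


def gpu_time_ns(n, launch_ns=50_000, transfer_ns_per_elem=4, gpu_ns_per_elem=1):
    """Assumed GPU cost: fixed launch overhead plus transfer plus compute."""
    return launch_ns + n * (transfer_ns_per_elem + gpu_ns_per_elem)


def breakeven_elements(cpu_ns_per_elem=10, launch_ns=50_000,
                       transfer_ns_per_elem=4, gpu_ns_per_elem=1):
    """Smallest vector length at which the GPU is no slower than the CPU.

    Returns None when transfer plus GPU compute per element is never cheaper
    than CPU compute per element, so the GPU never catches up.
    """
    gain = cpu_ns_per_elem - (transfer_ns_per_elem + gpu_ns_per_elem)
    if gain <= 0:
        return None
    return -(-launch_ns // gain)  # ceiling division on integers
```

Under these made-up numbers the CPU wins for short vectors and the GPU only pays off beyond the break-even length, so a scheduler that knows the vector size can pick the better device per call.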

[Figure: Vector Multiply of inputs A and B producing Result]


• CHIPS program is investigating physical interface standards
• DSSoC program is investigating programming interface standards
• DSSoC will need:
  • A well-defined, standardized interface
  • A medium access control (MAC) structure
    • Global bus?
    • Network on Chip (NoC)?
    • Globally Asynchronous, Locally Synchronous (GALS)?
    • Crossbar? Mesh?
  • Efficient
    • Low power, low latency
  • Extensible
    • Easily add new PEs to plug in
  • Programming interface to MAC
    • Address to any PE in SoC
    • Common data structure definition and handling
    • Scheduler hooks
    • Monitoring hooks

Medium Access Control – how to move data around the SoC
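The programming-interface items listed for the MAC (addressing any PE, a common data structure, and hooks) can be sketched as a toy in-memory fabric. This is purely illustrative: `Packet`, `MacFabric`, and their methods are hypothetical names, the program had not defined an interface at the time of this briefing, and a real MAC would be hardware with very different constraints.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Common data structure exchanged between PEs (assumed layout)."""
    src: int
    dst: int
    payload: bytes


class MacFabric:
    """Toy stand-in for a MAC layer: one receive queue per PE address."""

    def __init__(self, pe_addresses):
        self._queues = {addr: deque() for addr in pe_addresses}
        self._monitor_hooks = []

    def add_monitor_hook(self, hook):
        """Register a callable invoked with every packet that is sent."""
        self._monitor_hooks.append(hook)

    def send(self, packet):
        if packet.dst not in self._queues:
            raise KeyError(f"unknown PE address {packet.dst}")
        self._queues[packet.dst].append(packet)
        for hook in self._monitor_hooks:
            hook(packet)

    def receive(self, addr):
        """Return the next packet waiting at a PE, or None if there is none."""
        queue = self._queues[addr]
        return queue.popleft() if queue else None


fabric = MacFabric(pe_addresses=[0, 1, 2])
bytes_moved = []
fabric.add_monitor_hook(lambda pkt: bytes_moved.append(len(pkt.payload)))
fabric.send(Packet(src=0, dst=2, payload=b"iq-samples"))
```

The monitoring hook is where a scheduler or profiler could observe traffic; adding a new PE amounts to adding an address, which is the extensibility property the list asks for.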


CHIPS interface standard

CHIPS Program Interface Standard Metrics
  Data rate            10 Gbps
  Energy efficiency    < 1 pJ/bit
  Latency              < 5 ns
  Bandwidth density    > 1000 Gbps/mm

[Figure: bandwidth density per energy per bit, in (Gbps/mm) / (pJ/bit) on a log scale from 1 to 1,000,000, versus interconnect distance from 0.1 to 1000 mm, with the CHIPS target marked. Source: Northrop Grumman. Data points: JSSC2016 - Dehlaghi, single-ended, Al on Si; JSSC2013 - Poulton, ground-ref. single-ended, organic PCB; JSSC2012 - Dickson, differential, Cu on Si; JSSC2013 - Mansuri, differential, twinax ribbon cable; ECTC2016 - Mahajan, EMIB; 14nm SERDES, PCB; 14nm HBM]



• Benchmarking to be done against versions of DSSoC developed throughout the program
• Phase 0: State-of-the-art commercial SoC
  • v0 of intelligent scheduler running on a commercial SoC
  • Will have limited number of “PEs”
  • Ex. http://www.hsafoundation.com
• Phase 1: Emulation of DSSoC on discrete hardware
  • v1 of intelligent scheduler running on DSSoC emulation
• Phase 2: DSSoC0
  • v2 of intelligent scheduler running on first spin of DSSoC hardware
  • Results will inform the second spin of DSSoC
  • Program schedule enforces a tight timeline for hardware updates
• Phase 3: DSSoC1
  • v3 of intelligent scheduler running on second spin of DSSoC hardware
  • 5 simultaneously running applications

DSSoC details


DSSoC program timeline

[Figure: program timeline over months 0 to 48, covering phases 0 through 3. Tracks: Ontology; MAC Interface Definition; Software Tools (continuous development and support); Intelligent Scheduler (versions v0, v1, v2, v3); DSSoC Hardware & Accelerators (available SoC, emulated DSSoC on discrete HW, CDR, DSSoC0, DSSoC1); Application Development (≥1, ≥1, ≥2, and ≥5 RF applications). Test points: test on current SoC, test on CDR emulation, test on DSSoC0, test on DSSoC1. Hypothesis H0 and its test are followed by Hypothesis H1 and its test. Co-design and injection of improved software and schedulers run throughout.]


Program metrics

Metric                                        Phase 1     Phase 2     Phase 3

Chip & Scheduler
Number of simultaneous apps                   ≥2          ≥2          ≥5
Integration time for new accelerators [1]     -           ≤3 months   ≤3 months
Power savings relative to previous phase      -           ≤80% [2]    ≤80% [3]
Utilization of PEs [4]                        -           ≥80%        ≥90%
Max. time per scheduler decision              ≤500 ns     ≤50 ns      ≤5 ns

MAC
Latency (PE to PE)                            ≤500 ns     ≤50 ns      ≤5 ns
Throughput (PE to PE)                         ≥25 Gbps    ≥50 Gbps    ≥100 Gbps
Power                                         ≤50% of chip  ≤40% of chip  ≤20% of chip

[1] Three months to integrate new accelerators into DSSoC; enforced by program timeline.
[2] Compare the intelligent scheduler on DSSoC0 to the intelligent scheduler controlling the commercial SoC from phase 0.
[3] Compare the intelligent scheduler on DSSoC1 to the intelligent scheduler on DSSoC0.
[4] Ontology explains the required PEs and utilization; measure average utilization over developed apps.
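The per-phase targets lend themselves to an automated check. The sketch below encodes only the three rows that list a value for every phase (scheduler decision time, PE-to-PE latency, PE-to-PE throughput); the dictionary layout and the function name `unmet_targets` are assumptions made for illustration.

```python
# Targets copied from the metrics rows that give a value for all three phases.
PHASE_TARGETS = {
    1: {"max_latency_ns": 500, "min_throughput_gbps": 25, "max_decision_ns": 500},
    2: {"max_latency_ns": 50, "min_throughput_gbps": 50, "max_decision_ns": 50},
    3: {"max_latency_ns": 5, "min_throughput_gbps": 100, "max_decision_ns": 5},
}


def unmet_targets(phase, latency_ns, throughput_gbps, decision_ns):
    """Return the names of the targets a measurement misses for this phase."""
    target = PHASE_TARGETS[phase]
    failures = []
    if latency_ns > target["max_latency_ns"]:
        failures.append("latency")
    if throughput_gbps < target["min_throughput_gbps"]:
        failures.append("throughput")
    if decision_ns > target["max_decision_ns"]:
        failures.append("scheduler decision time")
    return failures
```

A measurement of 400 ns latency, 30 Gbps throughput, and 450 ns per decision passes Phase 1 but misses all three Phase 2 targets, which shows how steeply the targets tighten from one phase to the next.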


• Tools and a developer ecosystem are required to successfully introduce new computing technology

• This is core to DSSoC
  • HW/SW Co-design
  • Teaming
  • Responsive to the full program – not split into TAs
    1. Intelligent scheduling
    2. Software
    3. Domain representations
    4. Medium access control (MAC)
    5. Hardware integration

• Looking for actual chip prototypes

Wrap-up


Optimizing at: design time, run time, compile time


www.darpa.mil
