Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency
Elad Alon, Krste Asanovic (Director), Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson, Koushik Sen, John Wawrzynek
[email protected]
http://aspire.eecs.berkeley.edu
Slide 2
Future Application Drivers
Slide 3
Compute Energy Iron Law
When power is constrained, we need better energy efficiency to get more performance. Where performance is constrained (real-time), we want better energy efficiency to lower power. Improving energy efficiency is a critical goal for all future systems and workloads.
Performance (Tasks/Second) = Power (Joules/Second) × Energy Efficiency (Tasks/Joule)
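A quick worked instance of the iron law, with hypothetical numbers: at a fixed 5 W power budget, raising energy efficiency from 10 to 20 GFLOPS/W doubles performance.

```latex
\[
\underbrace{5\,\tfrac{\mathrm{J}}{\mathrm{s}}}_{\text{Power}}
\times \underbrace{20\,\tfrac{\mathrm{GFLOP}}{\mathrm{J}}}_{\text{Efficiency}}
= 100\,\tfrac{\mathrm{GFLOP}}{\mathrm{s}}
\qquad\text{vs.}\qquad
5 \times 10 = 50\,\tfrac{\mathrm{GFLOP}}{\mathrm{s}}
\]
```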
Slide 4
Good News: Moore's Law Continues
"Cramming more components onto integrated circuits," Gordon E. Moore, Electronics, 1965. More transistors per chip, cheaper!
Slide 5
Bad News: Dennard (Voltage) Scaling Is Over
Dennard scaling vs. post-Dennard scaling. [Moore, ISSCC Keynote, 2003]
Slide 6
1st Impact of the End of Scaling: End of the Sequential Processor Era
Slide 7
Parallelism: A One-Time Gain
Use more, slower cores for better energy efficiency: either simpler cores, or cores run at lower Vdd/frequency.
- Even simpler general-purpose microarchitectures? Limited by the smallest sensible core.
- Even lower Vdd/frequency? Limited by Vdd/Vt scaling and errors.
Now what?
Slide 8
2nd Impact of the End of Scaling: Dark Silicon
Cannot switch all transistors at full frequency! [Muller, ARM CTO, 2009]
No savior device technology is on the horizon; future energy-efficiency innovations must come from above the transistor level.
Slide 9
The End of General-Purpose Processors?
Most computing happens in specialized, heterogeneous processors, which can be 100-1000X more efficient than a general-purpose processor. Challenges: hardware design costs and software development costs.
[Die photo: NVIDIA Tegra 2]
Slide 10
The Real Scaling Challenge: Communication
As transistors become smaller and cheaper, communication dominates performance and energy. This holds at all scales: across the chip, up and down the memory hierarchy, chip-to-chip, board-to-board, and rack-to-rack.
Slide 11
ASPIRE: From Better to Best
What is the best we can do for a fixed target technology (e.g., 7nm)? Can we prove a bound? Can we design an implementation approaching that bound? Provably Optimal Implementations: specialize and optimize communication and computation across the whole stack, from applications to hardware.
Slide 12
Communication-Avoiding Algorithms: Algorithm Cost Measures
1. Arithmetic (FLOPs)
2. Communication: moving data between levels of a memory hierarchy (sequential case), or between processors over a network (parallel case).
[Figure: sequential model (CPU, cache, DRAM) vs. parallel model (multiple CPU+DRAM nodes)]
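As a rough illustration of the second cost measure (a hypothetical sketch, not project code), one can model the words moved between slow and fast memory for naive versus cache-blocked matrix multiply; blocking cuts communication from O(n^3) to O(n^3/sqrt(M)):

```scala
// Sketch (not project code): modeled words moved between slow and fast
// memory for n x n matrix multiply, with a fast memory of M words.
object CommCost {
  // Naive triple loop: each of the n^2 outputs streams a row of A and a
  // column of B through fast memory => about 2n^3 + n^2 words.
  def naiveWords(n: Long): Long = 2 * n * n * n + n * n

  // Blocked with b x b tiles chosen so 3b^2 <= M: (n/b)^3 tile-multiplies,
  // each moving ~3b^2 words => O(n^3 / sqrt(M)), matching the lower bound.
  def blockedWords(n: Long, m: Long): Long = {
    val b = math.max(1L, math.sqrt(m / 3.0).toLong)
    val tiles = (n + b - 1) / b // ceil(n / b)
    3L * b * b * tiles * tiles * tiles
  }

  def main(args: Array[String]): Unit = {
    val (n, m) = (4096L, 1L << 20) // hypothetical: n=4096, 1M-word cache
    println(s"naive:   ${naiveWords(n)} words moved")
    println(s"blocked: ${blockedWords(n, m)} words moved")
  }
}
```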
Slide 13
Modeling Runtime & Energy
Slide 14
A Few Examples of Speedups
Matrix multiplication:
- Up to 12x on IBM BG/P for n=8K on 64K cores; 95% less communication.
QR decomposition (used in least squares, data mining, ...):
- Up to 8x on an 8-core dual-socket Intel Clovertown, for 10M x 10.
- Up to 6.7x on a 16-processor Pentium III cluster, for 100K x 200.
- Up to 13x on Tesla C2050 (Fermi), for 110K x 100.
- Up to 4x on a grid of 4 cities (Dongarra, Langou et al).
- "Infinite" speedup for out-of-core on a PowerPC laptop: LAPACK thrashed virtual memory and didn't finish.
Eigenvalues of band symmetric matrices:
- Up to 17x on Intel Gainestown, 8 cores, vs. MKL 10.0 (up to 1.9x sequential).
Iterative sparse linear equation solvers (GMRES):
- Up to 4.3x on Intel Clovertown, 8 cores.
N-body (direct particle interactions with cutoff distance):
- Up to 10x on Cray XT-4 (Hopper), 24K particles on 6K processors.
Slide 15
Modeling Energy: Dynamic
Slide 16
Modeling Energy: Memory Retention
Slide 17
Modeling Energy: Background Power
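Only the titles of these three modeling slides survive in this transcript. For orientation, in the notation of the "Perfect Strong Scaling" slide below, the total-energy model decomposes into exactly these three pieces (our regrouping of that slide's formula, not recovered slide content):

```latex
\[
E(P) =
\underbrace{n^3\left(\gamma_e + \frac{\beta_e}{M^{1/2}} + \frac{\alpha_e}{m\,M^{1/2}}\right)}_{\text{dynamic: flops, words, messages}}
+ \underbrace{\delta_e\,(PM)\,T(P)}_{\text{memory retention}}
+ \underbrace{\epsilon_e\,P\,T(P)}_{\text{background/leakage}}
\]
```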
Slide 18
Energy Lower Bounds
Slide 19
Early Result: Perfect Strong Scaling in Time and Energy
Every time you add a processor, use its memory M too. Start with the minimal number of processors: PM = 3n². Increase P by a factor c, and total memory increases by a factor c.
Notation for the timing model: γ_t, β_t, α_t = seconds per flop, per word moved, per message of size m.

\[ T(cP) = \frac{n^3}{cP}\left[\gamma_t + \frac{\beta_t}{M^{1/2}} + \frac{\alpha_t}{m\,M^{1/2}}\right] = \frac{T(P)}{c} \]

Notation for the energy model: γ_e, β_e, α_e = Joules for the same operations; δ_e = Joules per word of memory used per second; ε_e = Joules per second for leakage, etc.

\[ E(cP) = cP\left\{\frac{n^3}{cP}\left[\gamma_e + \frac{\beta_e}{M^{1/2}} + \frac{\alpha_e}{m\,M^{1/2}}\right] + \delta_e\,M\,T(cP) + \epsilon_e\,T(cP)\right\} = E(P) \]

Perfect scaling extends to N-body, Strassen, ... [IPDPS, 2013]
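A small numerical check of the algebra above (all constants are hypothetical placeholders, not measured machine parameters): holding per-processor memory M fixed while scaling P from the minimal count, T(cP) = T(P)/c and E(cP) = E(P) hold exactly.

```scala
// Sketch: numerically checking perfect strong scaling in the model above.
object PerfectScaling {
  val (gT, bT, aT) = (1e-11, 1e-9, 1e-6) // secs per flop / word / message
  val (gE, bE, aE) = (1e-10, 1e-8, 1e-5) // Joules per flop / word / message
  val (dE, eE)     = (1e-12, 1e-1)       // J per word-sec retained; J/sec leakage
  val n    = 1e5                          // matrix dimension
  val mMsg = 1e3                          // words per message
  val M    = 1e6                          // words per processor (fixed as P grows)
  val p0   = 3 * n * n / M                // minimal processor count: P*M = 3n^2

  def T(p: Double): Double =
    n * n * n / p * (gT + bT / math.sqrt(M) + aT / (mMsg * math.sqrt(M)))
  def E(p: Double): Double =
    p * (n * n * n / p * (gE + bE / math.sqrt(M) + aE / (mMsg * math.sqrt(M))) +
         dE * M * T(p) + eE * T(p))

  def main(args: Array[String]): Unit =
    for (c <- Seq(1, 2, 4, 8)) // time drops as 1/c; energy stays constant
      println(f"c=$c  T(cP)=${T(c * p0)}%.3e  T(P)/c=${T(p0) / c}%.3e  E(cP)=${E(c * p0)}%.3e")
}
```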
Slide 20
C-A Algorithms: Not Just for HPC
In ASPIRE, we apply them to other key application areas: machine vision, databases, speech recognition, software-defined radio, ...
Initial results on lower bounds for database join algorithms.
Slide 21
From C-A Algorithms to Provably Optimal Systems?
1) Prove lower bounds on communication for a computation.
2) Develop an algorithm that achieves the lower bound on a system.
3) Find that communication time/energy cost is >90% of the resulting implementation.
4) We know we're within 10% of optimal!
Supporting technique: optimize the software stack and compute engines to reduce compute costs and expose the unavoidable communication costs. (The arithmetic behind steps 3-4 is sketched below.)
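The arithmetic behind steps 3-4, as a sketch with made-up numbers: if the measured communication cost is itself a proven lower bound on any implementation, then the ratio total/communication bounds how far we can be from optimal.

```scala
// Sketch: certifying near-optimality from a communication lower bound.
// Numbers are hypothetical. If commJoules is a proven lower bound on any
// implementation's energy, then total/comm bounds the distance to optimal.
object OptimalityCheck {
  def main(args: Array[String]): Unit = {
    val commJoules    = 9.3 // measured communication energy (= proven lower bound)
    val computeJoules = 0.7 // everything else
    val total = commJoules + computeJoules
    val gap = total / commJoules // >= total / optimal
    println(f"within ${(gap - 1) * 100}%.1f%% of the optimal energy")
  }
}
```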
Slide 22
ESP: An Applications Processor Architecture for ASPIRE
Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore. It is well known how to customize hardware engines for a specific task; the ESP challenge is using specialized engines for general-purpose code.
[Die photos: Intel Ivy Bridge (22nm); Qualcomm Snapdragon MSM8960 (28nm)]
Slide 23
ESP: Ensembles of Specialized Processors
General-purpose hardware is flexible but inefficient; fixed-function hardware is efficient but inflexible. Par Lab insight: patterns capture common operations across many applications, each with a unique communication & computation structure. Build an ensemble of specialized engines, each individually optimized for a particular pattern but collectively covering application needs. The bet: this will give us efficiency plus flexibility, and any given core can have a different mix of these engines depending on workload.
Slide 24
Par Lab: Motifs Common Across Apps
[Figure: Dense, Sparse, and Graph motifs shared across Audio Recognition, Object Recognition, and Scene Analysis]
Slide 25
Par Lab Apps
[Figure: motif (née dwarf) popularity across Par Lab apps and computing domains, colored from red (hot) to blue (cool)]
Slide 27
Mapping Software to ESP: Specializers
Capture the desired functionality at a high level using patterns in a productive high-level language. Use pattern-specific compilers (specializers) with autotuners to produce efficient low-level code. The ASP specializer infrastructure is available as an open-source download (see the sketch below).
[Figure: glue, dense, sparse, and graph code from Audio Recognition, Object Recognition, and Scene Analysis mapped onto ILP, Dense, Sparse, and Graph engines]
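ASP itself is a Python-based framework; purely to illustrate the specializer idea, here is a hypothetical Scala sketch (all names invented) in which a high-level pattern is lowered by a pattern-specific, autotuner-parameterized code generator:

```scala
// Hypothetical sketch of the specializer idea (not the real ASP API):
// the app author writes against a pattern; a pattern-specific specializer
// lowers it to tuned low-level code for the matching engine.
trait Specializer[In, Out] {
  def lower(tuningParams: Map[String, Int]): String // emit low-level code
}

// A "map" pattern captured at a high level...
case class MapPattern(bodyExpr: String)
    extends Specializer[Array[Double], Array[Double]] {
  // ...specialized with, e.g., an autotuner-chosen unroll factor.
  def lower(t: Map[String, Int]): String = {
    val unroll = t.getOrElse("unroll", 4)
    s"// tuned loop for the dense engine, unroll=$unroll\n" +
      s"for (int i = 0; i < n; i += $unroll) { /* $bodyExpr */ }"
  }
}

object Demo extends App {
  println(MapPattern("y[i] = a * x[i]").lower(Map("unroll" -> 8)))
}
```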
Slide 28
Replacing Fixed Accelerators with a Programmable Fabric
Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore. The fabric challenge is retaining extreme energy efficiency while gaining programmability.
[Die photos: Intel Ivy Bridge (22nm); Qualcomm Snapdragon MSM8960 (28nm)]
Slide 29
Strawman Fabric Architecture
[Figure: the fabric as a tiled array of M/A/R blocks]
Will never have a C compiler; only programmed using pattern-based DSLs. More dynamic, less static than earlier approaches: dynamic dataflow-driven execution, dynamic routing, large memory support.
Slide 30
Agile Hardware Development
Current hardware design is slow and arduous, but we now have a huge design space to explore. How can we examine many design points efficiently? Build parameterized generators, not point designs! Adopt and adapt best practices from Agile Software:
- Complete LVS/DRC-clean physical design of the current version every ~two weeks ("tape-in").
- Incremental feature addition.
- Test & verification as the first step.
Slide 31
Chisel: Constructing Hardware In a Scala Embedded Language
Embed a hardware-description language in Scala, using Scala's extension facilities. A hardware module is just a data structure in Scala, and different output routines can generate different types of output (C, FPGA Verilog, ASIC Verilog) from the same hardware representation. The full power of Scala is available for writing hardware generators:
- Object-oriented: factory objects, traits, overloading, etc.
- Functional: higher-order functions, anonymous functions, currying.
- Compiles to the JVM: good performance, Java interoperability.
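For flavor, a minimal parameterized generator (a sketch written against current Chisel 3 syntax, which postdates the Chisel 2 of this talk; the elaboration entry point varies by Chisel version):

```scala
import chisel3._

// A width-parameterized accumulator: one Scala class describes a whole
// family of hardware designs, not a single point design.
class Accumulator(width: Int) extends Module {
  val io = IO(new Bundle {
    val in  = Input(UInt(width.W))
    val en  = Input(Bool())
    val sum = Output(UInt(width.W))
  })
  val acc = RegInit(0.U(width.W)) // state element, reset to zero
  when(io.en) { acc := acc + io.in }
  io.sum := acc
}

object Elaborate extends App {
  // Two different design points from the same generator (Chisel 3.x API):
  println(chisel3.stage.ChiselStage.emitVerilog(new Accumulator(8)))
  println(chisel3.stage.ChiselStage.emitVerilog(new Accumulator(32)))
}
```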
Slide 32
Chisel Design Flow
[Figure: a Chisel program, running on Scala/JVM, emits three outputs: C++ code → C++ compiler → software simulator; FPGA Verilog → FPGA tools → FPGA emulation; ASIC Verilog → ASIC tools → GDS layout]
Slide 33
Chisel Is Much More Than an HDL
The base Chisel system lets you use the full power of Scala to describe the RTL of a design, then generate Verilog or C++ output from that RTL. But Chisel can be extended above with domain-specific languages (e.g., signal processing) for the fabric. Importantly, Chisel can also be extended below with new backends or to add new tools or features (e.g., quantum computing circuits). Only ~6,000 lines of code in the current version, including libraries! BSD-licensed open source at: chisel.eecs.berkeley.edu
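One concrete flavor of "the full power of Scala": an ordinary higher-order function acts as a circuit generator. This sketch (again in current Chisel 3 syntax, not code from the talk) folds an arbitrary binary operator over a vector of wires, so the same class elaborates an adder chain or a max chain:

```scala
import chisel3._

// A generic reducer: folding a Scala function over hardware wires builds
// a combinational reduction chain over any number of inputs.
class Reducer(n: Int, width: Int, op: (UInt, UInt) => UInt) extends Module {
  val io = IO(new Bundle {
    val in  = Input(Vec(n, UInt(width.W)))
    val out = Output(UInt(width.W))
  })
  io.out := io.in.reduce(op) // plain Scala reduce over hardware nodes
}

object ReduceDemo extends App {
  // Same generator, two different circuits:
  println(chisel3.stage.ChiselStage.emitVerilog(new Reducer(8, 16, _ + _)))
  println(chisel3.stage.ChiselStage.emitVerilog(
    new Reducer(8, 16, (a, b) => Mux(a > b, a, b))))
}
```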
Slide 34
Many Processor Tapeouts in a Few Years with a Small Group (45nm, 28nm)
[Die photos: clock, SRAM, and DC-DC test sites; a processor site with CORE 0-3 on voltage domains VC0-VC3 and a 512KB L2 on VFIXED]
Slide 35
Resilient Circuits & Modeling
Future scaled technologies have high variability, yet we want to run with the lowest possible margins to save energy. With a significant increase in soft errors, we need resilient systems. Technology modeling determines the tradeoff between MTBF and energy per task for logic, SRAM, & interconnect. Techniques that reduce operating voltage can be worse for energy, due to the rapid rise in errors.
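A toy model of that last point (ours, not from the slide): if each attempt at a task costs energy E_attempt(V), which falls as voltage drops, but fails with probability p_err(V), which rises steeply near the error cliff, then the expected energy per completed task with retries is

```latex
\[
\mathbb{E}\!\left[E_{\text{task}}\right](V) = \frac{E_{\text{attempt}}(V)}{1 - p_{\text{err}}(V)},
\]
```

so below some voltage the denominator shrinks faster than the numerator, expected energy rises, and the minimum-energy operating point sits above the error cliff.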
Slide 36
Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency
[Figure: the full ASPIRE stack: applications (Audio Recognition, Object Recognition, Scene Analysis); patterns (Pipe&Filter, Map-Reduce, Dense, Sparse, Graph); communication-avoiding algorithms (C-A GEMM, C-A BFS, C-A SpMV); glue, dense, sparse, and graph code mapped onto ILP, Dense, Sparse, and Graph engines with local stores + DMA and hardware cache coherence; validated by C++ simulation and FPGA emulation; realized as an ASIC SoC or FPGA computer]
Slide 37
ASPIRE Project
Initial $15.6M/5.5-year funding from the DARPA PERFECT program. Started 9/28/2012; located in the Par Lab space + BWRC. Looking for industrial affiliates (see Krste!). Open house today, 5th floor Soda Hall.
Research funded by DARPA Award Number HR0011-12-2-0016. Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government, and no official endorsement should be inferred.