Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency

  • Slide 1
  • Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency Elad Alon, Krste Asanovic (Director), Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson, Koushik Sen, John Wawrzynek [email protected] http://aspire.eecs.berkeley.edu
  • Slide 2
  • UC Berkeley Future Application Drivers
  • Slide 3
  • UC Berkeley Compute Energy Iron Law: Performance (Tasks/Second) = Power (Joules/Second) × Energy Efficiency (Tasks/Joule). When power is constrained, we need better energy efficiency for more performance. When performance is constrained (real-time), we want better energy efficiency to lower power. Improving energy efficiency is a critical goal for all future systems and workloads.
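A minimal Scala sketch (my own, not from the slides) of the iron-law arithmetic above; the object name IronLaw and the example numbers are illustrative assumptions.

      // Iron law from the slide: Tasks/Second = (Joules/Second) * (Tasks/Joule).
      object IronLaw {
        def performance(powerWatts: Double, tasksPerJoule: Double): Double =
          powerWatts * tasksPerJoule

        def main(args: Array[String]): Unit = {
          // Hypothetical mobile SoC: 5 W power budget at 2e9 tasks/joule.
          println(performance(5.0, 2e9))   // 1.0e10 tasks/second
          // Doubling energy efficiency at the same power budget doubles performance.
          println(performance(5.0, 4e9))   // 2.0e10 tasks/second
        }
      }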
  • Slide 4
  • UC Berkeley Good News: Moore's Law Continues. "Cramming more components onto integrated circuits," Gordon E. Moore, Electronics, 1965. More transistors/chip, cheaper!
  • Slide 5
  • UC Berkeley Bad News: Dennard (Voltage) Scaling Is Over. [Figure: Dennard scaling vs. post-Dennard scaling; Moore, ISSCC Keynote, 2003]
  • Slide 6
  • UC Berkeley 1st Impact of End of Scaling: End of the Sequential Processor Era
  • Slide 7
  • UC Berkeley Parallelism: A One-Time Gain. Use more, slower cores for better energy efficiency: either simpler cores, or run cores at lower Vdd/frequency. Even simpler general-purpose microarchitectures? Limited by the smallest sensible core. Even lower Vdd/frequency? Limited by Vdd/Vt scaling and errors. Now what?
  • Slide 8
  • UC Berkeley 2nd Impact of End of Scaling: Dark Silicon [Muller, ARM CTO, 2009]. Cannot switch all transistors at full frequency! No savior device technology on the horizon; future energy-efficiency innovations must come from above the transistor level.
  • Slide 9
  • UC Berkeley The End of General-Purpose Processors? Most computing happens in specialized, heterogeneous processors, which can be 100-1000X more efficient than a general-purpose processor. Challenges: hardware design costs and software development costs. [Die photo: NVIDIA Tegra 2]
  • Slide 10
  • UC Berkeley The Real Scaling Challenge: Communication. As transistors become smaller and cheaper, communication dominates performance and energy at all scales: across the chip, up and down the memory hierarchy, chip-to-chip, board-to-board, and rack-to-rack.
  • Slide 11
  • UC Berkeley ASPIRE: From Better to Best. What is the best we can do for a fixed target technology (e.g., 7nm)? Can we prove a bound? Can we design an implementation approaching the bound? Provably Optimal Implementations: specialize and optimize communication and computation across the whole stack, from applications to hardware.
  • Slide 12
  • UC Berkeley Communication-Avoiding Algorithms: Algorithm Cost Measures. 1. Arithmetic (FLOPs). 2. Communication: moving data between levels of a memory hierarchy (sequential case) or between processors over a network (parallel case). [Diagrams: CPU-cache-DRAM hierarchy; CPUs with local DRAM connected by a network]
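To make the two cost measures concrete, here is a small Scala sketch (my own illustration, not from the talk) counting flops and estimating words moved for blocked n x n matrix multiply in a two-level memory model; the blocking constant and the names CostModel/wordsMovedBlocked are assumptions.

      object CostModel {
        // Arithmetic cost measure: classical n x n matrix multiply does 2*n^3 flops
        // no matter how it is blocked.
        def flops(n: Long): Double = 2.0 * n * n * n

        // Communication cost measure: words moved between DRAM and a fast memory
        // (cache) of `cacheWords` words when multiplying with b x b blocks chosen so
        // three blocks fit in cache. Roughly 3*sqrt(3)*n^3/sqrt(M), matching the
        // known Omega(n^3/sqrt(M)) lower bound up to a constant factor.
        def wordsMovedBlocked(n: Long, cacheWords: Long): Double = {
          val b = math.sqrt(cacheWords / 3.0)     // block size: 3 blocks fit in cache
          val blocksPerDim = math.ceil(n / b)
          3.0 * b * b * blocksPerDim * blocksPerDim * blocksPerDim
        }

        def main(args: Array[String]): Unit = {
          println(flops(4096))                        // arithmetic: fixed at 2n^3
          println(wordsMovedBlocked(4096, 1L << 20))  // communication: shrinks as cache grows
          println(wordsMovedBlocked(4096, 1L << 23))
        }
      }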
  • Slide 13
  • UC Berkeley Modeling Runtime & Energy
  • Slide 14
  • UC Berkeley A few examples of speedups. Matrix multiplication: up to 12x on IBM BG/P for n=8K on 64K cores, with 95% less communication. QR decomposition (used in least squares, data mining, ...): up to 8x on an 8-core dual-socket Intel Clovertown for 10M x 10; up to 6.7x on a 16-processor Pentium III cluster for 100K x 200; up to 13x on Tesla C2050 / Fermi for 110K x 100; up to 4x on a grid of 4 cities (Dongarra, Langou et al.); "infinite" speedup for out-of-core on a PowerPC laptop (LAPACK thrashed virtual memory and didn't finish). Eigenvalues of band symmetric matrices: up to 17x on Intel Gainestown (8 cores) vs. MKL 10.0 (up to 1.9x sequential). Iterative sparse linear equation solvers (GMRES): up to 4.3x on Intel Clovertown (8 cores). N-body (direct particle interactions with a cutoff distance): up to 10x on Cray XT-4 (Hopper), 24K particles on 6K processors.
  • Slide 15
  • UC Berkeley Modeling Energy: Dynamic
  • Slide 16
  • UC Berkeley Modeling Energy: Memory Retention
  • Slide 17
  • UC Berkeley Modeling Energy: Background Power
  • Slide 18
  • UC Berkeley Energy Lower Bounds
  • Slide 19
  • UC Berkeley Early Result: Perfect Strong Scaling in Time and Energy. Every time you add a processor, use its memory M too. Start with the minimal number of processors: P·M = 3n². Increase P by a factor c and total memory increases by a factor c. Notation for the timing model: γ_t, β_t, α_t = seconds per flop, per word moved, per message of size m. T(cP) = (n³/(cP)) · [γ_t + β_t/√M + α_t/(m·√M)] = T(P)/c. Notation for the energy model: γ_e, β_e, α_e = Joules for the same operations; δ_e = Joules per word of memory used per second; ε_e = Joules per second for leakage, etc. E(cP) = cP · { (n³/(cP)) · [γ_e + β_e/√M + α_e/(m·√M)] + δ_e·M·T(cP) + ε_e·T(cP) } = E(P). Perfect scaling extends to n-body, Strassen, ... [IPDPS, 2013]
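The Scala sketch below just evaluates the reconstructed time and energy models numerically to show the claimed behavior: T drops by the factor c while E stays constant. All machine constants (gt, bt, at, ge, be, ae, de, ee, standing in for γ_t, β_t, α_t, γ_e, β_e, α_e, δ_e, ε_e) and problem sizes are made-up values for illustration only.

      object PerfectScaling {
        // Reconstructed models (per-processor memory M stays fixed as P grows):
        //   T(P) = (n^3/P) * (gt + bt/sqrt(M) + at/(m*sqrt(M)))
        //   E(P) = P * ( (n^3/P) * (ge + be/sqrt(M) + ae/(m*sqrt(M))) + (de*M + ee) * T(P) )
        val (gt, bt, at) = (1e-9, 1e-8, 1e-6)  // s per flop, per word moved, per message
        val (ge, be, ae) = (1e-9, 1e-8, 1e-6)  // J per flop, per word moved, per message
        val (de, ee)     = (1e-9, 0.1)         // J per (word*s) of memory, J/s of leakage
        val (n, m)       = (1e4, 1e3)          // matrix dimension, words per message

        def time(p: Double, mem: Double): Double =
          (n * n * n / p) * (gt + bt / math.sqrt(mem) + at / (m * math.sqrt(mem)))

        def energy(p: Double, mem: Double): Double =
          p * ((n * n * n / p) * (ge + be / math.sqrt(mem) + ae / (m * math.sqrt(mem)))
               + (de * mem + ee) * time(p, mem))

        def main(args: Array[String]): Unit = {
          val p0  = 64.0
          val mem = 3.0 * n * n / p0            // start at the minimal memory: P*M = 3n^2
          for (c <- Seq(1.0, 2.0, 4.0))         // add processors (and their memory)
            println(f"c=$c%.0f  T=${time(c * p0, mem)}%.3f s  E=${energy(c * p0, mem)}%.1f J")
        }
      }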
  • Slide 20
  • UC Berkeley C-A Algorithms: Not Just for HPC. In ASPIRE, apply them to other key application areas: machine vision, databases, speech recognition, software-defined radio, ... Initial results on lower bounds for database join algorithms.
  • Slide 21
  • UC Berkeley From C-A Algorithms to Provably Optimal Systems? 1) Prove lower bounds on communication for a computation. 2) Develop an algorithm that achieves the lower bound on a system. 3) Find that communication time/energy cost is >90% of the resulting implementation. 4) Then we know we're within ~10% of optimal! Supporting technique: optimize the software stack and compute engines to reduce compute costs and expose the unavoidable communication costs.
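A tiny sketch (my restatement, not from the slides) of the step-4 arithmetic: if the modeled communication cost is a valid lower bound and the measured implementation spends a fraction f >= 0.9 of its cost on that communication, the implementation is within 1/f - 1, i.e. about 11% or less, of optimal, which the slide rounds to 10%. The names OptimalityGap/gap and the measurement numbers are hypothetical.

      object OptimalityGap {
        // How far above the communication lower bound the measured cost is.
        def gap(measuredTotal: Double, communicationLowerBound: Double): Double =
          measuredTotal / communicationLowerBound - 1.0

        def main(args: Array[String]): Unit = {
          // Hypothetical measurement: 1.0 J total, of which 0.92 J is the
          // unavoidable (lower-bound) communication energy.
          println(f"${gap(1.0, 0.92) * 100}%.1f%% above the lower bound")  // ~8.7%
        }
      }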
  • Slide 22
  • UC Berkeley ESP: An Applications Processor Architecture for ASPIRE. Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore. It is well known how to customize hardware engines for a specific task; the ESP challenge is using specialized engines for general-purpose code. [Die photos: Intel Ivy Bridge (22nm), Qualcomm Snapdragon MSM8960 (28nm)]
  • Slide 23
  • UC Berkeley ESP: Ensembles of Specialized Processors. General-purpose hardware is flexible but inefficient; fixed-function hardware is efficient but inflexible. Par Lab insight: patterns capture common operations across many applications, each with a unique communication and computation structure. Build an ensemble of specialized engines, each individually optimized for a particular pattern but collectively covering application needs. Bet: this will give us efficiency plus flexibility, and any given core can have a different mix of these engines depending on workload.
  • Slide 24
  • UC Berkeley Par Lab: Motifs common across apps. [Figure: audio recognition, object recognition, and scene analysis apps mapped onto Dense, Sparse, and Graph motifs]
  • Slide 25
  • UC Berkeley [Figure: heat map of motif (née dwarf) popularity, red hot / blue cool, across the Par Lab apps and other computing domains]
  • Slide 26
  • UC Berkeley Architecting Parallel Software: identify the software structure and identify the key computations. Structural patterns: Pipe-and-Filter, Agent-and-Repository, Event-Based, Bulk Synchronous, Map-Reduce, Layered Systems, Model-View-Controller, Arbitrary Task Graphs, Puppeteer. Computational patterns: Graph Algorithms, Dynamic Programming, Dense/Sparse Linear Algebra, Un/Structured Grids, Graphical Models, Finite State Machines, Backtrack/Branch-and-Bound, N-Body Methods, Circuits, Spectral Methods, Monte Carlo.
  • Slide 27
  • UC Berkeley Mapping Software to ESP: Specializers. Capture desired functionality at a high level using patterns in a productive high-level language. Use pattern-specific compilers (specializers) with autotuners to produce efficient low-level code. The ASP specializer infrastructure is available as an open-source download. [Figure: apps (audio recognition, object recognition, scene analysis) compiled into glue, dense, sparse, and graph code targeting ILP, Dense, Sparse, and Graph engines]
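The sketch below illustrates only the specializer idea, in Scala (the actual ASP infrastructure is a separate, Python-based system whose API is not shown here): a pattern is expressed once, several low-level variants stand in for generated code, and an autotuner picks the fastest one for the machine at hand. All names (MiniSpecializer, Kernel, autotune) are hypothetical.

      object MiniSpecializer {
        type Kernel = Array[Double] => Double

        // Stand-ins for generated code variants of a "reduce" pattern.
        val variants: Seq[(String, Kernel)] = Seq(
          ("while-loop", (a: Array[Double]) => {
            var s = 0.0; var i = 0
            while (i < a.length) { s += a(i); i += 1 }
            s
          }),
          ("builtin-sum", (a: Array[Double]) => a.sum)
        )

        // Autotune: time each variant on a representative input, keep the fastest.
        def autotune(input: Array[Double]): Kernel =
          variants.minBy { case (_, k) =>
            val t0 = System.nanoTime(); k(input); System.nanoTime() - t0
          }._2

        def main(args: Array[String]): Unit = {
          val best = autotune(Array.fill(1 << 20)(1.0))
          println(best(Array(1.0, 2.0, 3.0)))   // 6.0, via the selected variant
        }
      }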
  • Slide 28
  • UC Berkeley Replacing Fixed Accelerators with a Programmable Fabric. Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore. The fabric challenge is keeping extreme energy efficiency while providing programmability. [Die photos: Intel Ivy Bridge (22nm), Qualcomm Snapdragon MSM8960 (28nm)]
  • Slide 29
  • UC Berkeley Strawman Fabric Architecture. [Figure: 2-D array of tiles, each labeled M, A, R] Will never have a C compiler; only programmed using pattern-based DSLs. More dynamic, less static than earlier approaches: dynamic dataflow-driven execution, dynamic routing, large memory support.
  • Slide 30
  • UC Berkeley Agile Hardware Development. Current hardware design is slow and arduous, but we now have a huge design space to explore. How to examine many design points efficiently? Build parameterized generators, not point designs! Adopt and adapt best practices from agile software: a complete LVS/DRC-clean physical design of the current version every ~two weeks (a "tape-in"), incremental feature addition, and test & verification as the first step.
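Purely illustrative Scala sketch of "generators, not point designs": one parameterized description plus a sweep like this can enumerate and filter an entire design space instead of hand-crafting single instances. The Config fields, cost estimators, and budgets are made-up stand-ins for feedback from a real flow.

      object DesignSweep {
        case class Config(cores: Int, l2KB: Int, vdd: Double)

        // Toy area/power estimators standing in for feedback from the ASIC flow.
        def areaMm2(c: Config): Double = c.cores * 0.8 + c.l2KB * 0.002
        def powerW(c: Config): Double  = c.cores * 0.25 * c.vdd * c.vdd

        def main(args: Array[String]): Unit = {
          val space = for {
            cores <- Seq(1, 2, 4, 8)
            l2    <- Seq(256, 512, 1024)
            vdd   <- Seq(0.6, 0.8, 1.0)
          } yield Config(cores, l2, vdd)

          // Keep only points meeting toy area and power budgets, biggest cores first.
          val feasible = space.filter(c => areaMm2(c) < 8.0 && powerW(c) < 2.0)
          feasible.sortBy(-_.cores).take(5).foreach(println)
        }
      }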
  • Slide 31
  • UC Berkeley Chisel: Constructing Hardware In a Scala Embedded Language. Embed a hardware-description language in Scala, using Scala's extension facilities. A hardware module is just a data structure in Scala. Different output routines can generate different types of output (C++, FPGA Verilog, ASIC Verilog) from the same hardware representation. Full power of Scala for writing hardware generators: object-oriented (factory objects, traits, overloading, etc.), functional (higher-order functions, anonymous functions, currying), and compiled to the JVM (good performance, Java interoperability).
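A minimal module sketch in Chisel-2-era syntax (roughly contemporary with this talk); constructs such as UInt(INPUT, w) and Reg(next = ...) are period-specific and have changed in later Chisel releases, and the MovingSum example itself is hypothetical. It shows the two points above: the module is an ordinary Scala data structure, and plain Scala (here scanLeft/reduce) acts as the hardware generator.

      import Chisel._

      // Hypothetical example: a w-bit moving-sum filter over `taps` input samples.
      class MovingSum(w: Int, taps: Int) extends Module {
        val io = new Bundle {
          val in  = UInt(INPUT, w)
          val out = UInt(OUTPUT, w)
        }
        // Plain Scala builds the hardware: scanLeft creates a chain of registers
        // (a shift register), and reduce sums the taps with hardware adders.
        val delayed = (1 until taps).scanLeft(io.in)((d, _) => Reg(next = d))
        io.out := delayed.reduce(_ + _)
      }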
  • Slide 32
  • UC Berkeley Chisel Design Flow. [Diagram: a Chisel program runs on Scala/JVM and emits C++ code (compiled by a C++ compiler into a software simulator), FPGA Verilog (through FPGA tools to FPGA emulation), and ASIC Verilog (through ASIC tools to GDS layout)]
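A hedged sketch of driving that flow from code, reusing the hypothetical MovingSum module above and written in Chisel-2-era style; the chiselMain entry point and flags such as --backend and --targetDir are recalled from tutorials of that period and may differ in other releases.

      import Chisel._

      object MovingSumMain {
        def main(args: Array[String]): Unit = {
          // Assumption: in Chisel 2, "--backend c" selected the C++ simulator
          // backend and "--backend v" selected Verilog; exact flags vary by version.
          chiselMain(Array("--backend", "c", "--targetDir", "generated"),
                     () => Module(new MovingSum(16, 4)))
        }
      }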
  • Slide 33
  • UC Berkeley Chisel is much more than an HDL. The base Chisel system lets you use the full power of Scala to describe the RTL of a design, then generate Verilog or C++ output from that RTL. Chisel can be extended above with domain-specific languages (e.g., signal processing) for the fabric. Importantly, Chisel can also be extended below with new backends or new tools and features (e.g., quantum computing circuits). Only ~6,000 lines of code in the current version, including libraries! BSD-licensed open source at chisel.eecs.berkeley.edu.
  • Slide 34
  • UC Berkeley Many processor tapeouts in a few years with a small group (45nm, 28nm). [Die plots: clock, SRAM, and DC-DC test sites plus a processor site with four cores (CORE 0-3, each in its own voltage domain VC0-VC3) and a 512KB L2 on a fixed supply]
  • Slide 35
  • UC Berkeley Resilient Circuits & Modeling. Future scaled technologies have high variability, but we want to run with the lowest possible margins to save energy. A significant increase in soft errors means we need resilient systems. Technology modeling determines the tradeoff between MTBF and energy per task for logic, SRAM, and interconnect. Techniques that reduce operating voltage can be worse for energy due to a rapid rise in errors.
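A toy Scala model (my own illustration, not the project's technology models) of the last point: dynamic energy falls as Vdd², but as voltage approaches threshold the error rate, and hence the expected re-execution cost, rises sharply, so energy per task can get worse below some voltage. The threshold value and the exponential error model are arbitrary assumptions.

      object VoltageEnergyToy {
        val vth = 0.45                                  // assumed near-threshold voltage (V)
        def dynamicEnergy(v: Double): Double = v * v    // ~ C * Vdd^2, C normalized to 1
        def errorProb(v: Double): Double =              // assumed model: errors grow fast near vth
          math.min(0.99, math.exp(-12.0 * (v - vth)))

        // Expected energy per successful task, charging a full re-execution per error.
        def energyPerTask(v: Double): Double = dynamicEnergy(v) / (1.0 - errorProb(v))

        def main(args: Array[String]): Unit = {
          for (i <- 10 to 20) {
            val v = i * 0.05                            // sweep Vdd from 0.50 V to 1.00 V
            println(f"Vdd=$v%.2f V  E/task=${energyPerTask(v)}%.3f (normalized)")
          }
        }
      }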
  • Slide 36
  • UC Berkeley Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency. [Summary figure: applications (audio recognition, object recognition, scene analysis) expressed with patterns (Pipe & Filter, Map-Reduce, Dense, Sparse, Graph) and communication-avoiding algorithms (C-A GEMM, C-A BFS, C-A SpMV), specialized into glue/dense/sparse/graph code running on ILP/Dense/Sparse/Graph engines with local stores + DMA and hardware cache coherence, implemented as an ASIC SoC and validated by C++ simulation and FPGA emulation on an FPGA computer]
  • Slide 37
  • UC Berkeley ASPIRE Project. Initial $15.6M/5.5-year funding from the DARPA PERFECT program; started 9/28/2012; located in the Par Lab space + BWRC. Looking for industrial affiliates (see Krste!). Open house today, 5th floor Soda Hall. Research funded by DARPA Award Number HR0011-12-2-0016. Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.