
Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures

Ardavan Pedram, Robert van de Geijn, Andreas Gerstlauer

Outline

• Motivation and Vision
• Related Work and Background
• Linear Algebra Processor
• Power/Performance Analysis
• Conclusion and Future Work



Trends in Processors

• Technology scaling has reached physical limits
  – Power is now the limit on performance
• Chips may contain dark silicon
  – Only a fraction of the chip can be active at any time


Heterogeneous Solutions

• Increase power efficiency (GFLOPS/W)
• More cores running at lower frequency and power
• Specialized cores: orders of magnitude better power efficiency (GFLOPS/W), but expensive and slow to bring to market

[Figure: Nvidia Tegra System on Chip]

Linear Algebra Processor Design Goals

• Efficiency of full-custom hardware: orders of magnitude improvement
• Achieving the upper limits of the power/performance ratio
• Flexibility to execute a whole class of coarse-grain operations
• Co-optimized and co-designed across all layers
• Targeting linear algebra applications

Source: Andreas Olofsson

Linear Algebra Routines

• Linear Algebra Package (LAPACK) level
  – Cholesky and QR factorization
• Basic Linear Algebra Subprograms (BLAS)
  – General matrix-matrix multiplication (GEMM)
• Inner kernels
  – Hand-optimized
• GEMM is often what delivers high performance to many crucial applications


Outline

• Motivation and Vision
• Related Work and Background
• Linear Algebra Processor
• Power/Performance Analysis
• Conclusion and Future Work


GEMM Implementations

• CPUs: 95% of peak
  – [Goto et al. 2008], [Intel MKL]
  – Intel quad-core: 40 GFLOPS @ 2.6 GHz
• GPUs: 70% of peak
  – [Nath et al. 2010] Nvidia Fermi: 350 GFLOPS @ 1.15 GHz
  – [Volkov et al. 2008] Nvidia Tesla
• FPGAs: 99% of peak
  – [Zikari et al. 2007], [Zhuo et al. 2008]
  – Altera Stratix IV: 100 GFLOPS @ 0.4 GHz
• Specialized architectures
  – ClearSpeed CSX700: 78% of peak, 75 GFLOPS @ 0.25 GHz
  – Systolic arrays: [Lippert et al. 2001]


Common Sources of Inefficiency in Conventional Architectures

• CPUs & GPUs
  – Instruction handling
  – Multi-ported register files
  – Cache overheads: tags and coherency
  – Thread scheduling
• FPGAs
  – Low area efficiency
• Specialized architectures
  – Data communication overheads


Outline

• Motivation and Vision
• Related Work and Background
• Linear Algebra Processor
• Power/Performance Modeling
• Generalization
• Conclusion and Future Work


Matrix Multiplication Hierarchy

• Fastest general-purpose implementation of GEMM [GotoBLAS]

[Figure: blocking of C += A·B across the memory hierarchy]

Rank-1 Update

• A rank-1 update adds the outer product of two vectors to a matrix: C += x·y^T

• Matrix multiplication as a series of rank-1 updates: let C, A, and B be 4×4, 4×kc, and kc×4 matrices. Then C += A·B can be computed as:

  for i = 0 to kc-1
      C += A(:,i) · B(i,:)
  end for
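To make the loop body concrete, here is a minimal C sketch (mine, not from the slides) of C += A·B computed as kc successive rank-1 updates; the 4×kc and kc×4 shapes match the example above:

    #include <stdio.h>

    #define N  4   /* C is N x N */
    #define KC 3   /* inner dimension kc (illustrative) */

    /* C += A*B as KC successive rank-1 updates: at step p,
     * add the outer product of column p of A and row p of B. */
    void gemm_rank1(double C[N][N], const double A[N][KC],
                    const double B[KC][N]) {
        for (int p = 0; p < KC; p++)       /* one rank-1 update per step */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    C[i][j] += A[i][p] * B[p][j];
    }

    int main(void) {
        double A[N][KC] = {{1,0,2},{0,1,0},{3,0,1},{0,2,0}};
        double B[KC][N] = {{1,2,0,0},{0,1,1,0},{2,0,0,1}};
        double C[N][N]  = {{0}};
        gemm_rank1(C, A, B);
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) printf("%5.1f ", C[i][j]);
            printf("\n");
        }
        return 0;
    }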

Linear Algebra Core (LAC) Design

• Customized for rank-1 updates
  – 2D arrangement of PEs
  – Broadcast buses
• Integrates into the memory hierarchy

Memory Hierarchy

[Figure sequence: GEMM blocked across main memory, on-chip memory, and the core-local stores]

• Main memory → on-chip memory: C += A0·B0 + … + A(K-1)·B(K-1)
• On-chip memory → core-local stores: Ci += Ai,p · Bp
• Core-local stores → the LAC: Ci,j += Ai,p · Bp,j
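A hedged loop-nest sketch of this blocking (a generic blocked GEMM; the block size BS and loop order are illustrative, not the LAP's exact schedule):

    #include <stdio.h>

    #define N  8   /* matrix dimension (multiple of BS) */
    #define BS 4   /* tile held in a core-local store   */

    /* Blocked C += A*B mirroring the hierarchy above: the p loop
     * accumulates C += Ap*Bp (on-chip level), the i,j loops pick
     * the tile Ci,j += Ai,p*Bp,j (core-local level), and the
     * innermost loops are the core's rank-1-update kernel. */
    void gemm_blocked(double C[N][N], const double A[N][N],
                      const double B[N][N]) {
        for (int p = 0; p < N; p += BS)
            for (int i = 0; i < N; i += BS)
                for (int j = 0; j < N; j += BS)
                    for (int pp = p; pp < p + BS; pp++)
                        for (int ii = i; ii < i + BS; ii++)
                            for (int jj = j; jj < j + BS; jj++)
                                C[ii][jj] += A[ii][pp] * B[pp][jj];
    }

    int main(void) {
        static double A[N][N], B[N][N], C[N][N];
        for (int i = 0; i < N; i++) { A[i][i] = 2.0; B[i][i] = 3.0; }
        gemm_blocked(C, A, B);
        printf("C[0][0] = %.1f (expect 6.0)\n", C[0][0]);
        return 0;
    }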

Design of the Linear Algebra Core (LAC)

• Distributed-memory architecture
• Broadcast buses

Data Mapping on the LAC

Mapping of a 16×16 matrix A onto the 4×4 2D arrangement of PEs:

PE(0,0)  PE(0,1)  PE(0,2)  PE(0,3)
PE(1,0)  PE(1,1)  PE(1,2)  PE(1,3)
PE(2,0)  PE(2,1)  PE(2,2)  PE(2,3)
PE(3,0)  PE(3,1)  PE(3,2)  PE(3,3)
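My reading of this mapping is a 2D round-robin (cyclic) distribution; the mod-4 owner function below is an assumption, not stated explicitly on the slide:

    #include <stdio.h>

    #define NR 4   /* PE grid is NR x NR */

    /* Assumed 2D cyclic owner function: element (i,j) lives in
     * PE(i % NR, j % NR), so each PE of the 4x4 grid holds a
     * 4x4 piece of a 16x16 matrix. */
    int main(void) {
        int i = 5, j = 7;
        printf("A(%d,%d) lives in PE(%d,%d)\n", i, j, i % NR, j % NR);
        return 0;
    }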


Rank-1 Update on the LAC

At step i, each PE(r,c) performs one multiply-accumulate:

c11 += a1i × bi1    c12 += a1i × bi2    c13 += a1i × bi3    c14 += a1i × bi4
c21 += a2i × bi1    c22 += a2i × bi2    c23 += a2i × bi3    c24 += a2i × bi4
c31 += a3i × bi1    c32 += a3i × bi2    c33 += a3i × bi3    c34 += a3i × bi4
c41 += a4i × bi1    c42 += a4i × bi2    c43 += a4i × bi3    c44 += a4i × bi4

Orange: elements of A; Green: elements of B; Blue: elements of C
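A small C sketch of one such step (my paraphrase of the broadcast buses, not the authors' RTL): a_col models the values on the row buses (column i of A), b_row the values on the column buses (row i of B), and every PE performs a single MAC:

    #include <stdio.h>

    #define NR 4

    /* One rank-1-update step on the NR x NR PE grid:
     * every PE(r,c) computes c_{r,c} += a_{r,i} * b_{i,c}. */
    static void rank1_step(double Cacc[NR][NR],
                           const double a_col[NR],  /* column i of A */
                           const double b_row[NR])  /* row i of B    */
    {
        for (int r = 0; r < NR; r++)
            for (int c = 0; c < NR; c++)
                Cacc[r][c] += a_col[r] * b_row[c];   /* one MAC per PE */
    }

    int main(void) {
        double C[NR][NR] = {{0}};
        double a[NR] = {1, 2, 3, 4}, b[NR] = {1, 0, 1, 0};
        rank1_step(C, a, b);              /* C now holds a*b^T */
        printf("c21 = %.1f\n", C[1][0]);  /* 2 * 1 = 2.0 */
        return 0;
    }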

GEMM on LAP


Multi-LAC on a Chip

• The same panel of B is shared by all cores
• On-chip memory stores a complete n×n block of C
• Each core computes a different panel of C

[Figure: three LACs (LAC 0, LAC 1, LAC 2), each with its own local memory]
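A toy sketch of the work split these bullets describe (core and panel counts are illustrative): row panels of C are dealt out round-robin to the LACs while all of them read the one shared panel of B:

    #include <stdio.h>

    #define NCORES  3   /* number of LACs (illustrative)  */
    #define NPANELS 9   /* row panels of C (illustrative) */

    /* Each LAC multiplies its own A/C panels against the one
     * shared B panel; only A and C traffic is private per core. */
    int main(void) {
        for (int p = 0; p < NPANELS; p++)
            printf("C panel %d -> LAC %d\n", p, p % NCORES);
        return 0;
    }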

Outline

• Motivation and Vision
• Related Work and Background
• Linear Algebra Core
• Power/Performance Analysis
• Conclusion and Future Work


Performance and Power Analysis

• Analytical formulae (see the sketch after this list)
  – Utilization
  – Bandwidth
  – Size of local stores
• Cycle-accurate simulator
  – Matrix multiplication
  – Cholesky factorization
• Component selection
  – MAC units (45nm) [Galal et al. 2010]
  – Storage modeled with CACTI 6.0 (pure SRAM model)
  – Interconnect: AMBA AHB [Lahiri 2004], [Wolkotte 2009]
  – Component activity based on GEMM
  – Leakage estimated as 25-30% of dynamic power
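As a concrete (but generic) instance of the analytical side, a double-buffering overlap model of utilization; this is my sketch of the idea, not the paper's exact formulae, and the 32 GFLOPS / 33 GB/s inputs are only borrowed from numbers elsewhere in the deck:

    #include <math.h>
    #include <stdio.h>

    /* Overlap model: with double buffering, compute and transfer
     * run in parallel, so the slower one bounds throughput.
     * Returns the achieved fraction of peak. */
    static double utilization(double flops, double bytes,
                              double peak_gflops, double bw_gbs) {
        double t_compute  = flops / (peak_gflops * 1e9);
        double t_transfer = bytes / (bw_gbs * 1e9);
        return t_compute / fmax(t_compute, t_transfer);
    }

    int main(void) {
        double n = 256.0;                 /* block dimension      */
        double flops = 2.0 * n * n * n;   /* 2n^3 flops           */
        double bytes = 3.0 * n * n * 8.0; /* stream A, B, C in DP */
        printf("utilization = %.2f\n",
               utilization(flops, bytes, 32.0, 33.0));
        return 0;
    }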

Core Utilization Trade-off

• Bandwidth vs. local-store size trade-off at 100% utilization
• Core-dimension trade-off

Multi-LAC Solution Trade-off

• On-chip memory size limits performance
• The on-chip bandwidth requirement grows exponentially to maintain peak performance

Performance vs. External Bandwidth

• 33 GB/s off-chip bandwidth
• Over 600 DP-GFLOPS
• Over 90% utilization

[Plot: performance vs. external bandwidth for 256×256 / 512×512 / 768×768 / 1024×1024 on-chip blocks]

PE Efficiency for Different Frequencies

• Area: mostly occupied by SRAM
• Power: mostly consumed by the MAC units
• 120 GFLOPS/W: upper limit for an SP PE
• 60 GFLOPS/W: upper limit for a DP PE
• 1 GHz is the sweet spot of performance vs. efficiency
• At low voltages, SRAM power consumption limits efficiency

LAP vs. Intel® Core 2 Duo (Penryn)

• Power breakdown [V. George et al. 2007]
• Out-of-order and front-end logic: 40% of core power (over 5 W)
• Execution logic and register file

LAP vs. Nvidia GTX280 (Tesla)

• Single-precision GEMM

LAP vs. Nvidia GTX480 (Fermi)

Summary of the LAP

• 600/1200 DP/SP GFLOPS
• One/two orders of magnitude improvement vs. GPUs/CPUs

GEMM Performance and Efficiency on Different Platforms

Platform                 GFLOPS  W/mm2  GFLOPS/mm2  GFLOPS/W  Utilization
Cell BE (SP)                200    0.3        1.5         5        88%
Nvidia GTX480 SM (SP)       780    0.2        0.9       5.2        70%
Nvidia GTX480 SM (DP)       390    0.2        0.5       2.6        70%
Intel Core i7-960 (SP)       96    0.4        0.5       1.2        95%
Intel Core i7-960 (DP)       48    0.4       0.25       0.6        95%
Altera Stratix IV (DP)      100   0.02       0.05       3.5       90+%
ClearSpeed CSX700 (DP)       75   0.02        0.2      12.5        78%
LAP (SP)                   1200    0.2       6-11        55       90+%
LAP (DP)                    600    0.2        3-5        25       90+%

Outline

• Motivation and Vision
• Related Work and Background
• Linear Algebra Core
• Power/Performance Analysis
• Conclusion and Future Work


Conclusion

• Linear Algebra Processor
  – Algorithm/architecture co-design
  – Power and efficiency estimation
  – Generalized to more complex algorithms (Cholesky)
• Results @ 1 GHz (one LAC)
  – DP: 32 GFLOPS, 47 GFLOPS/W
  – 0.6 W, 2.8 mm² in 45nm
  – 4 GB/s external bandwidth
  – Orders of magnitude improvement

Conclusion

• Studied architectures and their sources of power consumption

Future Work

• Implementation
  – Hardware synthesis
• Generalization
  – Level-3 BLAS
  – LU and QR factorization
• Integration within a general-purpose framework
• Design space exploration
  – Picking the right algorithm variant
