DSPs: ubiquity, high performance, low power
• Ubiquity: present in millions of commodity devices.
• High performance: recently added floating-point capabilities to support 4G networks.
• Low power: promising 10 W per chip.
Why dense linear algebra (DLA)?
• Representative of the potential performance of architectures.
• Base for higher-level libraries and applications.
Related previous work
• First attempt to port DLA to DSPs.
• libflame’s SuperMatrix previously ported to multi-core and multi-GPU; the DSP port is a proof of flexibility.
• Same methodology, same codes for all architectures.
Option 1: Manual OpenMP parallelization
Pros: No runtime overhead.
Cons: Poor programmability; every BLAS/LAPACK routine must be parallelized by hand (a sketch follows below).
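To make Option 1 concrete, the sketch below shows a hand-parallelized single-precision AXPY in C with OpenMP; the routine is illustrative (not the poster's code), and every BLAS/LAPACK routine would need a similar explicit annotation.

#include <omp.h>

/* Hand-parallelized single-precision AXPY: y := alpha*x + y.
   Under Option 1, every BLAS/LAPACK routine must be annotated
   explicitly like this. (Illustrative sketch.) */
void saxpy_omp( int n, float alpha, const float *x, float *y )
{
    int i;
    #pragma omp parallel for
    for ( i = 0; i < n; i++ )
        y[ i ] += alpha * x[ i ];
}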
Exploring the capabilities of multi-core DSPs for dense linear algebra
Francisco D. Igual (Texas Advanced Computing Center)
Murtaza Ali (Texas Instruments)
GEMM on a single core
GotoBLAS approach
• High-performance implementation of the BLAS on multiple architectures.
• High-level, multi-layer approach that takes advantage of multiple cache levels.
• We map this approach to the TI DSP at the single-core level.
Texas Instruments C66x
Core capabilities (up to 8 cores)
Floating-point capabilities:
• 16 flops/cycle (single precision).
• SIMD support up to 128 bits.
Eight functional units, two register files, two data paths:
• .M units perform all multiply operations.
• .S and .L units perform general arithmetic, logical, and branch operations.
• .D units perform loads and stores.
• Two general-purpose register files with 32 32-bit registers each.
Memory capabilities:
• 32 KB of L1P and 32 KB of L1D cache.
• 4096 KB of L2 memory (512 KB per core), usable as SRAM, cache, or a mix of both.
• 4096 KB of shared MSM (Multicore Shared Memory), usable as SRAM or L2.
• 64-bit DDR3 external memory interface at 1600 MHz with ECC.
• Full flexibility to treat caches as pure scratchpad memories.
Programmability and multi-thread support
• Native C/C++ code is supported by the TI C/C++ compiler.
• Intrinsics and vector datatypes to improve performance (a tuning sketch follows below).
• Code Composer Studio as IDE.
• Multi-thread support via OpenMP 3.0.
• No cache coherency between cores.
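As a taste of that C-level tuning, here is a minimal sketch built on two documented TI compiler features, the _nassert intrinsic and the MUST_ITERATE pragma; the routine itself is illustrative, and the real GEMM inner kernel additionally relies on vector intrinsics.

/* Illustrative inner loop tuned for the TI C/C++ compiler.
   _nassert() promises pointer alignment and MUST_ITERATE bounds the
   trip count, helping the compiler software-pipeline and vectorize. */
void scale_add( int n, float alpha, const float * restrict x,
                float * restrict y )
{
    int i;
    _nassert( ( (unsigned)x & 7 ) == 0 );  /* x is 8-byte aligned  */
    _nassert( ( (unsigned)y & 7 ) == 0 );  /* y is 8-byte aligned  */
    #pragma MUST_ITERATE( 8, , 8 )         /* n is a multiple of 8 */
    for ( i = 0; i < n; i++ )
        y[ i ] = alpha * x[ i ] + y[ i ];
}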
[Figure: TI C66x DSP core block diagram. Instruction fetch, 16/32-bit instruction dispatch, and instruction decode feed two data paths (A and B), each with a register file of 32 32-bit registers (A0–A31, B0–B31) and four functional units (.L, .S, .M, .D). The Program Memory Controller (PMC) fronts 32 KB of L1P, the Data Memory Controller (DMC) 32 KB of L1D, and the Unified Memory Controller (UMC) 512 KB of L2 cache/SRAM; the Extended Memory Controller (XMC) reaches the 4096 KB MSM SRAM and the External Memory Controller (EMC) the DDR3 interface, all connected through the DMA switch fabric.]
[Figure: GEMM on TI C66x DSP, one core at 1 GHz. GFLOPS (0–16) vs. matrix size (m = n = k, up to 4000) for the GEMM NN, NT, TT, and TN cases.]
[Figure: GEMM on TI C66x DSP, multiple cores at 1 GHz. GFLOPS (0–80) vs. matrix size (m = n = k, up to 4000) for 1, 2, 4, and 8 cores.]
[Figure: GEMM on TI C66x DSP, multiple cores using SuperMatrix at 1 GHz. GFLOPS (0–80) vs. matrix size (m = n = k, up to 4000) for 1, 2, 4, and 8 cores under SuperMatrix, plus 8 cores with manual parallelization.]
Mapping GotoBLAS to the TI C66x DSP
• The TI DSP allows allocation of buffers in L1, L2, MSMC, and DDR.
• DMA is used to overlap computation with transfers between memory layers (see the double-buffering sketch below).
• Highly tuned inner kernel using vector intrinsics.
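That overlap amounts to classic double buffering. In the sketch below, dma_copy_async, dma_wait, and compute_on_panel are hypothetical helpers standing in for the TI EDMA API and the packed-panel kernel; only the structure of the loop reflects the scheme described above.

/* Double-buffered panel streaming (sketch). While the kernel works on
   buf[cur], the DMA engine fills buf[1-cur] with the next panel. */
extern void dma_copy_async( int ch, float *dst, const float *src, int n );
extern void dma_wait( int ch );                  /* hypothetical helpers */
extern void compute_on_panel( const float *panel, int n );

void stream_panels( const float *ddr_src, int npanels, int panel_elems,
                    float *buf[2] )
{
    int p, cur = 0;
    dma_copy_async( cur, buf[cur], ddr_src, panel_elems ); /* prefetch 0 */
    for ( p = 0; p < npanels; p++ ) {
        int nxt = 1 - cur;
        dma_wait( cur );                       /* panel p is now on-chip */
        if ( p + 1 < npanels )                 /* start fetching p + 1   */
            dma_copy_async( nxt, buf[nxt],
                            ddr_src + ( p + 1 ) * panel_elems, panel_elems );
        compute_on_panel( buf[cur], panel_elems ); /* overlaps the DMA   */
        cur = nxt;
    }
}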
Option 2: Automatic runtime-based parallelization
libflame SuperMatrix
Pros: Programmability. Existing sequential DLA implementations parallelize automatically on the DSP (see the sketch below).
Cons: Runtime overhead, especially for small matrices.
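From the programmer's perspective, Option 2 looks like the sequential code below. This is a sketch assuming libflame's FLASH interface (FLASH_Obj_create, FLASH_Gemm, FLASH_Queue_set_num_threads); the problem and block sizes are illustrative choices.

#include "FLAME.h"

int main( void )
{
    FLA_Obj A, B, C;
    dim_t   n = 2048, b = 128;        /* illustrative problem/block sizes */

    FLA_Init();
    FLASH_Queue_set_num_threads( 8 ); /* one worker per DSP core */

    /* Hierarchical (storage-by-blocks) matrices, one level deep;
       contents left uninitialized for brevity. */
    FLASH_Obj_create( FLA_FLOAT, n, n, 1, &b, &A );
    FLASH_Obj_create( FLA_FLOAT, n, n, 1, &b, &B );
    FLASH_Obj_create( FLA_FLOAT, n, n, 1, &b, &C );

    /* A sequential-looking call: SuperMatrix builds the task DAG and
       schedules the blocks across the cores at runtime. */
    FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                FLA_ONE, A, B, FLA_ONE, C );

    FLASH_Obj_free( &A );
    FLASH_Obj_free( &B );
    FLASH_Obj_free( &C );
    FLA_Finalize();
    return 0;
}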
Architecture     GFLOPS   GFLOPS/Watt   Utilization
Core i7-960          96       1.14          95%
Nvidia GTX280       410       2.6           66%
Cell                200       5.0           88%
Nvidia GTX480       940       5.4           70%
Stratix IV          200       7             90+%
TI C66x DSP          74       7.4           57%
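The TI C66x DSP row is consistent with the poster's stated power budget: 74 GFLOPS / 10 W = 7.4 GFLOPS/Watt.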
Overlapping of DMA transfers to on-chip memories and computation in an iteration of a GEPP operation. Blocks in blue indicate an ongoing DMA transfer, blocks in green indicate packed buffers, and blocks in red indicate actual computation.
Allocate packed buffers Apack and Bpack
Partition in the K dimension
for p = 0:K-1 do
    Pack Bp into Bpack
    Partition in the M dimension
    for i = 0:M-1 do
        Pack and transpose Ai,p into Apack
        sgemmKernel(...)
    end for
end for
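A C rendering of this loop might look as follows; pack_B, pack_and_transpose_A, and the calling convention are illustrative placeholders, with sgemmKernel standing for the tuned inner kernel named above. Matrices are assumed column-major.

extern void pack_B( float *Bpack, const float *B, int kb, int n, int ldb );
extern void pack_and_transpose_A( float *Apack, const float *A,
                                  int mb, int kb, int lda );
extern void sgemmKernel( const float *Apack, const float *Bpack,
                         float *C, int mb, int n, int kb, int ldc );

/* Blocked GEMM driver following the GotoBLAS scheme: each panel Bp is
   packed once per K-iteration; each block Ai,p is packed (transposed)
   per M-iteration and fed to the tuned inner kernel. */
void sgemm_blocked( int m, int n, int k,
                    const float *A, int lda, const float *B, int ldb,
                    float *C, int ldc,
                    float *Apack, float *Bpack, int MB, int KB )
{
    int p, i;
    for ( p = 0; p < k; p += KB ) {              /* K dimension */
        int kb = ( k - p < KB ) ? k - p : KB;
        pack_B( Bpack, &B[ p ], kb, n, ldb );    /* pack Bp into Bpack */
        for ( i = 0; i < m; i += MB ) {          /* M dimension */
            int mb = ( m - i < MB ) ? m - i : MB;
            pack_and_transpose_A( Apack, &A[ i + p * lda ], mb, kb, lda );
            sgemmKernel( Apack, Bpack, &C[ i ], mb, n, kb, ldc );
        }
    }
}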
[Figure: partitioning for the blocked GEMM. A is split into column panels A0, A1, A2 and B into row panels B0, B1, B2; in iteration p, panel Bp is packed into Bpack and each block Ai,p is packed into Apack to update the corresponding block Ci of C.]
Conclusions
• GotoBLAS ideas can be ported directly to a single DSP core, exploiting the flexible memory hierarchy and DMA transfers.
• Excellent performance per core (10 GFLOPS) and scalability to 8 cores (74 GFLOPS).
• libflame + DSP: one code, one library, multiple architectures; successfully ported to multi-core DSPs.
• No host needed: a purely standalone, low-power DLA solution.
• Competitive asymptotic performance and an excellent GFLOPS/Watt ratio.
[Figure: mapping of the GotoBLAS layered approach onto the memory hierarchy (L1, L2, MSMC, DDR). GEMM (general matrix-matrix multiplication) is decomposed into GEPP (general panel-panel multiplication) operations, and each GEPP into GEBP (general block-panel multiplication) operations.]
Visit http://www.cs.utexas.edu/~flame
[Figure: the eight DSP cores (Core0–Core7) as targeted by OpenMP and by the SuperMatrix runtime.]
[Figure: the SuperMatrix path from algorithms to parallel execution for C := A B: FLAME algorithm → sequential FLAME/C code → hierarchical matrices (storage by blocks) → DAG of tasks → SuperMatrix runtime.]
From algorithms to high-performance multi-core implementations
Runtime features:
• Unlike previous solutions, no host is needed: the runtime runs exclusively on the DSP cores.
• Shares infrastructure with the multi-core and multi-GPU versions of libflame.
• The same DLA codes target multiple architectures, including the DSP.