
DSPs: ubiquity, high performance, low power
• Ubiquity: Present in millions of commodity devices.
• High performance: Recently added floating-point capabilities to support 4G networks.
• Low power: Promising 10 W per chip.

Why dense linear algebra (DLA)?
• Representative of the potential performance of architectures.
• Base for higher-level libraries and applications.

Related previous work
• First attempt to port DLA to DSPs.
• libflame's SuperMatrix previously ported to multi-core and multi-GPU. The DSP port serves as a proof of flexibility.
• Same methodology, same codes for all architectures.

Option 1: Manual OpenMP parallelization

Pros: No runtime overhead.

Cons: Programmability. Every BLAS/LAPACK routine must be parallelized by hand, as sketched below.
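A minimal sketch of the manual approach, assuming column-major (BLAS) storage: each core computes one horizontal panel of C = A * B + C. The name sgemm_single_core() is a hypothetical stand-in for the tuned single-core kernel described on this poster, declared here only for illustration.

/*
 * Option 1 sketch: manual OpenMP parallelization of GEMM.
 * Assumes column-major storage; sgemm_single_core() is a hypothetical
 * name for the tuned single-core kernel, provided elsewhere.
 */
#include <omp.h>

void sgemm_single_core(int m, int n, int k,
                       const float *A, int lda,
                       const float *B, int ldb,
                       float *C, int ldc);      /* tuned kernel (assumed) */

void sgemm_manual_omp(int m, int n, int k,
                      const float *A, int lda,
                      const float *B, int ldb,
                      float *C, int ldc)
{
    #pragma omp parallel num_threads(8)         /* one thread per C66x core */
    {
        int nth  = omp_get_num_threads();
        int tid  = omp_get_thread_num();
        int rows = (m + nth - 1) / nth;         /* rows of C per core       */
        int i0   = tid * rows;
        int ib   = (i0 + rows > m) ? m - i0 : rows;

        if (ib > 0)
            sgemm_single_core(ib, n, k,
                              &A[i0], lda,      /* row panel of A           */
                              B, ldb,
                              &C[i0], ldc);     /* corresponding panel of C */
    }
}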


Exploring the capabilities of multi-core DSPs for dense linear algebra
Francisco D. Igual (Texas Advanced Computing Center)
Murtaza Ali (Texas Instruments)

More information...

GEMM on a single core

Texas Instruments C66x

GotoBLAS approach
• High-performance implementation of BLAS on multiple architectures.
• High-level, multi-layer approach that exploits the multiple cache levels.
• We map this approach to the TI DSP at the single-core level.

Conclusions

Core capabilities (up to 8 cores)

Floating-point capabilities:

• 16 flops/cycle (single precision).
• SIMD support up to 128 bits.

Eight functional units, two register files, two data paths:
• .M units perform all multiply operations.
• .S and .L units perform general arithmetic, logical, and branch operations.
• .D units perform loads and stores.
• Two general-purpose register files with 32 32-bit registers each.

Memory capabilities
• 32 KB of L1P and 32 KB of L1D cache per core.
• 4096 KB of L2 memory (512 KB per core), usable as SRAM, cache, or a mix of both.
• 4096 KB of shared MSM (Multicore Shared Memory), usable as SRAM or L2.
• 64-bit DDR3 external memory interface at 1600 MHz with ECC.
• Full flexibility to view caches as pure scratchpad memories.

Programmability and multi-thread support
• Native C/C++ code is supported by the TI C/C++ compiler.
• Intrinsics and vector datatypes to improve performance (see the sketch below).
• Code Composer Studio as IDE.
• Multi-thread support via OpenMP 3.0.
• No cache coherency between cores.
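As a small illustration (not the poster's actual kernel), plain C can be written so the TI compiler software-pipelines and vectorizes it. The restrict keyword, _nassert() and #pragma MUST_ITERATE are documented TI C6000 compiler hints; the loop below is only an assumed example of their use.

/*
 * Illustrative only: a simple single-precision AXPY with TI compiler hints.
 * The real inner kernel on this poster uses hand-tuned vector intrinsics.
 */
void saxpy_hinted(int n, float alpha,
                  const float *restrict x, float *restrict y)
{
    _nassert(((unsigned)x & 7u) == 0);   /* promise 8-byte alignment       */
    _nassert(((unsigned)y & 7u) == 0);
    #pragma MUST_ITERATE(8, , 8)         /* trip count is a multiple of 8  */
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}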

[Figure: block diagram of the TI C66x DSP core — instruction fetch, 16/32-bit instruction dispatch, and instruction decode; data paths A and B, each with a register file (A0–A31, B0–B31) and .L/.S/.M/.D functional units; 32 KB L1P and 32 KB L1D with program and data memory controllers (PMC, DMC); 512 KB L2 cache/SRAM with a unified memory controller (UMC); 4096 KB MSM SRAM; extended and external memory controllers (XMC, EMC); DMA switch fabric; DDR3 external memory.]

[Figure: GEMM on TI C66x DSP — one core, 1 GHz. GFLOPS (0–16) versus matrix size (m=n=k, up to 4000) for the GEMM NN, NT, TT, and TN cases.]

[Figure: GEMM on TI C66x DSP — multiple cores, 1 GHz. GFLOPS (0–80) versus matrix size (m=n=k, up to 4000) for 1, 2, 4, and 8 cores.]

[Figure: GEMM on TI C66x DSP — multiple cores using SuperMatrix, 1 GHz. GFLOPS (0–80) versus matrix size (m=n=k, up to 4000) for 1, 2, 4, and 8 cores with SuperMatrix, compared against the manual 8-core version.]

Mapping GotoBLAS to the TI C66x DSP
• TI DSP allows allocation of buffers in L1, L2, MSMC, and DDR (see the placement sketch below).
• Use of DMA to overlap computation and transfers between memory layers.
• Highly tuned inner kernel using vector intrinsics.
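One way to pin the packed buffers to specific memory levels is the TI compiler's DATA_SECTION / DATA_ALIGN pragmas. The section names ".l2sram" and ".msmcsram" and the blocking sizes MC/KC/NC below are assumptions for illustration; the sections must be defined in the platform's linker command file and mapped onto L2 SRAM and MSMC SRAM.

/*
 * Sketch: place the packed block of A in L2 SRAM (per core) and the packed
 * panel of B in the shared MSMC SRAM. Section names and sizes are assumed.
 */
#define MC 128   /* example blocking sizes, not the tuned values */
#define KC 256
#define NC 128

#pragma DATA_SECTION(Apack, ".l2sram")    /* per-core packed block of A */
#pragma DATA_ALIGN(Apack, 8)
float Apack[MC * KC];

#pragma DATA_SECTION(Bpack, ".msmcsram")  /* shared packed panel of B   */
#pragma DATA_ALIGN(Bpack, 8)
float Bpack[KC * NC];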

Option 2: Automatic runtime-based parallelization

libflame SuperMatrix

Pros: Programmability. Existing sequential DLA implementations parallelize automatically on the DSP.

Cons: Runtime overhead, especially for small matrices.
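A hedged sketch of how the runtime-based option is used: the sequential libflame call stays unchanged, and the SuperMatrix runtime turns it into a DAG of tasks executed by the DSP cores. The FLASH_* names follow libflame's storage-by-blocks interface, but the exact signatures and the point at which the DAG is dispatched are assumptions that should be checked against the library headers.

/*
 * Option 2 sketch: the same sequential DLA call used on every architecture,
 * executed through the SuperMatrix runtime. Signatures are assumed.
 */
#include "FLAME.h"

void gemm_with_supermatrix(FLA_Obj A, FLA_Obj B, FLA_Obj C, int n_cores)
{
    FLASH_Queue_set_num_threads(n_cores);  /* one worker per DSP core        */
    FLASH_Queue_enable();                  /* enqueue tasks, build the DAG   */

    FLASH_Gemm(FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
               FLA_ONE, A, B, FLA_ONE, C); /* sequential-looking GEMM call   */

    FLASH_Queue_exec();                    /* dispatch the DAG to the cores  */
    FLASH_Queue_disable();
}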

Architecture     GFLOPS   GFLOPS/Watt   Utilization
Core i7-960          96          1.14           95%
Nvidia GTX280       410          2.6            66%
Cell                200          5.0            88%
Nvidia GTX480       940          5.4            70%
Stratix IV          200          7              90+%
TI C66x DSP          74          7.4            57%

Overlapping of DMA transfers to on-chip memories and computation in an iteration of a GEPP operation. Blocks in blue indicate an ongoing DMA transfer, blocks in green indicate packed buffers, and blocks in red indicate actual computation.
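A sketch of the double-buffering pattern in the caption above: while the kernel computes on one packed block, the DMA engine fetches the next one. dma_issue_copy(), dma_wait(), and compute_on_block() are hypothetical wrappers standing in for EDMA setup and the inner kernel, not a real TI API.

/*
 * Double-buffered GEPP pipeline sketch: compute on buffer `cur` while the
 * DMA engine fills the other buffer with the next block.
 */
#include <stddef.h>

void dma_issue_copy(void *dst, const void *src, size_t nbytes);  /* hypothetical */
void dma_wait(void);                                             /* hypothetical */
void compute_on_block(const float *block);                       /* hypothetical */

void gepp_double_buffered(int nblocks, const float *A_ddr,
                          float *Apack[2], size_t block_floats)
{
    int cur = 0;
    /* Prefetch block 0 into the first on-chip buffer. */
    dma_issue_copy(Apack[cur], A_ddr, block_floats * sizeof(float));

    for (int i = 0; i < nblocks; i++) {
        dma_wait();                               /* block i has arrived       */
        if (i + 1 < nblocks)                      /* start fetching block i+1  */
            dma_issue_copy(Apack[1 - cur],
                           A_ddr + (size_t)(i + 1) * block_floats,
                           block_floats * sizeof(float));
        compute_on_block(Apack[cur]);             /* overlaps with the DMA     */
        cur = 1 - cur;                            /* swap buffers              */
    }
}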

Experimental results on one core

Experimental results on multiple cores

Allocate packed buffers Apack and Bpack
Partition in the K dimension
for p = 0 : K-1 do
    Pack B_p into Bpack
    Partition in the M dimension
    for i = 0 : M-1 do
        Pack and transpose A_{i,p} into Apack
        sgemmKernel(...)
    end for
end for
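A C rendering of the pseudocode above, assuming column-major (BLAS) storage. pack_B(), pack_and_transpose_A(), and the sgemmKernel() named on the poster are declared with assumed signatures for illustration only.

/*
 * Blocked GEMM loop structure mirroring the pseudocode. Signatures of the
 * packing routines and inner kernel are assumptions.
 */
void pack_B(int kb, int n, const float *B, int ldb, float *Bpack);
void pack_and_transpose_A(int ib, int kb, const float *A, int lda, float *Apack);
void sgemmKernel(int ib, int n, int kb,
                 const float *Apack, const float *Bpack, float *C, int ldc);

void gemm_blocked(int m, int n, int k,
                  const float *A, int lda,
                  const float *B, int ldb,
                  float *C, int ldc,
                  float *Apack, float *Bpack,  /* packed buffers (e.g. L2/MSMC) */
                  int mc, int kc)              /* blocking sizes                */
{
    for (int p = 0; p < k; p += kc) {          /* partition in the K dimension  */
        int kb = (p + kc > k) ? k - p : kc;
        pack_B(kb, n, &B[p], ldb, Bpack);      /* pack B_p into Bpack           */

        for (int i = 0; i < m; i += mc) {      /* partition in the M dimension  */
            int ib = (i + mc > m) ? m - i : mc;
            pack_and_transpose_A(ib, kb, &A[i + p * lda], lda, Apack);
            sgemmKernel(ib, n, kb, Apack, Bpack, &C[i], ldc);
        }
    }
}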

[Diagram: partitioning of C = A B for the blocked algorithm — A split into panels A_0, A_1, A_2 and B into panels B_0, B_1, B_2; within an iteration, block A_{i,p} is packed into Apack, the current panel B_p into Bpack, and the corresponding blocks C_0, C_1, C_2 of C are updated.]

• GotoBLAS ideas can be directly ported to the single-core DSP.
• Memory hierarchy flexibility and DMA transfers.
• Excellent performance per core (10 GFLOPS).
• Scalability to 8 cores (74 GFLOPS).

• libflame + DSP.
• One code, one library, multiple architectures.
• Successfully ported to multi-core DSPs.
• No host needed: a purely standalone, low-power DLA solution.
• Competitive asymptotic performance.
• Excellent GFLOPS/Watt ratio.

[Diagram: GotoBLAS layered decomposition — GEMM (General Matrix-Matrix Multiplication) is split into GEPP (General Panel-Panel Multiplication), which is split into GEBP (General Block-Panel Multiplication), with operands staged through the DDR, MSMC, L2, and L1 memory levels.]

Visit http://www.cs.utexas.edu/~flame

[Diagram: manual OpenMP parallelization of C = A B, with the work partitioned across Core0–Core7.]

[Diagram: FLAME algorithm → sequential FLAME/C code → hierarchical matrices (storage by blocks) → DAG of tasks → SuperMatrix runtime.]

From algorithms to high performance multi-core implementations

Runtime features:
• Unlike previous solutions, no host is needed.
• The runtime runs exclusively on the DSP cores.
• Shares infrastructure with the multi-core and multi-GPU versions of libflame.

Same DLA codes target multiple architectures, including DSP.