DSPs: ubiquity, high performance, low power
• Ubiquity: present in millions of commodity devices.
• High performance: recently added floating-point capabilities to support 4G networks.
• Low power: promising roughly 10 W per chip.

Why dense linear algebra (DLA)?
• Representative of the potential performance of an architecture.
• Basis for higher-level libraries and applications.

Related previous work
• First attempt to port DLA to DSPs.
• libflame’s SuperMatrix previously ported to multi-core and multi-GPU systems; the DSP port is a proof of its flexibility.
• Same methodology and the same codes for all architectures.

Option 1: Manual OpenMP parallelization

Pros: No runtime overhead.

Cons: Programmability; every BLAS/LAPACK routine must be parallelized by hand (see the sketch below).
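A minimal sketch of what this manual approach looks like for GEMM, assuming a tuned single-core routine named sgemm_single_core (a hypothetical name standing in for the kernel described in the GotoBLAS section): the M dimension is split across the cores with an OpenMP parallel for. Every BLAS/LAPACK routine would need similar hand treatment.

/* Minimal sketch of Option 1: manually splitting GEMM across cores with
 * OpenMP.  sgemm_single_core() is a hypothetical name for the tuned
 * single-core GEMM; column-major storage, C = A*B, no alpha/beta
 * handling. */
#include <omp.h>

void sgemm_single_core(int m, int n, int k,
                       const float *A, int lda,
                       const float *B, int ldb,
                       float *C, int ldc);

void sgemm_omp(int m, int n, int k,
               const float *A, int lda,
               const float *B, int ldb,
               float *C, int ldc)
{
    int nth = omp_get_max_threads();          /* typically 8 on this device */
    #pragma omp parallel for
    for (int t = 0; t < nth; t++) {
        int m0 = (m * t) / nth;               /* rows [m0, m1) of A and C   */
        int m1 = (m * (t + 1)) / nth;
        if (m1 > m0)
            sgemm_single_core(m1 - m0, n, k, &A[m0], lda, B, ldb, &C[m0], ldc);
    }
}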

Introduction
Retargeting to multi-core
Power efficiency

Exploring the capabilities of multi-core DSPs for dense linear algebra

Francisco D. Igual, Texas Advanced Computing Center
Murtaza Ali, Texas Instruments

More information...

GEMM on a single core

Texas Instruments C66x

GotoBLAS approach
• High-performance implementation of BLAS on multiple architectures.
• High-level, multi-layer approach that exploits the multiple cache levels.
• We map this approach to the TI DSP at the single-core level.

Conclusions

Core capabilities (up to 8 cores)
Floating-point capabilities:
• 16 flops/cycle (single precision); see the peak-rate arithmetic below.
• SIMD support up to 128 bits.

Eight functional units, two register files, two data paths:
• .M units perform all multiply operations.
• .S and .L units perform general arithmetic, logical and branch operations.
• .D units perform loads and stores.
• Two general-purpose register files with 32 32-bit registers each.
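As a back-of-the-envelope check (assuming the 1 GHz clock used in the performance plots and the roughly 10 W per-chip figure quoted above), these per-cycle rates are consistent with the measured GEMM numbers and with the utilization and GFLOPS/Watt values in the power-efficiency table:

\begin{align*}
\text{Peak, 1 core}  &= 16\ \text{flops/cycle} \times 1\ \text{GHz} = 16\ \text{GFLOPS}\\
\text{Peak, 8 cores} &= 8 \times 16\ \text{GFLOPS} = 128\ \text{GFLOPS}\\
\text{Utilization}   &= 74 / 128 \approx 57\%\\
\text{GFLOPS/Watt}   &= 74\ \text{GFLOPS} / 10\ \text{W} \approx 7.4
\end{align*}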

Memory capabilities
• 32 KB of L1P and 32 KB of L1D cache per core.
• 4096 KB of L2 memory (512 KB per core), usable as SRAM, cache, or a mix of both.
• 4096 KB of shared MSM (Multicore Shared Memory) SRAM, usable as SRAM or as L2.
• 64-bit DDR3 external memory interface at 1600 MHz with ECC.
• Full flexibility to treat the caches as pure scratchpad memories (see the buffer-placement sketch below).
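As an illustration of how the on-chip memories can be used as scratchpads from C, here is a minimal sketch that pins packed panels in dedicated sections with the TI compiler's DATA_SECTION pragma. The section names (".l2sram", ".msmc") and the block sizes are assumptions; the sections must be defined in the linker command file and mapped to L2 SRAM and MSM SRAM.

/* Minimal sketch: pinning packed buffers in on-chip memory with the
 * TI C6000 compiler.  The section names ".l2sram" and ".msmc" are
 * assumptions -- they must exist in the linker command file and be
 * mapped to L2SRAM and MSMCSRAM, respectively.  Block sizes are
 * illustrative, not the tuned values used in the actual library. */
#define KC 256
#define MC 128
#define NC 1024

#pragma DATA_SECTION(A_pack, ".l2sram")   /* per-core private packed A panel */
#pragma DATA_ALIGN(A_pack, 8)
float A_pack[MC * KC];

#pragma DATA_SECTION(B_pack, ".msmc")     /* shared packed B panel in MSM SRAM */
#pragma DATA_ALIGN(B_pack, 8)
float B_pack[KC * NC];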

Programmability and multi-thread support
• Native C/C++ code is supported by the TI C/C++ compiler.
• Intrinsics and vector data types to improve performance (see the kernel sketch below).
• Code Composer Studio as IDE.
• Multi-thread support: OpenMP 3.0.
• No cache coherency between cores.
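A minimal sketch of a vectorized inner loop using the C66x SIMD intrinsics. The intrinsic and type names used here (__float2_t, _ftof2, _dmpysp, _daddsp, _hif2, _lof2, and the c6x.h header) follow the TI C6000 compiler's intrinsics set, but treat the exact names and signatures as assumptions to verify against the compiler manual; the real GEMM micro-kernel is considerably more elaborate (register blocking, aligned vector loads, software pipelining).

/* Dot product of two float arrays, two elements at a time, using
 * two-way SIMD single-precision multiply/add intrinsics.  Assumes n
 * is a multiple of 2.  Intrinsic names are assumptions -- check the
 * TI compiler documentation. */
#include <c6x.h>

float dot_c66x(const float *x, const float *y, int n)
{
    __float2_t acc = _ftof2(0.0f, 0.0f);          /* two partial sums        */
    for (int i = 0; i < n; i += 2) {
        __float2_t xv = _ftof2(x[i + 1], x[i]);
        __float2_t yv = _ftof2(y[i + 1], y[i]);
        acc = _daddsp(acc, _dmpysp(xv, yv));      /* 2 SP multiplies + adds  */
    }
    return _hif2(acc) + _lof2(acc);               /* reduce the two partials */
}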

[Block diagram of the TMS320C66x DSP core: instruction fetch, 16-/32-bit instruction dispatch and decode feeding two data paths (A and B), each with .L, .S, .M and .D functional units and a 32-register file (A0–A31, B0–B31); program and data memory controllers (PMC, DMC) with 32 KB L1P and 32 KB L1D caches; unified and extended memory controllers (UMC, XMC) connecting 512 KB of local L2 SRAM/cache, 4096 KB of MSM SRAM, and DDR3 through the external memory controller (EMC) and the DMA switch fabric.]

[Performance plots; x axis: matrix size (m=n=k) from 0 to 4000, y axis: GFLOPS.
 – GEMM on TI C66x DSP, one core, 1 GHz: GEMM NN, NT, TT and TN variants (y axis up to 16 GFLOPS).
 – GEMM on TI C66x DSP, multiple cores, 1 GHz: 1, 2, 4 and 8 cores (y axis up to 80 GFLOPS).
 – GEMM on TI C66x DSP, multiple cores using SuperMatrix, 1 GHz: 8 cores with manual parallelization vs. 1, 2, 4 and 8 cores with SuperMatrix (y axis up to 80 GFLOPS).]

Mapping GotoBLAS to the TI C66x DSP
• The TI DSP allows allocation of buffers in L1, L2, MSMC and DDR.
• DMA is used to overlap computation with transfers between memory layers (see the double-buffering sketch below).
• Highly tuned inner kernel using vector intrinsics.
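A minimal sketch of the double-buffering pattern behind that overlap, under stated assumptions: dma_copy_async() and dma_wait() are hypothetical wrappers standing in for the platform DMA facilities (TI's EDMA on this device), panel_gebp() stands in for the packed inner kernel, A is assumed to be stored as consecutive mc x kc blocks, and packing is omitted for brevity.

/* While the kernel computes on block i of A (already on chip), the DMA
 * engine fetches block i+1 into the other on-chip buffer. */
#include <stddef.h>

typedef struct { int id; } dma_req_t;

/* Hypothetical helpers: start an asynchronous copy / wait for completion. */
dma_req_t dma_copy_async(void *dst, const void *src, size_t bytes);
void      dma_wait(dma_req_t req);

/* Hypothetical packed kernel operating on one mc x kc block of A. */
void panel_gebp(const float *A_pack, const float *B_pack, float *C,
                int mc, int kc, int n);

void gepp_double_buffered(const float *A, const float *B_pack, float *C,
                          int M, int mc, int kc, int n,
                          float *A_buf[2] /* two packed-A buffers in L2 SRAM */)
{
    int cur = 0;
    /* Prefetch the first block of A into on-chip memory. */
    dma_req_t req = dma_copy_async(A_buf[cur], A, (size_t)mc * kc * sizeof(float));

    for (int i = 0; i < M / mc; i++) {
        dma_wait(req);                        /* block i is now on chip      */
        int nxt = 1 - cur;
        if (i + 1 < M / mc)                   /* start fetching block i+1 ...*/
            req = dma_copy_async(A_buf[nxt], A + (size_t)(i + 1) * mc * kc,
                                 (size_t)mc * kc * sizeof(float));
        /* ... while computing on block i. */
        panel_gebp(A_buf[cur], B_pack, C + (size_t)i * mc * n, mc, kc, n);
        cur = nxt;
    }
}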

Option 2: Automatic runtime-based parallelization

libflame SuperMatrix
Pros: Programmability. Existing sequential DLA implementations are parallelized automatically on the DSP.

Cons: Runtime overhead, especially for small matrices.

Architecture     GFLOPS   GFLOPS/Watt   Utilization
Core i7-960          96          1.14           95%
Nvidia GTX280       410          2.6            66%
Cell                200          5.0            88%
Nvidia GTX480       940          5.4            70%
Stratix IV          200          7              90+%
TI C66x DSP          74          7.4            57%

Overlapping of DMA transfers to on-chip memories and computation in an iteration of a GEPP operation. Blocks in blue indicate an ongoing DMA transfer, blocks in green indicate packed buffers, and blocks in red indicate actual computation.

Experimental results on one core
Experimental results on multiple cores

Allocate packed buffers Apack and Bpack
Partition in the K dimension
for p = 0 : K-1 do
    Pack Bp into Bpack
    Partition in the M dimension
    for i = 0 : M-1 do
        Pack and transpose Ai,p into Apack
        sgemmKernel(...)
    end for
end for
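The pseudocode above corresponds roughly to the following C sketch, under stated assumptions: pack_B(), pack_transpose_A() and sgemm_kernel() are hypothetical stand-ins for the packing routines and the tuned sgemmKernel, storage is column-major, and KC/MC are illustrative block sizes rather than the tuned values.

/* Blocked SGEMM loop: C += A*B, column-major, no alpha/beta handling. */
#define KC 256
#define MC 128

void pack_B(const float *B, int ldb, int kc, int n, float *B_pack);
void pack_transpose_A(const float *A, int lda, int mc, int kc, float *A_pack);
void sgemm_kernel(int mc, int n, int kc,
                  const float *A_pack, const float *B_pack,
                  float *C, int ldc);

void sgemm_blocked(int m, int n, int k,
                   const float *A, int lda,
                   const float *B, int ldb,
                   float *C, int ldc,
                   float *A_pack, float *B_pack /* on-chip packed buffers */)
{
    for (int p = 0; p < k; p += KC) {            /* partition in K */
        int kc = (k - p < KC) ? (k - p) : KC;
        pack_B(&B[p], ldb, kc, n, B_pack);       /* pack B_p */
        for (int i = 0; i < m; i += MC) {        /* partition in M */
            int mc = (m - i < MC) ? (m - i) : MC;
            pack_transpose_A(&A[i + p * lda], lda, mc, kc, A_pack); /* pack A_{i,p} */
            sgemm_kernel(mc, n, kc, A_pack, B_pack, &C[i], ldc);    /* C_i += A_{i,p} * B_p */
        }
    }
}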

[Diagram of the partitioning: A and C are split into row panels A0, A1, A2 and C0, C1, C2; B is split into panels B0, B1, B2. In iteration (i, p), block A_{i,p} is packed and transposed into Apack, panel B_p is packed into Bpack, and the result updates C_i.]

• GotoBLAS ideas can be ported directly to a single DSP core.
• Flexibility of the memory hierarchy and DMA transfers.
• Excellent performance per core (10 GFLOPS).
• Scalability up to 8 cores (74 GFLOPS).

• libflame + DSP: one code, one library, multiple architectures.
• Successfully ported to multi-core DSPs.
• No host needed: a purely standalone, low-power DLA solution.
• Competitive asymptotic performance.
• Excellent GFLOPS/Watt ratio.

[Diagram of the GotoBLAS layering mapped onto the memory hierarchy (DDR, MSMC, L2, L1): GEMM (General Matrix-Matrix Multiplication) is decomposed into GEPP (General Panel-Panel Multiplication) operations, each of which is decomposed into GEBP (General Block-Panel Multiplication) operations, with operands staged through the memory levels shown.]

Visit http://www.cs.utexas.edu/~flame

[Diagram of the SuperMatrix flow: a FLAME algorithm is expressed as sequential FLAME/C code operating on hierarchical matrices (storage by blocks); SuperMatrix builds a DAG of tasks from it, which is dispatched via OpenMP to the eight DSP cores (Core0–Core7) operating on C, A and B.]

From algorithms to high-performance multi-core implementations

Runtime features:
• Unlike previous solutions, no host is needed; the runtime runs exclusively on the DSP cores.
• Shares infrastructure with the multi-core and multi-GPU versions of libflame.
• The same DLA codes target multiple architectures, including the DSP (see the illustrative sketch below).
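To make the runtime idea concrete, here is an illustrative sketch, not libflame's actual SuperMatrix API: the sequential blocked code enqueues one task per kernel invocation, dependencies are inferred from which blocks each task reads and writes, and a scheduler later dispatches ready tasks to the cores. The names block_t, task_submit, dag_wait_all and gemm_task are hypothetical.

/* Storage by blocks: each block carries its own buffer and dimensions. */
typedef struct block { float *data; int mb, nb; } block_t;

/* Enqueue a task: 'kernel' consumes the 'nin' input blocks and updates
 * the output block; the runtime records the resulting dependencies.   */
void task_submit(void (*kernel)(block_t **in, block_t *out),
                 block_t **in, int nin, block_t *out);
void dag_wait_all(void);                    /* execute the DAG on the cores */

void gemm_task(block_t **in, block_t *out); /* out += in[0] * in[1]         */

/* C, A and B stored by blocks: C is MxN blocks, A is MxK, B is KxN.        */
void gemm_by_blocks(block_t *C, block_t *A, block_t *B, int M, int N, int K)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int p = 0; p < K; p++) {
                block_t *in[2] = { &A[i * K + p], &B[p * N + j] };
                /* Tasks updating the same C block depend on each other;
                 * independent C blocks can run on different cores.     */
                task_submit(gemm_task, in, 2, &C[i * N + j]);
            }
    dag_wait_all();
}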