DSPs: ubiquity, high performance, low power
• Ubiquity: present in millions of commodity devices.
• High performance: recently added floating-point capabilities to support 4G networks.
• Low power: promising 10 W per chip.
Why dense linear algebra (DLA)?
• Representative of the potential performance of architectures.
• Base for higher-level libraries and applications.
Related previous work
• First attempt to port DLA to DSPs.
• libflame’s SuperMatrix previously ported to multi-core and multi-GPU; the DSP port is a proof of flexibility.
• Same methodology, same codes for all architectures.
Option 1: Manual OpenMP parallelization
Pros: No runtime overhead.
Cons: Poor programmability; every BLAS/LAPACK routine must be parallelized by hand (a sketch follows below).
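To make Option 1 concrete, the sketch below shows a hand-parallelized single-precision AXPY in C with OpenMP; the routine is illustrative (not the poster's code), and every BLAS/LAPACK routine would need a similar explicit annotation.

#include <omp.h>

/* Hand-parallelized single-precision AXPY: y := alpha*x + y.
   Under Option 1, every BLAS/LAPACK routine must be annotated
   explicitly like this. (Illustrative sketch.) */
void saxpy_omp( int n, float alpha, const float *x, float *y )
{
    int i;
    #pragma omp parallel for
    for ( i = 0; i < n; i++ )
        y[ i ] += alpha * x[ i ];
}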
Exploring the capabilities of multi-core DSPs for dense linear algebra
Francisco D. Igual (Texas Advanced Computing Center)
Murtaza Ali (Texas Instruments)
GEMM on a single core
GotoBLAS approach
• High-performance implementation of the BLAS on multiple architectures.
• High-level, multi-layer approach that takes advantage of multiple cache levels.
• We map this approach to the TI DSP at the single-core level.
Texas Instruments C66x
Core capabilities (up to 8 cores)
Floating-point capabilities:
• 16 flops/cycle (single precision).
• SIMD support up to 128 bits.
Eight functional units, two register files, two data paths:
• .M units perform all multiply operations.
• .S and .L units perform general arithmetic, logical, and branch operations.
• .D units perform loads and stores.
• Two general-purpose register files with 32 32-bit registers each.
Memory capabilities:
• 32 KB of L1P and 32 KB of L1D cache.
• 4096 KB of L2 memory (512 KB per core), usable as SRAM, cache, or a mix of both.
• 4096 KB of shared MSM (Multicore Shared Memory), usable as SRAM or L2.
• 64-bit DDR3 external memory interface at 1600 MHz with ECC.
• Full flexibility to treat caches as pure scratchpad memories.
Programmability and multi-thread support
• Native C/C++ code is supported by the TI C/C++ compiler.
• Intrinsics and vector datatypes to improve performance (a tuning sketch follows below).
• Code Composer Studio as IDE.
• Multi-thread support via OpenMP 3.0.
• No cache coherency between cores.
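As a taste of that C-level tuning, here is a minimal sketch built on two documented TI compiler features, the _nassert intrinsic and the MUST_ITERATE pragma; the routine itself is illustrative, and the real GEMM inner kernel additionally relies on vector intrinsics.

/* Illustrative inner loop tuned for the TI C/C++ compiler.
   _nassert() promises pointer alignment and MUST_ITERATE bounds the
   trip count, helping the compiler software-pipeline and vectorize. */
void scale_add( int n, float alpha, const float * restrict x,
                float * restrict y )
{
    int i;
    _nassert( ( (unsigned)x & 7 ) == 0 );  /* x is 8-byte aligned  */
    _nassert( ( (unsigned)y & 7 ) == 0 );  /* y is 8-byte aligned  */
    #pragma MUST_ITERATE( 8, , 8 )         /* n is a multiple of 8 */
    for ( i = 0; i < n; i++ )
        y[ i ] = alpha * x[ i ] + y[ i ];
}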
[Figure: TI C66x DSP core block diagram. Instruction fetch, 16/32-bit instruction dispatch, and instruction decode feed two data paths (A and B), each with a register file of 32 32-bit registers (A0–A31, B0–B31) and four functional units (.L, .S, .M, .D). The Program Memory Controller (PMC) fronts 32 KB of L1P, the Data Memory Controller (DMC) 32 KB of L1D, and the Unified Memory Controller (UMC) 512 KB of L2 cache/SRAM; the Extended Memory Controller (XMC) reaches the 4096 KB MSM SRAM and the External Memory Controller (EMC) the DDR3 interface, all connected through the DMA switch fabric.]
[Figure: GEMM on TI C66x DSP, one core at 1 GHz. GFLOPS (0–16) vs. matrix size (m = n = k, up to 4000) for the GEMM NN, NT, TT, and TN cases.]
[Figure: GEMM on TI C66x DSP, multiple cores at 1 GHz. GFLOPS (0–80) vs. matrix size (m = n = k, up to 4000) for 1, 2, 4, and 8 cores.]
[Figure: GEMM on TI C66x DSP, multiple cores using SuperMatrix at 1 GHz. GFLOPS (0–80) vs. matrix size (m = n = k, up to 4000) for 1, 2, 4, and 8 cores under SuperMatrix, plus 8 cores with manual parallelization.]
Mapping GotoBLAS to the TI C66x DSP
• The TI DSP allows allocation of buffers in L1, L2, MSMC, and DDR.
• DMA is used to overlap computation with transfers between memory layers (see the double-buffering sketch below).
• Highly tuned inner kernel using vector intrinsics.
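That overlap amounts to classic double buffering. In the sketch below, dma_copy_async, dma_wait, and compute_on_panel are hypothetical helpers standing in for the TI EDMA API and the packed-panel kernel; only the structure of the loop reflects the scheme described above.

/* Double-buffered panel streaming (sketch). While the kernel works on
   buf[cur], the DMA engine fills buf[1-cur] with the next panel. */
extern void dma_copy_async( int ch, float *dst, const float *src, int n );
extern void dma_wait( int ch );                  /* hypothetical helpers */
extern void compute_on_panel( const float *panel, int n );

void stream_panels( const float *ddr_src, int npanels, int panel_elems,
                    float *buf[2] )
{
    int p, cur = 0;
    dma_copy_async( cur, buf[cur], ddr_src, panel_elems ); /* prefetch 0 */
    for ( p = 0; p < npanels; p++ ) {
        int nxt = 1 - cur;
        dma_wait( cur );                       /* panel p is now on-chip */
        if ( p + 1 < npanels )                 /* start fetching p + 1   */
            dma_copy_async( nxt, buf[nxt],
                            ddr_src + ( p + 1 ) * panel_elems, panel_elems );
        compute_on_panel( buf[cur], panel_elems ); /* overlaps the DMA   */
        cur = nxt;
    }
}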
Option 2: Automatic runtime-based parallelization
libflame SuperMatrix
Pros: Programmability. Existing sequential DLA implementations parallelize automatically on the DSP (see the sketch below).
Cons: Runtime overhead, especially for small matrices.
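From the programmer's perspective, Option 2 looks like the sequential code below. This is a sketch assuming libflame's FLASH interface (FLASH_Obj_create, FLASH_Gemm, FLASH_Queue_set_num_threads); the problem and block sizes are illustrative choices.

#include "FLAME.h"

int main( void )
{
    FLA_Obj A, B, C;
    dim_t   n = 2048, b = 128;        /* illustrative problem/block sizes */

    FLA_Init();
    FLASH_Queue_set_num_threads( 8 ); /* one worker per DSP core */

    /* Hierarchical (storage-by-blocks) matrices, one level deep;
       contents left uninitialized for brevity. */
    FLASH_Obj_create( FLA_FLOAT, n, n, 1, &b, &A );
    FLASH_Obj_create( FLA_FLOAT, n, n, 1, &b, &B );
    FLASH_Obj_create( FLA_FLOAT, n, n, 1, &b, &C );

    /* A sequential-looking call: SuperMatrix builds the task DAG and
       schedules the blocks across the cores at runtime. */
    FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                FLA_ONE, A, B, FLA_ONE, C );

    FLASH_Obj_free( &A );
    FLASH_Obj_free( &B );
    FLASH_Obj_free( &C );
    FLA_Finalize();
    return 0;
}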
Architecture     GFLOPS   GFLOPS/Watt   Utilization
Core i7-960          96       1.14          95%
Nvidia GTX280       410       2.6           66%
Cell                200       5.0           88%
Nvidia GTX480       940       5.4           70%
Stratix IV          200       7             90+%
TI C66x DSP          74       7.4           57%
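The TI C66x DSP row is consistent with the poster's stated power budget: 74 GFLOPS / 10 W = 7.4 GFLOPS/Watt.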
Overlapping of DMA transfers to on-chip memories and computation in an iteration of a GEPP operation. Blocks in blue indicate an ongoing DMA transfer, blocks in green indicate packed buffers, and blocks in red indicate actual computation.
Allocate packed buffers Apack and Bpack
Partition in the K dimension
for p = 0:K-1 do
    Pack Bp into Bpack
    Partition in the M dimension
    for i = 0:M-1 do
        Pack and transpose Ai,p into Apack
        sgemmKernel(...)
    end for
end for
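A C rendering of this loop might look as follows; pack_B, pack_and_transpose_A, and the calling convention are illustrative placeholders, with sgemmKernel standing for the tuned inner kernel named above. Matrices are assumed column-major.

extern void pack_B( float *Bpack, const float *B, int kb, int n, int ldb );
extern void pack_and_transpose_A( float *Apack, const float *A,
                                  int mb, int kb, int lda );
extern void sgemmKernel( const float *Apack, const float *Bpack,
                         float *C, int mb, int n, int kb, int ldc );

/* Blocked GEMM driver following the GotoBLAS scheme: each panel Bp is
   packed once per K-iteration; each block Ai,p is packed (transposed)
   per M-iteration and fed to the tuned inner kernel. */
void sgemm_blocked( int m, int n, int k,
                    const float *A, int lda, const float *B, int ldb,
                    float *C, int ldc,
                    float *Apack, float *Bpack, int MB, int KB )
{
    int p, i;
    for ( p = 0; p < k; p += KB ) {              /* K dimension */
        int kb = ( k - p < KB ) ? k - p : KB;
        pack_B( Bpack, &B[ p ], kb, n, ldb );    /* pack Bp into Bpack */
        for ( i = 0; i < m; i += MB ) {          /* M dimension */
            int mb = ( m - i < MB ) ? m - i : MB;
            pack_and_transpose_A( Apack, &A[ i + p * lda ], mb, kb, lda );
            sgemmKernel( Apack, Bpack, &C[ i ], mb, n, kb, ldc );
        }
    }
}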
[Figure: partitioning for the blocked GEMM. A is split into column panels A0, A1, A2 and B into row panels B0, B1, B2; in iteration p, panel Bp is packed into Bpack and each block Ai,p is packed into Apack to update the corresponding block Ci of C.]
Conclusions
• GotoBLAS ideas can be ported directly to a single DSP core, exploiting the flexible memory hierarchy and DMA transfers.
• Excellent performance per core (10 GFLOPS) and scalability to 8 cores (74 GFLOPS).
• libflame + DSP: one code, one library, multiple architectures; successfully ported to multi-core DSPs.
• No host needed: a purely standalone, low-power DLA solution.
• Competitive asymptotic performance and an excellent GFLOPS/Watt ratio.
[Figure: mapping of the GotoBLAS layered approach onto the memory hierarchy (L1, L2, MSMC, DDR). GEMM (general matrix-matrix multiplication) is decomposed into GEPP (general panel-panel multiplication) operations, and each GEPP into GEBP (general block-panel multiplication) operations.]
Visit http://www.cs.utexas.edu/~flame
[Figure: the eight DSP cores (Core0–Core7) as targeted by OpenMP and by the SuperMatrix runtime.]
[Figure: the SuperMatrix path from algorithms to parallel execution for C := A B: FLAME algorithm → sequential FLAME/C code → hierarchical matrices (storage by blocks) → DAG of tasks → SuperMatrix runtime.]
From algorithms to high-performance multi-core implementations
Runtime features:
• Unlike previous solutions, no host is needed: the runtime runs exclusively on the DSP cores.
• Shares infrastructure with the multi-core and multi-GPU versions of libflame.
• The same DLA codes target multiple architectures, including the DSP.