IPDPS/HIPS’13
TSHMEM: Shared-Memory
Parallel Computing on Tilera
Many-Core Processors
May 20, 2013
Bryant C. Lam (speaker)
Alan D. George
Herman Lam
NSF Center for High-Performance Reconfigurable Computing (CHREC), University of Florida
Motivation
• Emergent many-core devices vary considerably in architecture and performance characteristics (e.g., scalable mesh networks, ring-bus topologies)
• Platforms provide separate APIs to individually leverage underlying hardware resources, making it difficult to develop efficient cross-platform apps

Objective
• Bridge the gap between many-core hardware utilization and developer productivity

Approach
• Quantify performance of many-core devices via application kernels and microbenchmarks
• Develop TSHMEM to increase device utilization of various many-core platforms with architecture-specific insights
Outline
• Tilera many-core chip architectures
• Microbenchmarks: shared memory copy, UDN latency, TMC spin/sync barriers
• TSHMEM design and infrastructure
• Performance analysis: put/get transfers, barrier synchronization, collectives
• Application case studies
• Conclusions, Q&A
Tilera TILE-Gx Architecture
• 64-bit VLIW processors
• 32 kB L1i cache, 32 kB L1d cache, 256 kB L2 cache per tile
• Up to 750 BOPS
• Up to 200 Tbps of on-chip mesh interconnect
• Over 500 Gbps memory bandwidth
• 1 to 1.5 GHz operating frequency
• Power consumption: 10 to 55 W for typical applications
• 2-4 DDR3 memory controllers
• mPIPE delivers wire-speed packet classification, processing, and distribution
• MiCA for cryptographic acceleration

VLIW: very long instruction word; BOPS: billion operations per second; mPIPE: multicore Programmable Intelligent Packet Engine; MiCA: Multicore iMesh Coprocessing Accelerator
TILE-Gx36 vs. TILEPro64

                        TILE-Gx8036                     TILEPro64
Processors              64-bit VLIW                     32-bit VLIW
L1 cache per tile       32 kB L1i, 32 kB L1d            16 kB L1i, 8 kB L1d
L2 cache per tile       256 kB                          64 kB
Peak throughput         Up to 750 billion op/s          Up to 443 billion op/s
On-chip interconnect    Up to 60 Tbps                   Up to 37 Tbps
Off-chip bandwidth      Over 500 Gbps memory            Up to 50 Gbps I/O
Operating frequency     1 to 1.5 GHz                    700 MHz and 866 MHz
Power consumption       10 to 55 W, typical apps        19 to 23 W @ 700 MHz, full load
Memory controllers      Two DDR3                        Four DDR2
iMesh On-Chip Network Fabric
• TILEPro64: six networks: MDN (Memory), CDN (Coherence), IDN (I/O), TDN (Tile), UDN (User), STN (Static Network)
• TILE-Gx36: five networks instead of six (static network removed)
  - MDN replaced by the QDN (reQuest) and RDN (Response) dynamic networks
  - Memory controller has two QDN (0/1) and two RDN (0/1) network connections, reducing memory-access congestion relative to TILEPro
• SDN (Share): provides access and coherency for the cache system
• IDN (Internal): used primarily for communication with external I/O
• UDN (User): allows user applications to send data directly between tiles
• UDN and IDN are accessible to user applications
TILE-Gx36 Benchmarking
Why microbenchmark?
• Determines empirically realizable TILE-Gx performance
• Defines a realistic upper bound for TSHMEM performance
Benchmarking overview
• Bandwidth of shared-memory copy
• UDN latency across varying test pairs of sender/receiver tiles
• TMC spin/sync barrier primitives
Shared-Memory-Copy Bandwidth
• Measures bandwidth over the iMesh networks to caches and memory controllers; shared-memory performance is critical for TSHMEM
• Bandwidth of memory operations is influenced by 3 of the 5 iMesh networks
  - QDN: memory request network
  - RDN: memory response network
  - SDN: cache sharing network
• Significant bandwidth transitions occur at cache-size limits
  - L1 data cache: 3100 MB/s
  - L2 cache: 2700 MB/s
  - Memory-to-memory performance thereafter
A sketch of the copy benchmark follows.
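A minimal sketch of the heap-to-shared copy benchmark, assuming Tilera's TMC common-memory allocator (tmc_cmem_init()/tmc_cmem_malloc(); exact names and signatures are recalled from the MDE and may differ):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <tmc/cmem.h>   /* TMC common memory; header and calls assumed */

#define ITERS 100

int main(void)
{
    tmc_cmem_init(0);  /* assumed CMEM initialization call */
    for (size_t size = 4; size <= (64UL << 20); size *= 2) {
        char *heap   = malloc(size);
        char *shared = tmc_cmem_malloc(size);  /* assumed CMEM allocator */
        memset(heap, 0xA5, size);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            memcpy(shared, heap, size);        /* heap-to-shared copy */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%10zu B  %8.1f MB/s\n", size, size * (double)ITERS / sec / 1e6);
        free(heap);
    }
    return 0;
}
```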
[Figure: shared-memory-copy bandwidth (MB/s) vs. transfer size (4 B to 64 MB) on TILE-Gx36 and TILEPro64: heap-to-shared, shared-to-heap, and shared-to-shared. Performance transitions align with the 32 kB L1d and 256 kB L2 per-tile cache limits.]
UDN Performance on Gx36
• Tilera User Dynamic Network (UDN): hardware support for routing data packets between CPU cores (tiles)
• UDN microbenchmark: measures average latency between two tiles by ping-ponging numerous packets; see the sketch below
[Diagram: UDN benchmark transfers between two tiles in a 6x6 grid; red for neighbors, green for side-to-side, blue for corner-to-corner.]
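A minimal ping-pong sketch, assuming the TMC UDN API from Tilera's MDE (tmc_udn_send_1(), tmc_udn0_receive(), DynamicHeader from tmc_udn_header_from_cpu(), and get_cycle_count() are recalled from the MDE documentation and may differ; UDN setup via tmc_udn_init()/tmc_udn_activate() is omitted):

```c
#include <stdint.h>
#include <arch/cycle.h>   /* get_cycle_count(); Tilera MDE header (assumed) */
#include <tmc/udn.h>      /* TMC UDN API (names assumed from the MDE docs) */

#define PINGS 100000

/* Average one-way UDN latency between this tile and a partner tile,
 * measured by ping-ponging single-word packets. */
double udn_oneway_ns(DynamicHeader partner, int is_sender, double ns_per_cycle)
{
    uint64_t start = get_cycle_count();
    for (int i = 0; i < PINGS; i++) {
        if (is_sender) {
            tmc_udn_send_1(partner, UDN0_DEMUX_TAG, i);  /* one-word send */
            (void)tmc_udn0_receive();                    /* wait for reply */
        } else {
            (void)tmc_udn0_receive();
            tmc_udn_send_1(partner, UDN0_DEMUX_TAG, i);
        }
    }
    uint64_t cycles = get_cycle_count() - start;
    /* Each iteration is one round trip, so halve for one-way latency. */
    return (double)cycles * ns_per_cycle / PINGS / 2.0;
}
```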
UDN Latency
• Average latency of one-way UDN transfers
• TILEPro64's slightly lower latency is attributed to its 32-bit switching fabric; TILE-Gx uses a 64-bit fabric, yielding roughly twice the data throughput of TILEPro
• The low latency of the UDN makes it attractive for fast inter-tile communication
[Figure: UDN one-way average latency (ns), tile-to-tile in a 6x6 area (neighbors, side-to-side, corners), TILE-Gx36 vs. TILEPro64.]
One-way UDN latency (ns) for tile pairs in a 6x6 area:

Type           Direction    Sender  Receiver  TILE-Gx36  TILEPro64
Neighbors      left         14      13        21         19
               right        14      15        22         19
               up           14      8         22         18
               down         14      20        22         18
               left         28      27        21         19
               right        28      29        22         19
               up           28      22        22         18
               down         28      34        22         18
Side-to-Side   right        6       11        26         25
               left         11      6         25         25
               down         1       31        26         24
               up           31      1         26         24
               right        23      18        25         25
               left         18      23        26         25
               down         33      3         26         24
               up           3       33        26         24
Corners        down-right   0       35        32         33
               up-left      35      0         31         33
               down-left    5       30        31         33
               up-right     30      5         32         33
TMC Spin/Sync Barriers
• Barrier primitives provided by Tilera's TMC library
  - The spin barrier uses spinlocks, discouraging process context switching
  - The sync barrier rectifies this, at a performance penalty (usage sketched below)
• Barrier performance on TILEPro is an order of magnitude slower than on TILE-Gx; TILEPro barriers are potentially too slow for use in SHMEM
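A minimal sketch of how these primitives are used (tmc_spin_barrier_* and tmc_sync_barrier_* from Tilera's TMC library; signatures assumed from the MDE documentation):

```c
#include <tmc/spin.h>   /* tmc_spin_barrier_t (assumed header) */
#include <tmc/sync.h>   /* tmc_sync_barrier_t (assumed header) */

/* For multi-process use, these would live in shared (common) memory;
 * statics suffice for a single-process, multi-threaded sketch. */
static tmc_spin_barrier_t spin_bar;
static tmc_sync_barrier_t sync_bar;

void barriers_init(int ntiles)
{
    tmc_spin_barrier_init(&spin_bar, ntiles);  /* busy-waits; no context switch */
    tmc_sync_barrier_init(&sync_bar, ntiles);  /* may sleep in the kernel */
}

void work_phase(void)
{
    /* ... compute ... */
    tmc_spin_barrier_wait(&spin_bar);  /* low latency when tiles are dedicated */
    /* ... compute ... */
    tmc_sync_barrier_wait(&sync_bar);  /* friendlier when tiles are oversubscribed */
}
```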
[Figure: TMC barrier latency (µs, log scale) vs. number of tiles (0 to 40): TMC spin and TMC sync on TILE-Gx36 and TILEPro64.]
OpenSHMEM and TSHMEM
TSHMEM overview
• Investigate a SHMEM reference design directly over the Tilera TILE-Gx architecture and libraries
• Stay true to SHMEM principles: high performance with low overhead; portability with easy programmability
• Maximize architectural benefits: tile interconnect, mPIPE, MiCA
• Extend the design to multi-device systems: evaluate interconnect capabilities; explore optimizations to point-to-point transfers and collectives
Goal: HPC acceleration with SHMEM on many-core processors
[Diagram: TSHMEM with the OpenSHMEM API layered over Tilera libraries (TMC, gxio, etc.), organized into setup, device functionality, data transfers, and sync.]
TSHMEM reference design on TILE-Gx36: modular design utilizing vendor libraries.
Achieved:
• Dynamic symmetric-heap management
• Point-to-point data transfer
• Point-to-point synchronization
• Barrier synchronization
• Broadcast, collection, reduction
• Atomics for dynamic variables
• Extension to multiple many-core devices
Ongoing:
• Optimizations for multiple many-core devices
• Exploration of new SHMEM extensions
TSHMEM Initialization
Running TSHMEM applications
• Compile a C or C++ program with the TSHMEM library; tile-gcc is provided as the cross compiler for TILE-Gx
• Transfer and run the program using tile-monitor and the shmemrun executable launcher; a minimal program is sketched after the toolflow below
TSHMEM setup and initialization (shmemrun and start_pes())
• Initializes TMC Common Memory (CMEM) for shared-memory allocation
• Initializes UDN for barrier sync and basic inter-tile communications
• Creates processes and binds them to unique tiles
• Allocates shared memory and broadcasts the pointer
• Synchronizes PEs before exiting the routine
[Toolflow: C or C++ SHMEM program → tile-gcc → TSHMEM executable → tile-monitor + shmemrun]
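A minimal OpenSHMEM 1.0 program exercising this path; start_pes(), _my_pe(), _num_pes(), shmalloc(), and shmem_long_inc() are standard OpenSHMEM 1.0 calls, while the compile/run commands below are an assumed invocation based on the toolflow above:

```c
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    start_pes(0);                 /* in TSHMEM: CMEM + UDN init, tile binding */
    int me  = _my_pe();
    int num = _num_pes();

    /* Dynamic symmetric variable from the symmetric heap. */
    long *counter = (long *)shmalloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    shmem_long_inc(counter, 0);   /* every PE increments PE 0's counter */
    shmem_barrier_all();

    if (me == 0)
        printf("PE %d of %d: counter = %ld\n", me, num, *counter);
    shfree(counter);
    return 0;
}
```

Assumed invocation: tile-gcc app.c -ltshmem -o app, then shmemrun -n 36 ./app under tile-monitor (exact flags are hypothetical).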
Put/Get Transfers – Design
• Template functions generated with macro definitions
• Dynamic symmetric-variable transfers: performed via memcpy and shmem_ptr into TMC common memory; see the sketch below
• Point-to-point transfers for static symmetric variables: handled via UDN interrupts and dynamic shared-memory buffer allocation
[Diagram: SHMEM get data path.]
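A sketch of the dynamic-variable fast path this describes (shmem_ptr() is standard OpenSHMEM; the assumption here is that TSHMEM can return a direct load/store pointer into the target PE's symmetric heap in common memory):

```c
#include <string.h>
#include <shmem.h>

/* Conceptual put for dynamic symmetric variables: translate the
 * destination address into a locally mapped pointer for the target
 * PE's symmetric heap, then copy with a plain memcpy; the iMesh
 * networks move the data through the cache system. */
void example_putmem(void *dest, const void *src, size_t nbytes, int pe)
{
    void *remote = shmem_ptr(dest, pe);  /* direct pointer into PE's heap */
    memcpy(remote, src, nbytes);
}
```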
Put/Get Transfers – Performance
TSHMEM dynamic put/get performance closely matches the shared-to-shared memory-copy profile on both TILE-Gx36 and TILEPro64.
[Figure: TSHMEM put/get effective bandwidth (MB/s) vs. transfer size (4 B to 4 MB): TILE-Gx36 dynamic put/get and static put/get, TILEPro64 dynamic put/get.]
Static Put/Get Transfers – Performance
TSHMEM static transfers attempt to offload the request when possible; otherwise a temporary buffer is necessary, which drastically degrades performance.
[Figure: TSHMEM put/get effective bandwidth on TILE-Gx36 (MB/s) vs. transfer size (4 B to 4 MB): dynamic, dynamic-static, static-dynamic, and static put/get.]
Barrier Sync – Design
• Barrier uses UDN for messages
  - SHMEM allows barriers with overlapping active sets, which is difficult to support with TMC barriers
  - TMC barrier primitives on TILEPro are too slow
• Barrier design leverages a split-phase barrier internally but appears blocking externally, matching shmem_barrier's blocking semantics (see the sketch below)
  - Cores (tiles) linearly forward a notify message until all have received it
  - Active-set information is transmitted to avoid overlap errors from multiple barriers
  - The lead tile broadcasts a release message, letting all tiles leave the barrier routine
• Significant performance improvements with increasing number of tiles on TILE-Gx: 1.5 µs latency at 36 tiles
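A sketch of the notify/release pattern described above; udn_send()/udn_recv() are hypothetical wrappers standing in for the actual UDN operations, and the real TSHMEM implementation (active-set tagging, overlap handling) is more involved:

```c
/* Hypothetical one-word UDN wrappers. */
enum msg { NOTIFY, RELEASE };
void udn_send(int tile, enum msg m);   /* send one word to a tile */
enum msg udn_recv(void);               /* block until a word arrives */

/* Split-phase linear barrier over the active set [first, last]:
 * a notify propagates up the chain; the lead (last) tile then
 * broadcasts release. Externally this behaves as a blocking barrier. */
void tshmem_style_barrier(int me, int first, int last)
{
    if (me > first)
        (void)udn_recv();              /* wait for predecessor's notify */
    if (me < last)
        udn_send(me + 1, NOTIFY);      /* forward notify up the chain */

    if (me == last) {
        for (int t = first; t < last; t++)
            udn_send(t, RELEASE);      /* everyone has arrived: release */
    } else {
        (void)udn_recv();              /* block until released */
    }
}
```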
Performance – Barrier Sync
TSHMEM barriers leverage the UDN for better scaling than most Tilera TMC barriers on both TILE-Gx36 and TILEPro64.
[Figures: TMC barrier latency (µs, log scale) vs. number of tiles, TMC spin and sync on TILE-Gx36 and TILEPro64; TSHMEM barrier latency (µs) vs. number of tiles, TSHMEM on TILE-Gx36 (best and worst) and TILEPro64 against TMC spin on TILE-Gx36.]
Collectives – Types
• Broadcast: one-to-all transfer; active-set tiles except the root receive data from the root tile's memory
• Collection: all-to-all transfer; active-set tiles linearly concatenate their data
• Reduction: all-to-all transfer; active-set tiles perform a reduction arithmetic operation on their data
The corresponding OpenSHMEM calls are sketched below.
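For reference, the corresponding OpenSHMEM 1.0 calls over an active set of PE_size PEs starting at PE_start; the pSync/pWrk arrays must be symmetric and initialized to _SHMEM_SYNC_VALUE before first use (sizes use the standard OpenSHMEM 1.0 constants), and the array dimensions here are illustrative:

```c
#include <shmem.h>

/* Symmetric sync/work arrays and symmetric data (static, so the
 * same address on every PE). */
long pSyncB[_SHMEM_BCAST_SYNC_SIZE];
long pSyncC[_SHMEM_COLLECT_SYNC_SIZE];
long pSyncR[_SHMEM_REDUCE_SYNC_SIZE];
int  pWrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];
int  src[4], dst[4 * 36], sum[4];

void collectives_init(void)
{
    for (int i = 0; i < _SHMEM_BCAST_SYNC_SIZE; i++)   pSyncB[i] = _SHMEM_SYNC_VALUE;
    for (int i = 0; i < _SHMEM_COLLECT_SYNC_SIZE; i++) pSyncC[i] = _SHMEM_SYNC_VALUE;
    for (int i = 0; i < _SHMEM_REDUCE_SYNC_SIZE; i++)  pSyncR[i] = _SHMEM_SYNC_VALUE;
    shmem_barrier_all();  /* all PEs must see the initialized pSync arrays */
}

void collectives_example(int PE_root, int PE_start, int logPE_stride, int PE_size)
{
    /* Broadcast: active-set PEs except the root receive src from the root. */
    shmem_broadcast32(dst, src, 4, PE_root,
                      PE_start, logPE_stride, PE_size, pSyncB);

    /* Collection: all-to-all concatenation of each PE's src into dst. */
    shmem_fcollect32(dst, src, 4,
                     PE_start, logPE_stride, PE_size, pSyncC);

    /* Reduction: element-wise sum across the active set into sum. */
    shmem_int_sum_to_all(sum, src, 4,
                         PE_start, logPE_stride, PE_size, pWrk, pSyncR);
}
```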
Performance – Broadcast
Pull-based broadcast reaches up to 46 GB/s aggregate bandwidth at 29 tiles and 37 GB/s aggregate at 36 tiles.
[Figures: push-based broadcast on TILE-Gx36, aggregate effective bandwidth (MB/s, up to ~3500) vs. transfer size (4 B to 4 MB) and number of tiles (2 to 32); pull-based broadcast on TILE-Gx36, aggregate effective bandwidth (MB/s, up to ~50,000) on the same axes.]
Performance – FCollect, Reduction
Results are for naïve fast-collect and reduction implementations; additional algorithms are planned for testing.
[Figures: fast-collect on TILE-Gx36, aggregate effective bandwidth (MB/s, up to ~1000) vs. transfer size (4 B to 1 MB) and number of tiles (2 to 32); reduction on TILE-Gx36, aggregate effective bandwidth (MB/s, up to ~450) on the same axes.]
App Case – 2D-FFT
• TILE-Gx executes an order of magnitude faster than TILEPro due to improved floating-point support
• Speedup at 32 tiles: TILE-Gx, 5×; TILEPro, 12×
[Figure: 2D-FFT execution time (s, log scale) and speedup vs. number of tiles (1 to 32) with TSHMEM: TILE-Gx36 and TILEPro64.]
App Case – CBIR
• Content-based image retrieval: searches based on feature extraction from an input image
• Speedup at 32 tiles: TILE-Gx, 25×; TILEPro, 27×
[Figure: CBIR execution time (s, log scale) and speedup vs. number of tiles (1 to 32) with TSHMEM: TILE-Gx36 and TILEPro64.]
Conclusions
• TSHMEM is a new OpenSHMEM library focused on fully leveraging Tilera many-core devices
• Microbenchmarks define the expected maximum realizable device performance; TSHMEM approaches these microbenchmark results with very little overhead across a variety of functions
• The UDN is leveraged for SHMEM barriers due to the poor performance of TMC barrier primitives on TILEPro
• Performance, portability, and scalability demonstrated via two application case studies
Thanks for listening!
Questions?
References
1. OpenSHMEM, "OpenSHMEM API, v1.0 final," 2012. [Online]. Available: http://www.openshmem.org/
2. B. C. Lam, A. D. George, and H. Lam, "An introduction to TSHMEM for shared-memory parallel computing on Tilera many-core processors," in Proceedings of the 6th Conference on Partitioned Global Address Space Programming Models, ser. PGAS '12. ACM, 2012.
3. Mellanox Technologies, "Mellanox ScalableSHMEM," Sunnyvale, CA, USA, 2012. [Online]. Available: http://www.mellanox.com/related-docs/prod_software/PB_ScalableSHMEM.pdf
4. C. Yoon, V. Aggarwal, V. Hajare, A. D. George, and M. Billingsley, III, "GSHMEM: a portable library for lightweight, shared-memory, parallel programming," in Proceedings of the 5th Conference on Partitioned Global Address Space Programming Models, ser. PGAS '11. ACM, 2011.
5. D. Bonachea, "GASNet specification, v1.1," University of California at Berkeley, Berkeley, CA, USA, Tech. Rep., 2002.
6. Tilera Corporation, "TILE-Gx8036 processor specification brief," San Jose, CA, USA, 2012. [Online]. Available: http://www.tilera.com/sites/default/files/productbriefs/TILE-Gx8036_PB033-02.pdf
7. ——, "TILEPro64 processor product brief," San Jose, CA, USA, 2011. [Online]. Available: http://www.tilera.com/sites/default/files/productbriefs/TILEPro64_Processor_PB019_v4.pdf
8. J. Huang, S. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih, "Image indexing using color correlograms," in Proceedings of the 1997 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 1997, pp. 762-768.