IPDPS/HIPS’13
TSHMEM: Shared-Memory
Parallel Computing on Tilera
Many-Core Processors
May 20, 2013
Bryant C. Lam (speaker)
Alan D. George
Herman Lam
NSF Center for High-Performance Reconfigurable Computing (CHREC), University of Florida
Motivation
• Emergent many-core devices vary considerably in architecture and performance characteristics (e.g., scalable mesh networks, ring-bus topologies)
• Platforms provide separate APIs to individually leverage underlying hardware resources, making it difficult to develop efficient cross-platform apps

Objective
• Bridge the gap between many-core hardware utilization and developer productivity

Approach
• Quantify performance of many-core devices via application kernels and microbenchmarks
• Develop TSHMEM to increase device utilization of various many-core platforms with architecture-specific insights
Outline
• Tilera many-core chip architectures
• Microbenchmarks: shared memory copy, UDN latency, TMC spin/sync barriers
• TSHMEM design and infrastructure
• Performance analysis: put/get transfers, barrier synchronization, collectives
• Application case studies
• Conclusions, Q&A
Tilera TILE-Gx Architecture
• 64-bit VLIW processors
• 32 kB L1i cache, 32 kB L1d cache, 256 kB L2 cache per tile
• Up to 750 BOPS
• Up to 200 Tbps of on-chip mesh interconnect
• Over 500 Gbps memory bandwidth
• 1 to 1.5 GHz operating frequency
• Power consumption: 10 to 55 W for typical applications
• 2-4 DDR3 memory controllers
• mPIPE delivers wire-speed packet classification, processing, and distribution
• MiCA for cryptographic acceleration

VLIW: very long instruction word; BOPS: billion operations per second; mPIPE: multicore Programmable Intelligent Packet Engine; MiCA: Multicore iMesh Coprocessing Accelerator
TILE-Gx36 vs. TILEPro64

                        TILE-Gx8036                     TILEPro64
Processors              64-bit VLIW                     32-bit VLIW
L1 cache per tile       32 kB L1i, 32 kB L1d            16 kB L1i, 8 kB L1d
L2 cache per tile       256 kB                          64 kB
Peak throughput         Up to 750 billion op/s          Up to 443 billion op/s
On-chip interconnect    Up to 60 Tbps                   Up to 37 Tbps
Off-chip bandwidth      Over 500 Gbps memory            Up to 50 Gbps I/O
Operating frequency     1 to 1.5 GHz                    700 MHz and 866 MHz
Power consumption       10 to 55 W, typical apps        19 to 23 W @ 700 MHz, full load
Memory controllers      Two DDR3                        Four DDR2
iMesh On-Chip Network Fabric
• TILEPro64: six networks: MDN (Memory), CDN (Coherence), IDN (I/O), TDN (Tile), UDN (User), STN (Static Network)
• TILE-Gx36: five networks instead of six (static network removed)
  - MDN replaced by the QDN (reQuest) and RDN (Response) dynamic networks
  - Memory controller has two QDN (0/1) and two RDN (0/1) network connections, reducing memory-access congestion relative to TILEPro
• SDN (Share): provides access and coherency for the cache system
• IDN (Internal): used primarily for communication with external I/O
• UDN (User): allows user applications to send data directly between tiles
• UDN and IDN are accessible to user applications
TILE-Gx36 Benchmarking
Why microbenchmark?
• Determines empirically realizable TILE-Gx performance
• Defines a realistic upper bound for TSHMEM performance
Benchmarking overview
• Bandwidth of shared-memory copy
• UDN latency across varying test pairs of sender/receiver tiles
• TMC spin/sync barrier primitives
Shared-Memory-Copy Bandwidth
• Measures bandwidth over the iMesh networks to caches and memory controllers; shared-memory performance is critical for TSHMEM
• Bandwidth of memory operations is influenced by 3 of the 5 iMesh networks
  - QDN: memory request network
  - RDN: memory response network
  - SDN: cache sharing network
• Significant bandwidth transitions occur at cache-size limits
  - L1 data cache: 3100 MB/s
  - L2 cache: 2700 MB/s
  - Memory-to-memory performance thereafter
A sketch of the copy benchmark follows.
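A minimal sketch of the heap-to-shared copy benchmark, assuming Tilera's TMC common-memory allocator (tmc_cmem_init()/tmc_cmem_malloc(); exact names and signatures are recalled from the MDE and may differ):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <tmc/cmem.h>   /* TMC common memory; header and calls assumed */

#define ITERS 100

int main(void)
{
    tmc_cmem_init(0);  /* assumed CMEM initialization call */
    for (size_t size = 4; size <= (64UL << 20); size *= 2) {
        char *heap   = malloc(size);
        char *shared = tmc_cmem_malloc(size);  /* assumed CMEM allocator */
        memset(heap, 0xA5, size);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            memcpy(shared, heap, size);        /* heap-to-shared copy */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%10zu B  %8.1f MB/s\n", size, size * (double)ITERS / sec / 1e6);
        free(heap);
    }
    return 0;
}
```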
[Figure: shared-memory-copy bandwidth (MB/s) vs. transfer size (4 B to 64 MB) on TILE-Gx36 and TILEPro64: heap-to-shared, shared-to-heap, and shared-to-shared. Performance transitions align with the 32 kB L1d and 256 kB L2 per-tile cache limits.]
UDN Performance on Gx36
• Tilera User Dynamic Network (UDN): hardware support for routing data packets between CPU cores (tiles)
• UDN microbenchmark: measures average latency between two tiles by ping-ponging numerous packets; see the sketch below
[Diagram: UDN benchmark transfers between two tiles in a 6x6 grid; red for neighbors, green for side-to-side, blue for corner-to-corner.]
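A minimal ping-pong sketch, assuming the TMC UDN API from Tilera's MDE (tmc_udn_send_1(), tmc_udn0_receive(), DynamicHeader from tmc_udn_header_from_cpu(), and get_cycle_count() are recalled from the MDE documentation and may differ; UDN setup via tmc_udn_init()/tmc_udn_activate() is omitted):

```c
#include <stdint.h>
#include <arch/cycle.h>   /* get_cycle_count(); Tilera MDE header (assumed) */
#include <tmc/udn.h>      /* TMC UDN API (names assumed from the MDE docs) */

#define PINGS 100000

/* Average one-way UDN latency between this tile and a partner tile,
 * measured by ping-ponging single-word packets. */
double udn_oneway_ns(DynamicHeader partner, int is_sender, double ns_per_cycle)
{
    uint64_t start = get_cycle_count();
    for (int i = 0; i < PINGS; i++) {
        if (is_sender) {
            tmc_udn_send_1(partner, UDN0_DEMUX_TAG, i);  /* one-word send */
            (void)tmc_udn0_receive();                    /* wait for reply */
        } else {
            (void)tmc_udn0_receive();
            tmc_udn_send_1(partner, UDN0_DEMUX_TAG, i);
        }
    }
    uint64_t cycles = get_cycle_count() - start;
    /* Each iteration is one round trip, so halve for one-way latency. */
    return (double)cycles * ns_per_cycle / PINGS / 2.0;
}
```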
UDN Latency
• Average latency of one-way UDN transfers
• TILEPro64's slightly lower latency is attributed to its 32-bit switching fabric; TILE-Gx uses a 64-bit fabric, yielding roughly twice the data throughput of TILEPro
• The low latency of the UDN makes it attractive for fast inter-tile communication
[Figure: UDN one-way average latency (ns), tile-to-tile in a 6x6 area (neighbors, side-to-side, corners), TILE-Gx36 vs. TILEPro64.]
One-way UDN latency (ns) for tile pairs in a 6x6 area:

Type           Direction    Sender  Receiver  TILE-Gx36  TILEPro64
Neighbors      left         14      13        21         19
               right        14      15        22         19
               up           14      8         22         18
               down         14      20        22         18
               left         28      27        21         19
               right        28      29        22         19
               up           28      22        22         18
               down         28      34        22         18
Side-to-Side   right        6       11        26         25
               left         11      6         25         25
               down         1       31        26         24
               up           31      1         26         24
               right        23      18        25         25
               left         18      23        26         25
               down         33      3         26         24
               up           3       33        26         24
Corners        down-right   0       35        32         33
               up-left      35      0         31         33
               down-left    5       30        31         33
               up-right     30      5         32         33
TMC Spin/Sync Barriers
• Barrier primitives provided by Tilera's TMC library
  - The spin barrier uses spinlocks, discouraging process context switching
  - The sync barrier rectifies this, at a performance penalty (usage sketched below)
• Barrier performance on TILEPro is an order of magnitude slower than on TILE-Gx; TILEPro barriers are potentially too slow for use in SHMEM
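A minimal sketch of how these primitives are used (tmc_spin_barrier_* and tmc_sync_barrier_* from Tilera's TMC library; signatures assumed from the MDE documentation):

```c
#include <tmc/spin.h>   /* tmc_spin_barrier_t (assumed header) */
#include <tmc/sync.h>   /* tmc_sync_barrier_t (assumed header) */

/* For multi-process use, these would live in shared (common) memory;
 * statics suffice for a single-process, multi-threaded sketch. */
static tmc_spin_barrier_t spin_bar;
static tmc_sync_barrier_t sync_bar;

void barriers_init(int ntiles)
{
    tmc_spin_barrier_init(&spin_bar, ntiles);  /* busy-waits; no context switch */
    tmc_sync_barrier_init(&sync_bar, ntiles);  /* may sleep in the kernel */
}

void work_phase(void)
{
    /* ... compute ... */
    tmc_spin_barrier_wait(&spin_bar);  /* low latency when tiles are dedicated */
    /* ... compute ... */
    tmc_sync_barrier_wait(&sync_bar);  /* friendlier when tiles are oversubscribed */
}
```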
[Figure: TMC barrier latency (µs, log scale) vs. number of tiles (0 to 40): TMC spin and TMC sync on TILE-Gx36 and TILEPro64.]
OpenSHMEM and TSHMEM
TSHMEM overview
• Investigate a SHMEM reference design directly over the Tilera TILE-Gx architecture and libraries
• Stay true to SHMEM principles: high performance with low overhead; portability with easy programmability
• Maximize architectural benefits: tile interconnect, mPIPE, MiCA
• Extend the design to multi-device systems: evaluate interconnect capabilities; explore optimizations to point-to-point transfers and collectives
Goal: HPC acceleration with SHMEM on many-core processors
[Diagram: TSHMEM with the OpenSHMEM API layered over Tilera libraries (TMC, gxio, etc.), organized into setup, device functionality, data transfers, and sync.]
TSHMEM reference design on TILE-Gx36: modular design utilizing vendor libraries.
Achieved:
• Dynamic symmetric-heap management
• Point-to-point data transfer
• Point-to-point synchronization
• Barrier synchronization
• Broadcast, collection, reduction
• Atomics for dynamic variables
• Extension to multiple many-core devices
Ongoing:
• Optimizations for multiple many-core devices
• Exploration of new SHMEM extensions
TSHMEM Initialization
Running TSHMEM applications
• Compile a C or C++ program with the TSHMEM library; tile-gcc is provided as the cross compiler for TILE-Gx
• Transfer and run the program using tile-monitor and the shmemrun executable launcher; a minimal program is sketched after the toolflow below
TSHMEM setup and initialization (shmemrun and start_pes())
• Initializes TMC Common Memory (CMEM) for shared-memory allocation
• Initializes UDN for barrier sync and basic inter-tile communications
• Creates processes and binds them to unique tiles
• Allocates shared memory and broadcasts the pointer
• Synchronizes PEs before exiting the routine
[Toolflow: C or C++ SHMEM program → tile-gcc → TSHMEM executable → tile-monitor + shmemrun]
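A minimal OpenSHMEM 1.0 program exercising this path; start_pes(), _my_pe(), _num_pes(), shmalloc(), and shmem_long_inc() are standard OpenSHMEM 1.0 calls, while the compile/run commands below are an assumed invocation based on the toolflow above:

```c
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    start_pes(0);                 /* in TSHMEM: CMEM + UDN init, tile binding */
    int me  = _my_pe();
    int num = _num_pes();

    /* Dynamic symmetric variable from the symmetric heap. */
    long *counter = (long *)shmalloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    shmem_long_inc(counter, 0);   /* every PE increments PE 0's counter */
    shmem_barrier_all();

    if (me == 0)
        printf("PE %d of %d: counter = %ld\n", me, num, *counter);
    shfree(counter);
    return 0;
}
```

Assumed invocation: tile-gcc app.c -ltshmem -o app, then shmemrun -n 36 ./app under tile-monitor (exact flags are hypothetical).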
Put/Get Transfers – Design
• Template functions generated with macro definitions
• Dynamic symmetric-variable transfers: performed via memcpy and shmem_ptr into TMC common memory; see the sketch below
• Point-to-point transfers for static symmetric variables: handled via UDN interrupts and dynamic shared-memory buffer allocation
[Diagram: SHMEM get data path.]
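A sketch of the dynamic-variable fast path this describes (shmem_ptr() is standard OpenSHMEM; the assumption here is that TSHMEM can return a direct load/store pointer into the target PE's symmetric heap in common memory):

```c
#include <string.h>
#include <shmem.h>

/* Conceptual put for dynamic symmetric variables: translate the
 * destination address into a locally mapped pointer for the target
 * PE's symmetric heap, then copy with a plain memcpy; the iMesh
 * networks move the data through the cache system. */
void example_putmem(void *dest, const void *src, size_t nbytes, int pe)
{
    void *remote = shmem_ptr(dest, pe);  /* direct pointer into PE's heap */
    memcpy(remote, src, nbytes);
}
```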
Put/Get Transfers – Performance
TSHMEM dynamic put/get performance closely matches the shared-to-shared memory-copy profile on both TILE-Gx36 and TILEPro64.
[Figure: TSHMEM put/get effective bandwidth (MB/s) vs. transfer size (4 B to 4 MB): TILE-Gx36 dynamic put/get and static put/get, TILEPro64 dynamic put/get.]
Static Put/Get Transfers – Performance
TSHMEM static transfers attempt to offload the request when possible; otherwise a temporary buffer is necessary, which drastically degrades performance.
[Figure: TSHMEM put/get effective bandwidth on TILE-Gx36 (MB/s) vs. transfer size (4 B to 4 MB): dynamic, dynamic-static, static-dynamic, and static put/get.]
Barrier Sync – Design
• Barrier uses UDN for messages
  - SHMEM allows barriers with overlapping active sets, which is difficult to support with TMC barriers
  - TMC barrier primitives on TILEPro are too slow
• Barrier design leverages a split-phase barrier internally but appears blocking externally, matching shmem_barrier's blocking semantics (see the sketch below)
  - Cores (tiles) linearly forward a notify message until all have received it
  - Active-set information is transmitted to avoid overlap errors from multiple barriers
  - The lead tile broadcasts a release message, letting all tiles leave the barrier routine
• Significant performance improvements with increasing number of tiles on TILE-Gx: 1.5 µs latency at 36 tiles
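A sketch of the notify/release pattern described above; udn_send()/udn_recv() are hypothetical wrappers standing in for the actual UDN operations, and the real TSHMEM implementation (active-set tagging, overlap handling) is more involved:

```c
/* Hypothetical one-word UDN wrappers. */
enum msg { NOTIFY, RELEASE };
void udn_send(int tile, enum msg m);   /* send one word to a tile */
enum msg udn_recv(void);               /* block until a word arrives */

/* Split-phase linear barrier over the active set [first, last]:
 * a notify propagates up the chain; the lead (last) tile then
 * broadcasts release. Externally this behaves as a blocking barrier. */
void tshmem_style_barrier(int me, int first, int last)
{
    if (me > first)
        (void)udn_recv();              /* wait for predecessor's notify */
    if (me < last)
        udn_send(me + 1, NOTIFY);      /* forward notify up the chain */

    if (me == last) {
        for (int t = first; t < last; t++)
            udn_send(t, RELEASE);      /* everyone has arrived: release */
    } else {
        (void)udn_recv();              /* block until released */
    }
}
```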
Performance – Barrier Sync
TSHMEM barriers leverage the UDN for better scaling than most Tilera TMC barriers on both TILE-Gx36 and TILEPro64.
[Figures: TMC barrier latency (µs, log scale) vs. number of tiles, TMC spin and sync on TILE-Gx36 and TILEPro64; TSHMEM barrier latency (µs) vs. number of tiles, TSHMEM on TILE-Gx36 (best and worst) and TILEPro64 against TMC spin on TILE-Gx36.]
Collectives – Types
• Broadcast: one-to-all transfer; active-set tiles except the root receive data from the root tile's memory
• Collection: all-to-all transfer; active-set tiles linearly concatenate their data
• Reduction: all-to-all transfer; active-set tiles perform a reduction arithmetic operation on their data
The corresponding OpenSHMEM calls are sketched below.
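For reference, the corresponding OpenSHMEM 1.0 calls over an active set of PE_size PEs starting at PE_start; the pSync/pWrk arrays must be symmetric and initialized to _SHMEM_SYNC_VALUE before first use (sizes use the standard OpenSHMEM 1.0 constants), and the array dimensions here are illustrative:

```c
#include <shmem.h>

/* Symmetric sync/work arrays and symmetric data (static, so the
 * same address on every PE). */
long pSyncB[_SHMEM_BCAST_SYNC_SIZE];
long pSyncC[_SHMEM_COLLECT_SYNC_SIZE];
long pSyncR[_SHMEM_REDUCE_SYNC_SIZE];
int  pWrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];
int  src[4], dst[4 * 36], sum[4];

void collectives_init(void)
{
    for (int i = 0; i < _SHMEM_BCAST_SYNC_SIZE; i++)   pSyncB[i] = _SHMEM_SYNC_VALUE;
    for (int i = 0; i < _SHMEM_COLLECT_SYNC_SIZE; i++) pSyncC[i] = _SHMEM_SYNC_VALUE;
    for (int i = 0; i < _SHMEM_REDUCE_SYNC_SIZE; i++)  pSyncR[i] = _SHMEM_SYNC_VALUE;
    shmem_barrier_all();  /* all PEs must see the initialized pSync arrays */
}

void collectives_example(int PE_root, int PE_start, int logPE_stride, int PE_size)
{
    /* Broadcast: active-set PEs except the root receive src from the root. */
    shmem_broadcast32(dst, src, 4, PE_root,
                      PE_start, logPE_stride, PE_size, pSyncB);

    /* Collection: all-to-all concatenation of each PE's src into dst. */
    shmem_fcollect32(dst, src, 4,
                     PE_start, logPE_stride, PE_size, pSyncC);

    /* Reduction: element-wise sum across the active set into sum. */
    shmem_int_sum_to_all(sum, src, 4,
                         PE_start, logPE_stride, PE_size, pWrk, pSyncR);
}
```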
Performance – Broadcast
Pull-based broadcast reaches up to 46 GB/s aggregate bandwidth at 29 tiles and 37 GB/s aggregate at 36 tiles.
[Figures: push-based broadcast on TILE-Gx36, aggregate effective bandwidth (MB/s, up to ~3500) vs. transfer size (4 B to 4 MB) and number of tiles (2 to 32); pull-based broadcast on TILE-Gx36, aggregate effective bandwidth (MB/s, up to ~50,000) on the same axes.]
Performance – FCollect, Reduction
Results are for naïve fast-collect and reduction implementations; additional algorithms are planned for testing.
[Figures: fast-collect on TILE-Gx36, aggregate effective bandwidth (MB/s, up to ~1000) vs. transfer size (4 B to 1 MB) and number of tiles (2 to 32); reduction on TILE-Gx36, aggregate effective bandwidth (MB/s, up to ~450) on the same axes.]
App Case – 2D-FFT
• TILE-Gx executes an order of magnitude faster than TILEPro due to improved floating-point support
• Speedup at 32 tiles: TILE-Gx, 5×; TILEPro, 12×
[Figure: 2D-FFT execution time (s, log scale) and speedup vs. number of tiles (1 to 32) with TSHMEM: TILE-Gx36 and TILEPro64.]
App Case – CBIR
• Content-based image retrieval: searches based on feature extraction from an input image
• Speedup at 32 tiles: TILE-Gx, 25×; TILEPro, 27×
[Figure: CBIR execution time (s, log scale) and speedup vs. number of tiles (1 to 32) with TSHMEM: TILE-Gx36 and TILEPro64.]
Conclusions
• TSHMEM is a new OpenSHMEM library focused on fully leveraging Tilera many-core devices
• Microbenchmarks define the expected maximum realizable device performance; TSHMEM approaches these microbenchmark results with very little overhead across a variety of functions
• The UDN is leveraged for SHMEM barriers due to the poor performance of TMC barrier primitives on TILEPro
• Performance, portability, and scalability demonstrated via two application case studies
Thanks for listening!
Questions?
References
1. OpenSHMEM, "OpenSHMEM API, v1.0 final," 2012. [Online]. Available: http://www.openshmem.org/
2. B. C. Lam, A. D. George, and H. Lam, "An introduction to TSHMEM for shared-memory parallel computing on Tilera many-core processors," in Proceedings of the 6th Conference on Partitioned Global Address Space Programming Models, ser. PGAS '12. ACM, 2012.
3. Mellanox Technologies, "Mellanox ScalableSHMEM," Sunnyvale, CA, USA, 2012. [Online]. Available: http://www.mellanox.com/related-docs/prod_software/PB_ScalableSHMEM.pdf
4. C. Yoon, V. Aggarwal, V. Hajare, A. D. George, and M. Billingsley, III, "GSHMEM: a portable library for lightweight, shared-memory, parallel programming," in Proceedings of the 5th Conference on Partitioned Global Address Space Programming Models, ser. PGAS '11. ACM, 2011.
5. D. Bonachea, "GASNet specification, v1.1," University of California at Berkeley, Berkeley, CA, USA, Tech. Rep., 2002.
6. Tilera Corporation, "TILE-Gx8036 processor specification brief," San Jose, CA, USA, 2012. [Online]. Available: http://www.tilera.com/sites/default/files/productbriefs/TILE-Gx8036_PB033-02.pdf
7. ——, "TILEPro64 processor product brief," San Jose, CA, USA, 2011. [Online]. Available: http://www.tilera.com/sites/default/files/productbriefs/TILEPro64_Processor_PB019_v4.pdf
8. J. Huang, S. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih, "Image indexing using color correlograms," in Proceedings of the 1997 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 1997, pp. 762-768.