27
Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang R. Gao Department of Electrical & Computer Engineering University of Delaware 2010-04-23 MTAAP’2010 1

Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

Embed Size (px)

Citation preview

Page 1: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 1Atlanta, Georgia

TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OSHandong Ye, Robert Pavel, Aaron Landwehr, Guang R. Gao Department of Electrical & Computer Engineering University of Delaware2010-04-23

Page 2: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 2

Introduction

• Modern OS based upon a sequential execution model (the von Neumann model).

• Rapid progress of multi-core/many-core chip technology.– Parallel Computer systems now

implemented on single chips.

Page 3: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 3

Introduction

• Conventional OS model must adapt to the underlying changes.

• Further exploit the many levels of parallelism.– Hardware as well as Software

• We introduce a study on how to do this adaptation for the IBM BlueGene/P multi-core system.

Page 4: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 4

Outline

• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work

Page 5: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 5

Contributions

• Isolate traditional OS functions to a single core of the BG/P multi-core chip.

• Ported the TiNy Thread (TNT) execution model to allow for further utilization of BG/P compute cores.

• Expanded the design framework to a multi-chip system designed for scalability to a large number of chips.

Page 6: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 6

Outline

• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work

Page 7: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 7

TiNy Threads on BG/P

• TiNy Threads– Lightweight, non-preemptive, threads– API similar to POSIX Threads.– Originally presented in “TiNy Threads: A Thread Virtual

Machine for the Cyclopse-64 Cellular Architecture”– Runs on IBM Cyclops64

• Kernel Modifications– Alterations to the thread scheduler to allow for non-

preemption

Page 8: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 8

Outline

• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work

Page 9: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 9

Multinode Thread Scheduler

• Thread Scheduler allows TNT to run across multiple nodes.– Requests facilitated through IBM’s

Deep Computing Messaging Framework’s RPCs.

• Multiple Scheduling Algorithms– Workload Un-Aware

• Random• Round-Robin

– Workload Aware

Page 10: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 10

Multinode Thread Scheduling

Node A Node B

tnt_create()…

tnt_join()

tid

…tnt_exit()

…tnt_join()

Page 11: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 11

Outline

• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work

Page 12: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 12

Synchronization

• Three forms– Mutex– Thread Joining– Barrier

• Similar to thread scheduling– Lock requests, Join requests, and

Barrier notifications sent to node responsible for said synchronization

Page 13: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 13

Multinode Thread Scheduling

Node A Node B

tid

…tnt_exit()

tnt_join()

A

tnt_exit()

Page 14: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 14

Outline

• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work

Page 15: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 15

Characteristics of TDSM

• Provides One-Sided access to memory distributed among nodes through IBM’s DCMF.

• Allows for virtual address manipulation– Maps distributed memory to a single virtual

address space.– Allows for array indexing and memory offsets.

• Scalable to a variety of applications– Size of desired global shared memory set at

runtime.• Mutability

– Memory allocation algorithm and memory distribution algorithm can be easily altered and/or replaced.

Page 16: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

16

Example of TDSM

0global

15 30 45

tdsm_read(global[15], local, 20*sizeof(int));

global[15] to global[34]

0x00040012

Node 6Node 5 Node 7

Local Buffer0x0004004E

to0x0004009A

Node 6: 0 to 14and

Node 7: 0 to 4

Page 17: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 17

Outline

• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work

Page 18: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 18

Summary of Results

• The performance of the TNT thread system shows comparable speedup to that of Pthreads running on the same hardware.

• The distributed shared memory operates at 95% of the experimental peak performance of the network, with distance between nodes not being a sensitive factor.

• The cost of thread creation shows a linear relationship as the number of threads increase.

• The cost of waiting at a barrier is constant and independent of the number of threads involved.

Page 19: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 19

Single-Node Thread System Performance

• Based upon Radix-2 Cooley-Tukey algorithm with the Kiss FFT library for the underlying DFT.

• Underlying TNT thread model performs comparably to POSIX standard when the number of threads does not exceed the number of available processor cores.

1024 2048 4096 8192 16384 327680

0.5

1

1.5

2

2.5

TNT POSIX

Size of FFT

Abs

olut

e Sp

eed-

Up

for

Tw

o T

hrea

ds

Page 20: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 20

Memory System Performance

• Reads and writes of varying sizes between one and two nodes.

• For inter-node communications, data can be transferred at approximately 357 MB/s.

• Kumar et al determined experimental peak performance on BG/P to be 374 MB/s in their ICS’08 paper.

0 10485760

0.0005

0.001

0.0015

0.002

0.0025

0.003

Local Read

Local Write

Remote Read

Remote Write

bytes

late

ncy

(s)

Page 21: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 21

Memory System Performance

• Size of Read/Write is a function of the number of nodes across which the data is distributed.

• Latency linearly increases as the amount of data increases, regardless of distance between nodes.

0 256 512 768 10240

0.001

0.002

0.003

0.004

0.005

0.006

Read

Write

Nodes

late

ncy

(s)

Page 22: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 22

Multinode Thread Creation Cost

• Approximately 0.2 seconds per thread

• Remained effectively constant

0 100 200 300 400 500 6000

20

40

60

80

100

120

Number of Threads

Seco

nds

Page 23: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 23

Synchronization Costs

• Performance of barrier is effectively a constant 0.2 seconds.

110

020

030

040

050

060

070

080

090

010

000.1999

0.19992

0.19994

0.19996

0.19998

0.2

0.20002

0.20004

Number of Threads Included in Barrier

Tim

e (

s)

Page 24: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 24

Outline

• Introduction• Contributions• TNT on BlueGene/P• Scheduling TNT across nodes• Synchronization across nodes• TNT Distributed Shared Memory• Results• Conclusions and Future Work

Page 25: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 25

Conclusions and Future Work

• Proven feasibility of system• Benefits of Execution Model-Driven

approach• Room for Improvement

– Improvements to kernel– More rigorous benchmarks– Improved allocation and scheduling

algorithms

Page 26: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 26Atlanta, Georgia

Thank You

Page 27: Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang

MTAAP’2010 27

Bibliography• J. del Cuvillo, W. Zhu, Z. Hu, and G. R. Gao, “Tiny

threads: A thread virtual machine for the cyclops64 cellular architecture,” Parallel and Distributed Processing Symposium, International, vol. 15, p. 265b, 2005.

• S. Kumar, G. Dozsa, G. Almasi et al., “The deep computing messaging framework: generalized scalable message passing on the blue gene/p supercomputer,” in ICS ’08: Proceedings of the 22nd annual international conference on Supercomputing. New York, NY, USA: ACM, 2008, pp. 94–103.

• M. Borgerding, “Kiss FFT.” December 2009, http://sourceforge.net/projects/kissfft/.