McRT: Many-Core Runtime

Page 1: McRT: Many-Core Runtime

• Intel Confidential – Internal Use Only •

Programming Systems Lab

McRT: Many-Core Runtime

Ali Adl-Tabatabai

Anwar Ghuloum

Dong Yuan Chen

Rick Hudson

Vijay Menon

Brian Murphy

Tatiana Shpeisman

Bratin Saha

Programming Systems Lab, MTL/CTG

Page 2: McRT: Many-Core Runtime

What is McRT

• Scalable many-core runtime
• Supports multiple programming models (pthread, OpenMP, …)
• Supports multiple platforms: simulator, SMP and sequestered systems

Page 3: McRT: Many-Core Runtime

McRT: Architecture

[Diagram: McRT layered architecture; labels grouped by layer]

• Applications & libraries: RMS workloads, media workloads, network processing workloads, parallel primitives library, …
• Adapters for programming models: OpenMP, Brook, CILK, Pthread, Java Virtual Machine
• Scalable core services: thread scheduler, thread synchronization, memory management, profiling
• Multiple execution platforms: Windows/Linux IA-32 SMP, SMAC simulator (TA), many-core cache simulator (McPLS), CPU simulator (Skeleton), sequestered core system

Page 4: McRT: Many-Core Runtime

McRT Scheduler Details

[Diagram: three cores, each with 2 HTs]

• Distributed run queues to reduce contention

Page 5: McRT: Many-Core Runtime

McRT Scheduler Details

[Diagram: three cores, each with 2 HTs]

• Distributed run queues to reduce contention
• Program “main” goes into a queue

Page 6: McRT: Many-Core Runtime

McRT Scheduler Details

[Diagram: three cores, each with 2 HTs]

• Program “main” gets picked by a processor

Page 7: McRT: Many-Core Runtime

McRT Scheduler Details

[Diagram: three cores, each with 2 HTs]

• New work gets added to run queues
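The sequence on the last few slides (one run queue per hardware thread, "main" enters a queue, a processor picks it, new work is added) can be sketched as a toy model. The class and method names below are illustrative assumptions, not McRT's API, and the loop is single-threaded for clarity.

```python
from collections import deque


class Scheduler:
    """Toy model: one run queue per hardware thread (HT)."""

    def __init__(self, num_hts):
        self.queues = [deque() for _ in range(num_hts)]
        self.log = []  # (ht, task name) in execution order

    def spawn(self, ht, name, body=None):
        # New work goes onto the spawning HT's own queue, so HTs do not
        # contend on one shared queue.
        self.queues[ht].append((name, body))

    def run(self):
        # Visit the HTs in turn until every queue is empty.
        while any(self.queues):
            for ht, q in enumerate(self.queues):
                if q:
                    name, body = q.popleft()
                    self.log.append((ht, name))
                    if body is not None:
                        body(self, ht)  # a task may spawn more work


def main(sched, ht):
    for i in range(3):
        sched.spawn(ht, f"task{i}")


sched = Scheduler(num_hts=6)   # three cores with 2 HTs each
sched.spawn(0, "main", main)   # program "main" goes into a queue
sched.run()
print(sched.log)
# [(0, 'main'), (0, 'task0'), (0, 'task1'), (0, 'task2')]
```

In this sketch every task stays on HT 0 because nothing moves work between queues, which is the situation that work sharing and work stealing are meant to fix.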

Page 8: McRT: Many-Core Runtime

McRT Scheduler Details

[Diagram: three cores, each with 2 HTs]

• Knob controls work sharing

Page 9: McRT: Many-Core Runtime

McRT Scheduler Details

[Diagram: three cores, each with 2 HTs]

• Work sharing: keeping all cores busy
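One way to picture a work-sharing knob is a threshold on the local queue length: once the spawning HT already holds that many tasks, new work is pushed to the least-loaded queue instead. The function name and the `share_threshold` parameter are assumptions made for this sketch; the slides do not say how McRT's knob is defined.

```python
from collections import deque


def spawn_with_sharing(queues, ht, task, share_threshold):
    """Enqueue task; share it with the least-loaded HT once the local
    queue already holds share_threshold tasks (hypothetical knob)."""
    target = ht
    if len(queues[ht]) >= share_threshold:
        target = min(range(len(queues)), key=lambda i: len(queues[i]))
    queues[target].append(task)
    return target


queues = [deque() for _ in range(6)]   # three cores with 2 HTs each
placed = [spawn_with_sharing(queues, 0, f"task{i}", share_threshold=2)
          for i in range(8)]
print(placed)                    # [0, 0, 1, 2, 3, 4, 5, 1]
print([len(q) for q in queues])  # [2, 2, 1, 1, 1, 1]
```

A higher threshold keeps more work local (better locality, less balance); a threshold of 0 spreads every new task.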

Page 10: McRT: Many-Core Runtime

McRT Scheduler Details

[Diagram: three cores, each with 2 HTs]

Work stealing
• Idle processors look for work in other cores
• Knob controls degree of stealing

Page 11: McRT: Many-Core Runtime

McRT Scheduler Details

[Diagram: three cores, each with 2 HTs]

• Work stealing: reducing periods of idleness
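Work stealing as described on these slides can be sketched as follows: an HT serves its own queue first and, when idle, probes other queues. The `max_victims` parameter is a hypothetical stand-in for the "degree of stealing" knob, and the probing order is an assumption.

```python
from collections import deque


def next_task(queues, ht, max_victims):
    """Return (task, source HT) for an HT, stealing if its queue is empty.

    max_victims is a hypothetical knob: how many other queues an idle HT
    probes before giving up.
    """
    if queues[ht]:
        return queues[ht].popleft(), ht          # local work first
    n = len(queues)
    for step in range(1, min(max_victims, n - 1) + 1):
        victim = (ht + step) % n                 # probe neighbours in order
        if queues[victim]:
            return queues[victim].pop(), victim  # steal from the far end
    return None, None                            # stay idle this round


queues = [deque(["a", "b", "c", "d"]), deque(), deque()]
print(next_task(queues, 0, max_victims=2))  # ('a', 0): local work
print(next_task(queues, 1, max_victims=1))  # (None, None): probes HT 2 only
print(next_task(queues, 1, max_victims=2))  # ('d', 0): probes HT 2, then HT 0
```

Taking from the opposite end of the victim's queue is a common choice because it keeps the thief away from the end the owner is working on.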

Page 12: McRT: Many-Core Runtime

McRT On Sequestered Cores

[Diagram labels]

• Windows host partition: main core(s), Windows + driver
• Sequestered cores partition: sequestered core(s), Light Weight Executive
• Threaded application
• McRT: scheduling, synchronization, memory management, …
• Windows thread partition
• IPI / memory-mapped PCI register based signaling

Page 13: McRT: Many-Core Runtime

McRT-Sequestered Overview

• OS services (e.g. I/O) available only on the main cores
• Sequestered cores used as compute device
• Graphics, games, network processing, etc.
• McRT manages threads on sequestered cores
• LWE provides boot services & exception handling
• McRT partitions HW threads & allows migration between partitions
• Threads migrate from sequestered to main core for OS services
• Thread migration transparent to programmer

(sequestered = set apart, isolated)

Page 14: McRT: Many-Core Runtime

McRT-Sequestered Model

[Diagram: two groups of sequestered cores and one Windows core]

• McRT divides the processors into separate partitions
• Program “main” added to sequestered queue

Page 15: McRT: Many-Core Runtime

McRT-Sequestered Model

[Diagram: two groups of sequestered cores and one Windows core]

• McRT divides the processors into separate partitions
• Program “main” picked by sequestered processor

Page 16: McRT: Many-Core Runtime

McRT-Sequestered Model

[Diagram: two groups of sequestered cores and one Windows core]

• Every partition is a separate entity
• New work added to sequestered queues

Page 17: McRT: Many-Core Runtime

McRT-Sequestered Model

[Diagram: two groups of sequestered cores and one Windows core]

• Every partition is a separate entity
• Work sharing & stealing only within a partition

Page 18: McRT: Many-Core Runtime

McRT-Sequestered Model

[Diagram: two groups of sequestered cores and one Windows core]

• A task can ask McRT to change partitions, e.g., migrate to OS partition, execute OS call & migrate back
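The migrate, call, migrate-back pattern can be illustrated with a generator-based task that yields requests to a toy runtime. The request tuples, partition names, and function names are assumptions for this sketch and do not reflect McRT's actual interface.

```python
def app_task():
    """Yields requests: ('run', label) or ('migrate', partition)."""
    yield ("run", "compute")
    yield ("migrate", "os")           # ask the runtime to change partitions
    yield ("run", "read_file")        # stand-in for an OS call
    yield ("migrate", "sequestered")  # migrate back
    yield ("run", "compute more")


def run_task(task, start):
    """Drive one task and record which partition each step ran on."""
    current, trace = start, []
    for kind, arg in task:
        if kind == "migrate":
            current = arg             # toy runtime requeues the task there
        else:
            trace.append((current, arg))
    return trace


print(run_task(app_task(), "sequestered"))
# [('sequestered', 'compute'), ('os', 'read_file'), ('sequestered', 'compute more')]
```

Only the step that needs OS services runs on the OS partition; everything else stays on the sequestered cores.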

Page 19: McRT: Many-Core Runtime

Backup

Page 20: McRT: Many-Core Runtime

McRT: Research Agenda

• Common scalable many-core runtime
• Support multiple programming models
• Scalable runtime across multiple platforms
• Simulator, SMP and sequestered systems
• Reliability and programmability features
• Threading platform for domain specific & general-purpose languages
• Runtime support for message passing systems

McRT: A scalable and reliable software environment for the many-core platform

Page 21: McRT: Many-Core Runtime

Outline

• McRT overview
• McRT many-core simulation
  • Results and key runtime scalability features
• McRT on SMP systems
• McRT on sequestered core system
• Conclusions

Page 22: McRT: Many-Core Runtime

McRT Scalability: MPEG4

• Nearly linear scaling till 64 HW threads on XviD MPEG4 encoder

[Figure: OMP-XviD speedup on McRT-TA. x-axis: number of cores (4 threads per core), 0 to 18; y-axis: speedup over 1 core (4 threads), 0 to 18; series: 768P, 1080P, linear scaling]

Page 23: McRT: Many-Core Runtime

McRT Scalability: RMS Kernels

• All speedups are relative to execution time on a single core (4 threads)
• Good scalability till 64 HW threads

[Figures: SVD, SOM and BME speedup on McRT-TA, one panel each. x-axis: number of cores (4 threads per core), 0 to 18; y-axis: speedup compared to single core (4 threads), 0 to 18; series in each panel: the kernel and linear]
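The speedup plotted in these figures can be written out explicitly. Here $T(c)$ is a symbol introduced only for this note, meaning the execution time on $c$ cores with 4 HW threads per core:

```latex
S(c) = \frac{T(1)}{T(c)}, \qquad S_{\text{linear}}(c) = c
```

Under this definition, a point on the linear line at 16 cores (64 HW threads) corresponds to $S(16) = 16$. The baseline $T(1)$ already uses 4 threads, so the figures measure scaling across cores rather than across individual threads.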

Page 24: McRT: Many-Core Runtime

McRT: Key Scalability Features

• User-level synchronization primitives
  • Multiple locking algorithms & barrier implementations
  • User-level monitor & mwait for efficient HW spin waiting
• User-level thread scheduler
  • Supports 128+ HW threads
  • Continuation-based threading / task-based model
  • Distributed work queues with support for work stealing and sharing
  • Supports partitioning (used in sequestered platform)
• User-level memory manager
  • Size segregated thread local allocation pools
  • Completely non-blocking implementation
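As a rough illustration of "size segregated thread local allocation pools", the sketch below rounds each request up to a size class and keeps one free list per class per thread. The size classes and function names are assumptions, and the sketch does not attempt the non-blocking handling of blocks freed by another thread that a real allocator would need.

```python
import threading

SIZE_CLASSES = (16, 32, 64, 128, 256)   # illustrative size classes, in bytes

_local = threading.local()               # each thread gets its own pools


def _pools():
    if not hasattr(_local, "free"):
        _local.free = {size: [] for size in SIZE_CLASSES}
    return _local.free


def size_class(nbytes):
    """Smallest size class that fits nbytes."""
    for size in SIZE_CLASSES:
        if nbytes <= size:
            return size
    raise ValueError("request larger than the biggest size class")


def alloc(nbytes):
    cls = size_class(nbytes)
    free_list = _pools()[cls]
    # Reuse a freed block of the same class if the calling thread has one;
    # no lock is needed because the pool is private to the thread.
    return free_list.pop() if free_list else bytearray(cls)


def free(block):
    _pools()[len(block)].append(block)


a = alloc(20)          # served from the 32-byte class
free(a)
b = alloc(30)          # same class, so the freed block is reused
print(len(a), a is b)  # 32 True
```

Because each thread only touches its own lists on the fast path, allocation needs no synchronization in the common case.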

Page 25: McRT: Many-Core Runtime

McRT Core Services: Scalability Improvements

• Single queue gives best load balancing but suffers from contention
• Queued locks deal better with contention at large # of HW threads
• Distributed queues eliminate contention but don’t balance load
• Stealing gives best of all worlds: load balancing + no contention

[Figure: XviD 1080p speedup over single core, 4 HW threads per core. x-axis: number of HW threads & cores (log): 4 thr-1 core, 8 thr-2 core, 16 thr-4 core, 32 thr-8 core, 64 thr-16 core; y-axis: speedup, 0 to 14; series: distributed queue + stealing, distributed queue, single queue + queued locks, single queue + TTS locks]

Page 26: McRT: Many-Core Runtime

Need For Custom Scheduling

• XviD has load imbalance among tasks: stealing helps
• Equake tasks have good load balance: stealing adds overhead

[Figure: instructions executed by different worker threads (32 HW thread config), XviD and Equake. y-axis: instructions, 0.0E+00 to 2.5E+07]
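The effect of load imbalance can be explored with a small deterministic model. It is a sketch under simplifying assumptions: task costs are made-up numbers rather than measurements, each worker drains its own queue, and an optional `steal_cost` (a hypothetical parameter) is charged whenever an idle worker takes a task from the most loaded queue.

```python
import heapq


def makespan(queues, steal_cost=None):
    """Finish time when each worker drains its own queue of task costs.

    With steal_cost set, a worker whose queue is empty takes the last task
    of the queue with the most remaining work and pays steal_cost extra.
    """
    queues = [list(q) for q in queues]
    clocks = [(0.0, w) for w in range(len(queues))]
    heapq.heapify(clocks)
    finish = 0.0
    while clocks:
        now, w = heapq.heappop(clocks)       # worker that becomes free first
        if queues[w]:
            cost = queues[w].pop(0)
        elif steal_cost is not None and any(queues):
            victim = max(range(len(queues)), key=lambda v: sum(queues[v]))
            cost = queues[victim].pop() + steal_cost
        else:
            finish = max(finish, now)        # nothing left for this worker
            continue
        heapq.heappush(clocks, (now + cost, w))
    return finish


imbalanced = [[8, 8, 8, 8], [1], [1], [1]]
balanced = [[4, 4] for _ in range(4)]

print(makespan(imbalanced), makespan(imbalanced, steal_cost=0.5))  # 32.0 9.5
print(makespan(balanced), makespan(balanced, steal_cost=0.5))      # 8.0 8.0
```

In the imbalanced case stealing cuts the finish time sharply. In the balanced case no steal ever happens, so stealing brings no gain; any bookkeeping cost of supporting it, which this model leaves out, would be pure overhead.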

Page 27: McRT: Many-Core Runtime

Outline

• McRT overview
• McRT many-core simulation
• McRT on SMP systems
  • Key challenges and results
• McRT on sequestered core system
• Conclusions

Page 28: McRT: Many-Core Runtime

McRT On SMP Systems

• Key challenge:
  • Efficient coupling between user-level runtime & OS
• Key McRT features:
  • Novel synchronization library
  • Queue based synchronization supporting cancellation and timeout
  • User-level spin waiting + scheduler-level blocking
  • Linux & Windows kernel-level blocking for efficient 1:1 scheduling
  • Predicated continuations for efficient M:N scheduling
  • Non-blocking data structures
  • Provides preemption safety and greater resilience to thread delays
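The combination of brief spinning followed by blocking with a timeout can be sketched with Python's standard `threading.Lock`, which stands in here for McRT's own primitives. The function name and the `spins` and `timeout` parameters are assumptions, and the blocking step goes through Python's lock rather than a user-level scheduler.

```python
import threading


def acquire_spin_then_block(lock, spins=100, timeout=0.05):
    """Try to take lock: spin briefly in user code, then block with a timeout.

    Returns True if the lock was acquired, False if the wait timed out.
    spins and timeout are hypothetical tuning parameters.
    """
    for _ in range(spins):
        if lock.acquire(blocking=False):   # cheap non-blocking attempt
            return True
    # Fall back to a blocking wait that gives up after timeout seconds.
    return lock.acquire(timeout=timeout)


lock = threading.Lock()
print(acquire_spin_then_block(lock))                # True: uncontended
print(acquire_spin_then_block(lock, timeout=0.01))  # False: already held
lock.release()
```

Spinning avoids the cost of blocking when the holder releases quickly, while the timeout bounds how long a waiter can be stuck behind a delayed thread.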

Page 29: McRT: Many-Core Runtime

McRT On SMP: Results

• Both McRT & native speedups are relative to the execution time for 1P on the native (OpenMP) runtime
• McRT and the native (OpenMP) runtime running on the same 16-way IBM SMP Linux system

[Figures: PLSA and SEMPHY speedup on McRT and native runtime, one panel each. x-axis: number of processors, 0 to 18; y-axis: speedup over 1P native execution time, 0 to 16; series: McRT speedup, native speedup]

• Application uses standard OpenMP

Page 30: McRT: Many-Core Runtime

SEMPHY Speedup: Details

• McRT scheduler can provide the advantages of a task queue
• Better programmability

[Figures: SEMPHY speedup on McRT and native runtime, one panel where the application uses standard OpenMP and one where it uses the Intel OpenMP task queue extension. x-axis: number of processors, 0 to 18; y-axis: speedup over 1P native execution time, 0 to 16; series: McRT speedup, native speedup]

Page 31: McRT: Many-Core Runtime

Outline

• McRT overview
• McRT many-core simulation
• McRT on SMP systems
• McRT on sequestered core system
  • Architecture, challenges, and results
• Conclusions

Page 32: McRT: Many-Core Runtime

Sequestered Core Stuff

• See main part of presentation

Page 33: McRT: Many-Core Runtime

McRT-Sequestered Results

• Native: OpenMP on 8P SMP (all processors running Win 2003)
• McRT-OS: McRT on the same 8P SMP (all processors running Win 2003)
• McRT-BareMetal: McRT on the same 8P SMP (1P running Win 2003, 7P sequestered)
• All speedups are relative to the execution time for 1P on the native (OpenMP) runtime
• K processor McRT-BareMetal mode has K-1 sequestered and 1 Win 2003 processor

[Figure: Equake speedup on McRT, McRT-BareMetal & native (OpenMP) runtime. x-axis: number of processors, 0 to 10; y-axis: speedup relative to 1P native execution time, 0 to 12; series: Native, McRT-OS, McRT-BareMetal]
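The K-processor bare-metal configuration described above amounts to a simple split. In the sketch below the function name is hypothetical, and the choice of processor 0 as the Windows processor is an assumption, since the slide does not say which one hosts the OS.

```python
def bare_metal_partitions(k):
    """Split k processors into 1 Windows processor and k-1 sequestered ones.

    Which processor hosts Windows is an assumption (processor 0 here).
    """
    if k < 1:
        raise ValueError("need at least one processor")
    return {"windows": [0], "sequestered": list(range(1, k))}


print(bare_metal_partitions(8))
# {'windows': [0], 'sequestered': [1, 2, 3, 4, 5, 6, 7]}
```

This matches the 8P case on the slide: 1P running Win 2003 and 7P sequestered.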

Page 34: McRT: Many-Core Runtime

Conclusions

• Provide a scalable many-core software environment
  • Support multiple parallel programming models
  • Abstract away the execution platform
  • Good performance on SMP, sequestered system and simulation
• Enhance many-core reliability and programmability
  • Transactional memory
  • Software virtualized transactional memory
  • Transactional data structures and algorithms
  • Speculative and implicit parallelism

Page 35: McRT: Many-Core Runtime

Collaborators

• Platform Architecture Research (PAR/MTL): McPLS simulator
• Architecture Research Lab (ARL/MTL): RMS workloads
• PDSD (SSG): OpenMP library
• Doug Carmean, Eric Sprangle, Anwar Rohillah: TA simulator
• Streaming Media Lab (SMAL/MTL): Sequestered core system
• Network Architecture Lab (NAL/CTL): Packet processing applications

Page 36: McRT: Many-Core Runtime

Backup

Page 37: McRT: Many-Core Runtime

Nehalem Bonnell Comparison

[Figure: Nehalem (NHM) vs. Bonnell (BNL), XviD-480p. x-axis: instructions executed (millions), 1 to 401; y-axis: IPC, 0 to 8; series: NHM 1 thr-1 core, BNL 4 thr-1 core, BNL 8 thr-2 core, BNL 16 thr-4 core]

• Nehalem simulated with Skeleton
• Bonnell simulated with TA
• Instruction counts & execution phases line up nicely