• Intel Confidential – Internal Use Only •
Programming Systems Lab
McRT: Many-Core Runtime
Ali Adl-Tabatabai
Anwar Ghuloum
Dong Yuan Chen
Rick Hudson
Vijay Menon
Brian Murphy
Tatiana Shpeisman
Bratin Saha
Programming Systems Lab, MTL/CTG
What is McRT
A scalable many-core runtime
Supports multiple programming models (pthread, OpenMP, …)
Supports multiple platforms
Simulator, SMP, and sequestered systems
McRT: Architecture
(Layer diagram, top to bottom:)
Applications & Libraries: RMS workloads, media workloads, network-processing workloads, …
Adapters for Programming Models: OpenMP, Brook, CILK, Pthread, Java Virtual Machine, Parallel Primitives Library, …
Scalable Core Services: thread scheduler, thread synchronization, memory management, profiling
Multiple Execution Platforms: Windows/Linux, IA-32 SMP, SMAC simulator (TA), many-core cache simulator (McPLS), sequestered core system, CPU simulator (Skeleton)
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Distributed run queues to reduce contention
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Distributed run queues to reduce contention
Program “main” goes into a queue
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Program “main” gets picked by a processor
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
New work gets added to run queues
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Knob controls work sharing
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Work sharing: keeping all cores busy
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Work stealing
• Idle processors look for work in other cores
• Knob controls degree of stealing
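The sharing/stealing policy sketched across these slides can be illustrated with a small single-threaded simulation. All names below are hypothetical and the model is deliberately simplified (McRT's actual scheduler is native, multi-threaded code): each worker owns a run queue, pops its own work from the front, and steals from the back of a victim's queue when idle.

```python
from collections import deque

class Worker:
    """One scheduler instance with a private run queue."""
    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()
        self.executed = []

    def pop_local(self):
        # The owner pops from the front of its own queue.
        return self.queue.popleft() if self.queue else None

    def steal_from(self, victim):
        # Thieves take from the back, reducing contention with the owner.
        return victim.queue.pop() if victim.queue else None

def run(workers, steps=100):
    """Round-robin simulation: each worker runs local work, else steals."""
    for _ in range(steps):
        for w in workers:
            task = w.pop_local()
            if task is None:
                # Idle processor looks for work in other workers' queues.
                for victim in workers:
                    if victim is not w:
                        task = w.steal_from(victim)
                        if task is not None:
                            break
            if task is not None:
                w.executed.append(task)

workers = [Worker(i) for i in range(3)]
# "main" and all the work it spawns land on one queue, modeling the
# load imbalance that stealing is meant to fix.
workers[0].queue.extend(range(12))
run(workers)
print([len(w.executed) for w in workers])  # → [4, 4, 4]
```

Even though all twelve tasks start on worker 0's queue, stealing spreads the work evenly across the three workers, which is the "reducing periods of idleness" effect shown on the next slide.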
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Work stealing: reducing periods of idleness
McRT On Sequestered Cores
(Diagram: Windows host partition vs. sequestered-cores partition)
Main core(s): Windows + driver; threaded application; McRT; Windows thread partition
Sequestered core(s): threaded application; McRT (scheduling, synchronization, memory management, …); Light Weight Executive
Partitions communicate via IPI / memory-mapped PCI register based signaling
McRT-Sequestered Overview
OS services (e.g., I/O) available only on the main cores
Sequestered cores used as a compute device
Graphics, games, network processing, etc.
McRT manages threads on sequestered cores
LWE provides boot services & exception handling
McRT partitions HW threads & allows migration between partitions
Threads migrate from sequestered to main cores for OS services
Thread migration is transparent to the programmer
(sequestered = set apart)
McRT-Sequestered Model
(Diagram: sequestered cores and a Windows core)
McRT divides the processors into separate partitions
Program “main” added to a sequestered queue
McRT-Sequestered Model
(Diagram: sequestered cores and a Windows core)
McRT divides the processors into separate partitions
Program “main” picked by a sequestered processor
McRT-Sequestered Model
(Diagram: sequestered cores and a Windows core)
Every partition is a separate entity
New work added to sequestered queues
McRT-Sequestered Model
(Diagram: sequestered cores and a Windows core)
Work sharing & stealing only within a partition
Every partition is a separate entity
McRT-Sequestered Model
(Diagram: sequestered cores and a Windows core)
A task can ask McRT to change partitions
e.g., migrate to the OS partition, execute the OS call & migrate back
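The migrate-call-migrate-back flow on this slide can be pictured with continuations, which matches the deck's later mention of continuation-based threading. The sketch below is a hypothetical, single-threaded model of the control flow only (generators stand in for continuations; real McRT moves native threads between hardware partitions): a task yields the name of the partition it needs, and the scheduler re-enqueues it there.

```python
from collections import deque

# One run queue per partition, as in the sequestered model.
queues = {"sequestered": deque(), "os": deque()}

def task():
    # Runs on the sequestered partition until it needs an OS service.
    log.append("compute on sequestered")
    yield "os"              # ask the runtime to migrate to the OS partition
    log.append("OS call on os partition")
    yield "sequestered"     # migrate back once the call completes
    log.append("compute on sequestered again")

def scheduler():
    # Drain both partition queues; a yield names the next partition.
    while queues["sequestered"] or queues["os"]:
        for part in ("sequestered", "os"):
            if queues[part]:
                cont = queues[part].popleft()
                try:
                    target = next(cont)       # run until next migration point
                    queues[target].append(cont)
                except StopIteration:
                    pass                      # task finished

log = []
queues["sequestered"].append(task())
scheduler()
print(log)
```

The task never mentions the scheduler or the queues, which is the sense in which migration stays "transparent to the programmer".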
Backup
McRT: Research Agenda
Common scalable many-core runtime
Support multiple programming models
Scalable runtime across multiple platforms
Simulator, SMP, and sequestered systems
Reliability and programmability features
Threading platform for domain-specific & general-purpose languages
Runtime support for message-passing systems
McRT: a scalable and reliable software environment for the many-core platform
Outline
McRT overview
McRT many-core simulation
Results and key runtime scalability features
McRT on SMP systems
McRT on sequestered core system
Conclusions
McRT Scalability: MPEG4
Nearly linear scaling up to 64 HW threads on the XviD MPEG4 encoder
[Chart: OMP-XviD speedup on McRT-TA — speedup over 1 core (4 threads) vs. number of cores (4 threads per core), for 768p and 1080p inputs, plotted against linear scaling.]
McRT Scalability: RMS Kernels
• All speedups are relative to execution time on a single core (4 threads)
• Good scalability up to 64 HW threads
[Charts: SVD, SOM, and BME speedup on McRT-TA — speedup compared to a single core (4 threads per core) vs. number of cores, each plotted against linear scaling.]
McRT: Key Scalability Features
User-level synchronization primitives
Multiple locking algorithms & barrier implementations
User-level monitor & mwait for efficient HW spin waiting
User-level thread scheduler
Supports 128+ HW threads
Continuation-based threading / task-based model
Distributed work queues with support for work stealing and sharing
Supports partitioning (used in sequestered platform)
User-level memory manager
Size-segregated thread-local allocation pools
Completely non-blocking implementation
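The memory-manager bullet above can be sketched as size-segregated free lists: each thread keeps a pool per size class, so common allocations and frees never touch shared state. The size classes and names below are assumptions for illustration; McRT's real allocator is non-blocking native code.

```python
SIZE_CLASSES = (16, 32, 64, 128)   # hypothetical size-class layout

def size_class(n):
    # Round a request up to the smallest class that fits.
    for c in SIZE_CLASSES:
        if n <= c:
            return c
    raise ValueError("large allocations take a separate global path")

class ThreadLocalPools:
    """Per-thread free lists, one per size class (no locks needed)."""
    def __init__(self):
        self.free = {c: [] for c in SIZE_CLASSES}

    def alloc(self, n):
        c = size_class(n)
        # Reuse a freed block of this class if available, else carve a new one.
        return self.free[c].pop() if self.free[c] else bytearray(c)

    def free_block(self, block):
        # Return the block to its size class's local free list.
        self.free[len(block)].append(block)

pool = ThreadLocalPools()
a = pool.alloc(20)        # served from the 32-byte class
pool.free_block(a)
b = pool.alloc(30)        # same class, so the freed block is reused
print(b is a, len(b))     # → True 32
```

Segregating by size keeps the fast path a single pop/append on thread-private state, which is what makes the scheme scale with thread count.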
McRT Core Services: Scalability Improvements
• Single queue gives best load balancing but suffers from contention
• Queued locks deal better with contention at large numbers of HW threads
• Distributed queues eliminate contention but don’t balance load
• Stealing gives the best of all worlds: load balancing + no contention
[Chart: XviD 1080p speedup over a single core (4 HW threads per core), from 4 threads/1 core to 64 threads/16 cores (log scale), comparing distributed queue + stealing, distributed queue, single queue + queued locks, and single queue + TTS locks.]
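The "queued locks" variant in the comparison above can be sketched with a ticket lock: waiters take increasing ticket numbers and enter in FIFO order when a now-serving counter reaches their ticket, instead of all re-racing on one flag as a TTS lock does. The code below is a hypothetical single-threaded trace of the grant order; the ticket grab stands in for an atomic fetch-and-add instruction.

```python
class TicketLock:
    """FIFO 'queued' lock: each waiter spins on its own ticket number."""
    def __init__(self):
        self.next_ticket = 0
        self.now_serving = 0

    def acquire_ticket(self):
        ticket = self.next_ticket      # stands in for atomic fetch-and-add
        self.next_ticket += 1
        return ticket                  # holder spins until now_serving == ticket

    def release(self):
        self.now_serving += 1          # hand the lock to the next ticket

lock = TicketLock()
tickets = [lock.acquire_ticket() for _ in range(3)]   # three contenders
order = []
while lock.now_serving < lock.next_ticket:
    # Waiters poll in an arbitrary order; only the matching ticket proceeds.
    for t in (tickets[2], tickets[0], tickets[1]):
        if t == lock.now_serving:
            order.append(t)            # this waiter enters the critical section
            lock.release()
print(order)                           # → [0, 1, 2]
```

Even though the waiters poll out of order, the lock is granted strictly by ticket number, which bounds waiting fairness and avoids the cache-line storm a TTS lock suffers at high thread counts.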
Need For Custom Scheduling
XviD has load imbalance among tasks; stealing helps.
Equake tasks have good load balance; stealing adds overhead.
[Chart: instructions executed by different worker threads (32 HW thread configuration) for XviD and Equake.]
Outline
McRT overview
McRT many-core simulation
McRT on SMP systems
Key challenges and results
McRT on sequestered core system
Conclusions
McRT On SMP Systems
Key challenge:
Efficient coupling between user-level runtime & OS
Key McRT features:
Novel synchronization library
Queue-based synchronization supporting cancellation and timeout
User-level spin waiting + scheduler-level blocking
Linux & Windows kernel-level blocking for efficient 1:1 scheduling
Predicated continuations for efficient M:N scheduling
Non-blocking data structures
Provide preemption safety and greater resilience to thread delays
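The non-blocking data structures mentioned above replace locks with compare-and-swap retry loops, so a preempted thread can never block the others — the preemption-safety property the slide refers to. Below is a sketch of that pattern as a Treiber-style stack push; the `compare_and_swap` helper is a simulation (real code uses the hardware cmpxchg instruction), and the names are illustrative, not McRT's.

```python
class Cell:
    """A single shared word, updated only via compare-and-swap."""
    def __init__(self, value):
        self.value = value

    def compare_and_swap(self, expected, new):
        # Simulated CAS: succeeds only if nobody changed the cell meanwhile.
        if self.value == expected:
            self.value = new
            return True
        return False

def nonblocking_push(head, item):
    """Treiber-stack push: read head, build node, retry CAS until it lands."""
    while True:
        old = head.value
        node = (item, old)          # node = (payload, next)
        if head.compare_and_swap(old, node):
            return                  # on failure, loop and retry with fresh head

head = Cell(None)
for x in (1, 2, 3):
    nonblocking_push(head, x)

# Walk the stack: pushes appear in LIFO order.
out, node = [], head.value
while node is not None:
    out.append(node[0])
    node = node[1]
print(out)                          # → [3, 2, 1]
```

Because no thread ever holds a lock, an untimely preemption at worst forces other threads through one extra CAS retry rather than stalling them indefinitely.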
McRT On SMP: Results
Both McRT & native speedups are relative to the execution time for 1P on the native (OpenMP) runtime.
McRT and the native (OpenMP) runtime run on the same 16-way IBM SMP Linux system.
[Charts: PLSA and SEMPHY speedup on McRT and the native runtime — speedup over 1P native execution time vs. number of processors. Application uses standard OpenMP.]
SEMPHY Speedup: Details
The McRT scheduler can provide the advantages of a task queue: better programmability.
[Charts: SEMPHY speedup on McRT and the native runtime, speedup over 1P native execution time vs. number of processors — left: application uses standard OpenMP; right: application uses the Intel OpenMP task queue extension.]
Outline
McRT overview
McRT many-core simulation
McRT on SMP systems
McRT on sequestered core system
Architecture, challenges, and results
Conclusions
Sequestered Core Stuff: see main part of presentation
McRT-Sequestered Results
Native: OpenMP on an 8P SMP (all processors running Win 2003)
McRT-OS: McRT on the same 8P SMP (all processors running Win 2003)
McRT-BareM: McRT on the same 8P SMP (1P running Win 2003, 7P sequestered)
All speedups are relative to the execution time for 1P on the native (OpenMP) runtime
A K-processor McRT-BareMetal configuration has K-1 sequestered processors and 1 Win 2003 processor
[Chart: Equake speedup on the McRT-OS, McRT-BareMetal & native (OpenMP) runtimes — speedup relative to 1P native execution time vs. number of processors.]
Conclusions
Provide a scalable many-core software environment
Support multiple parallel programming models
Abstract away the execution platform
Good performance on SMP, sequestered system, and simulation
Enhance many-core reliability and programmability
Transactional memory
Software virtualized transactional memory
Transactional data structures and algorithms
Speculative and implicit parallelism
Collaborators
Platform Architecture Research (PAR/MTL): McPLS simulator
Architecture Research Lab (ARL/MTL): RMS workloads
PDSD (SSG): OpenMP library
Doug Carmean, Eric Sprangle, Anwar Rohillah: TA simulator
Streaming Media Lab (SMAL/MTL): sequestered core system
Network Architecture Lab (NAL/CTL): packet-processing applications
Backup
Nehalem Bonnell Comparison
Nehalem (NHM) vs. Bonnell (BNL)
[Chart: XviD 480p IPC over instructions executed (millions), for NHM 1 thr/1 core and BNL 4 thr/1 core, 8 thr/2 cores, 16 thr/4 cores.]
• Nehalem simulated with Skeleton
• Bonnell simulated with TA
• Instruction counts & execution phases line up nicely