• Intel Confidential – Internal Use Only •
Programming Systems Lab
McRT: Many-Core Runtime
Ali Adl-Tabatabai
Anwar Ghuloum
Dong Yuan Chen
Rick Hudson
Vijay Menon
Brian Murphy
Tatiana Shpeisman
Bratin Saha
Programming Systems Lab, MTL/CTG
What is McRT
A scalable many-core runtime
Supports multiple programming models (pthread, OpenMP, …)
Supports multiple platforms
Simulator, SMP, and sequestered systems
McRT: Architecture
(Layer diagram, top to bottom:)
Applications & Libraries: RMS workloads, media workloads, network-processing workloads, …
Adapters for Programming Models: OpenMP, Brook, CILK, Pthread, Java Virtual Machine, Parallel Primitives Library, …
Scalable Core Services: thread scheduler, thread synchronization, memory management, profiling
Multiple Execution Platforms: Windows/Linux, IA-32 SMP, SMAC simulator (TA), many-core cache simulator (McPLS), sequestered core system, CPU simulator (Skeleton)
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Distributed run queues to reduce contention
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Distributed run queues to reduce contention
Program “main” goes into a queue
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Program “main” gets picked by a processor
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
New work gets added to run queues
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Knob controls work sharing
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Work sharing: keeping all cores busy
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Work stealing
• Idle processors look for work in other cores
• Knob controls degree of stealing
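The sharing/stealing policy sketched across these slides can be illustrated with a small single-threaded simulation. All names below are hypothetical and the model is deliberately simplified (McRT's actual scheduler is native, multi-threaded code): each worker owns a run queue, pops its own work from the front, and steals from the back of a victim's queue when idle.

```python
from collections import deque

class Worker:
    """One scheduler instance with a private run queue."""
    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()
        self.executed = []

    def pop_local(self):
        # The owner pops from the front of its own queue.
        return self.queue.popleft() if self.queue else None

    def steal_from(self, victim):
        # Thieves take from the back, reducing contention with the owner.
        return victim.queue.pop() if victim.queue else None

def run(workers, steps=100):
    """Round-robin simulation: each worker runs local work, else steals."""
    for _ in range(steps):
        for w in workers:
            task = w.pop_local()
            if task is None:
                # Idle processor looks for work in other workers' queues.
                for victim in workers:
                    if victim is not w:
                        task = w.steal_from(victim)
                        if task is not None:
                            break
            if task is not None:
                w.executed.append(task)

workers = [Worker(i) for i in range(3)]
# "main" and all the work it spawns land on one queue, modeling the
# load imbalance that stealing is meant to fix.
workers[0].queue.extend(range(12))
run(workers)
print([len(w.executed) for w in workers])  # → [4, 4, 4]
```

Even though all twelve tasks start on worker 0's queue, stealing spreads the work evenly across the three workers, which is the "reducing periods of idleness" effect shown on the next slide.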
McRT Scheduler Details
(Diagram: three cores, each with 2 HTs)
Work stealing: reducing periods of idleness
McRT On Sequestered Cores
(Diagram: Windows host partition vs. sequestered-cores partition)
Main core(s): Windows + driver; threaded application; McRT; Windows thread partition
Sequestered core(s): threaded application; McRT (scheduling, synchronization, memory management, …); Light Weight Executive
Partitions communicate via IPI / memory-mapped PCI register based signaling
McRT-Sequestered Overview
OS services (e.g., I/O) available only on the main cores
Sequestered cores used as a compute device
Graphics, games, network processing, etc.
McRT manages threads on sequestered cores
LWE provides boot services & exception handling
McRT partitions HW threads & allows migration between partitions
Threads migrate from sequestered to main cores for OS services
Thread migration is transparent to the programmer
(sequestered = set apart)
McRT-Sequestered Model
(Diagram: sequestered cores and a Windows core)
McRT divides the processors into separate partitions
Program “main” added to a sequestered queue
McRT-Sequestered Model
(Diagram: sequestered cores and a Windows core)
McRT divides the processors into separate partitions
Program “main” picked by a sequestered processor
McRT-Sequestered Model
(Diagram: sequestered cores and a Windows core)
Every partition is a separate entity
New work added to sequestered queues
McRT-Sequestered Model
(Diagram: sequestered cores and a Windows core)
Work sharing & stealing only within a partition
Every partition is a separate entity
McRT-Sequestered Model
(Diagram: sequestered cores and a Windows core)
A task can ask McRT to change partitions
e.g., migrate to the OS partition, execute the OS call & migrate back
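The migrate-call-migrate-back flow on this slide can be pictured with continuations, which matches the deck's later mention of continuation-based threading. The sketch below is a hypothetical, single-threaded model of the control flow only (generators stand in for continuations; real McRT moves native threads between hardware partitions): a task yields the name of the partition it needs, and the scheduler re-enqueues it there.

```python
from collections import deque

# One run queue per partition, as in the sequestered model.
queues = {"sequestered": deque(), "os": deque()}

def task():
    # Runs on the sequestered partition until it needs an OS service.
    log.append("compute on sequestered")
    yield "os"              # ask the runtime to migrate to the OS partition
    log.append("OS call on os partition")
    yield "sequestered"     # migrate back once the call completes
    log.append("compute on sequestered again")

def scheduler():
    # Drain both partition queues; a yield names the next partition.
    while queues["sequestered"] or queues["os"]:
        for part in ("sequestered", "os"):
            if queues[part]:
                cont = queues[part].popleft()
                try:
                    target = next(cont)       # run until next migration point
                    queues[target].append(cont)
                except StopIteration:
                    pass                      # task finished

log = []
queues["sequestered"].append(task())
scheduler()
print(log)
```

The task never mentions the scheduler or the queues, which is the sense in which migration stays "transparent to the programmer".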
Backup
McRT: Research Agenda
Common scalable many-core runtime
Support multiple programming models
Scalable runtime across multiple platforms
Simulator, SMP, and sequestered systems
Reliability and programmability features
Threading platform for domain-specific & general-purpose languages
Runtime support for message-passing systems
McRT: a scalable and reliable software environment for the many-core platform
Outline
McRT overview
McRT many-core simulation
Results and key runtime scalability features
McRT on SMP systems
McRT on sequestered core system
Conclusions
McRT Scalability: MPEG4
Nearly linear scaling up to 64 HW threads on the XviD MPEG4 encoder
[Chart: OMP-XviD speedup on McRT-TA — speedup over 1 core (4 threads) vs. number of cores (4 threads per core), for 768p and 1080p inputs, plotted against linear scaling.]
McRT Scalability: RMS Kernels
• All speedups are relative to execution time on a single core (4 threads)
• Good scalability up to 64 HW threads
[Charts: SVD, SOM, and BME speedup on McRT-TA — speedup compared to a single core (4 threads per core) vs. number of cores, each plotted against linear scaling.]
McRT: Key Scalability Features
User-level synchronization primitives
Multiple locking algorithms & barrier implementations
User-level monitor & mwait for efficient HW spin waiting
User-level thread scheduler
Supports 128+ HW threads
Continuation-based threading / task-based model
Distributed work queues with support for work stealing and sharing
Supports partitioning (used in sequestered platform)
User-level memory manager
Size-segregated thread-local allocation pools
Completely non-blocking implementation
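The memory-manager bullet above can be sketched as size-segregated free lists: each thread keeps a pool per size class, so common allocations and frees never touch shared state. The size classes and names below are assumptions for illustration; McRT's real allocator is non-blocking native code.

```python
SIZE_CLASSES = (16, 32, 64, 128)   # hypothetical size-class layout

def size_class(n):
    # Round a request up to the smallest class that fits.
    for c in SIZE_CLASSES:
        if n <= c:
            return c
    raise ValueError("large allocations take a separate global path")

class ThreadLocalPools:
    """Per-thread free lists, one per size class (no locks needed)."""
    def __init__(self):
        self.free = {c: [] for c in SIZE_CLASSES}

    def alloc(self, n):
        c = size_class(n)
        # Reuse a freed block of this class if available, else carve a new one.
        return self.free[c].pop() if self.free[c] else bytearray(c)

    def free_block(self, block):
        # Return the block to its size class's local free list.
        self.free[len(block)].append(block)

pool = ThreadLocalPools()
a = pool.alloc(20)        # served from the 32-byte class
pool.free_block(a)
b = pool.alloc(30)        # same class, so the freed block is reused
print(b is a, len(b))     # → True 32
```

Segregating by size keeps the fast path a single pop/append on thread-private state, which is what makes the scheme scale with thread count.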
McRT Core Services: Scalability Improvements
• Single queue gives best load balancing but suffers from contention
• Queued locks deal better with contention at large numbers of HW threads
• Distributed queues eliminate contention but don’t balance load
• Stealing gives the best of all worlds: load balancing + no contention
[Chart: XviD 1080p speedup over a single core (4 HW threads per core), from 4 threads/1 core to 64 threads/16 cores (log scale), comparing distributed queue + stealing, distributed queue, single queue + queued locks, and single queue + TTS locks.]
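The "queued locks" variant in the comparison above can be sketched with a ticket lock: waiters take increasing ticket numbers and enter in FIFO order when a now-serving counter reaches their ticket, instead of all re-racing on one flag as a TTS lock does. The code below is a hypothetical single-threaded trace of the grant order; the ticket grab stands in for an atomic fetch-and-add instruction.

```python
class TicketLock:
    """FIFO 'queued' lock: each waiter spins on its own ticket number."""
    def __init__(self):
        self.next_ticket = 0
        self.now_serving = 0

    def acquire_ticket(self):
        ticket = self.next_ticket      # stands in for atomic fetch-and-add
        self.next_ticket += 1
        return ticket                  # holder spins until now_serving == ticket

    def release(self):
        self.now_serving += 1          # hand the lock to the next ticket

lock = TicketLock()
tickets = [lock.acquire_ticket() for _ in range(3)]   # three contenders
order = []
while lock.now_serving < lock.next_ticket:
    # Waiters poll in an arbitrary order; only the matching ticket proceeds.
    for t in (tickets[2], tickets[0], tickets[1]):
        if t == lock.now_serving:
            order.append(t)            # this waiter enters the critical section
            lock.release()
print(order)                           # → [0, 1, 2]
```

Even though the waiters poll out of order, the lock is granted strictly by ticket number, which bounds waiting fairness and avoids the cache-line storm a TTS lock suffers at high thread counts.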
Need For Custom Scheduling
XviD has load imbalance among tasks; stealing helps.
Equake tasks have good load balance; stealing adds overhead.
[Chart: instructions executed by different worker threads (32 HW thread configuration) for XviD and Equake.]
Outline
McRT overview
McRT many-core simulation
McRT on SMP systems
Key challenges and results
McRT on sequestered core system
Conclusions
McRT On SMP Systems
Key challenge:
Efficient coupling between user-level runtime & OS
Key McRT features:
Novel synchronization library
Queue-based synchronization supporting cancellation and timeout
User-level spin waiting + scheduler-level blocking
Linux & Windows kernel-level blocking for efficient 1:1 scheduling
Predicated continuations for efficient M:N scheduling
Non-blocking data structures
Provide preemption safety and greater resilience to thread delays
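The non-blocking data structures mentioned above replace locks with compare-and-swap retry loops, so a preempted thread can never block the others — the preemption-safety property the slide refers to. Below is a sketch of that pattern as a Treiber-style stack push; the `compare_and_swap` helper is a simulation (real code uses the hardware cmpxchg instruction), and the names are illustrative, not McRT's.

```python
class Cell:
    """A single shared word, updated only via compare-and-swap."""
    def __init__(self, value):
        self.value = value

    def compare_and_swap(self, expected, new):
        # Simulated CAS: succeeds only if nobody changed the cell meanwhile.
        if self.value == expected:
            self.value = new
            return True
        return False

def nonblocking_push(head, item):
    """Treiber-stack push: read head, build node, retry CAS until it lands."""
    while True:
        old = head.value
        node = (item, old)          # node = (payload, next)
        if head.compare_and_swap(old, node):
            return                  # on failure, loop and retry with fresh head

head = Cell(None)
for x in (1, 2, 3):
    nonblocking_push(head, x)

# Walk the stack: pushes appear in LIFO order.
out, node = [], head.value
while node is not None:
    out.append(node[0])
    node = node[1]
print(out)                          # → [3, 2, 1]
```

Because no thread ever holds a lock, an untimely preemption at worst forces other threads through one extra CAS retry rather than stalling them indefinitely.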
McRT On SMP: Results
Both McRT & native speedups are relative to the execution time for 1P on the native (OpenMP) runtime.
McRT and the native (OpenMP) runtime run on the same 16-way IBM SMP Linux system.
[Charts: PLSA and SEMPHY speedup on McRT and the native runtime — speedup over 1P native execution time vs. number of processors. Application uses standard OpenMP.]
SEMPHY Speedup: Details
The McRT scheduler can provide the advantages of a task queue: better programmability.
[Charts: SEMPHY speedup on McRT and the native runtime, speedup over 1P native execution time vs. number of processors — left: application uses standard OpenMP; right: application uses the Intel OpenMP task queue extension.]
Outline
McRT overview
McRT many-core simulation
McRT on SMP systems
McRT on sequestered core system
Architecture, challenges, and results
Conclusions
Sequestered Core Stuff: see main part of presentation
McRT-Sequestered Results
Native: OpenMP on an 8P SMP (all processors running Win 2003)
McRT-OS: McRT on the same 8P SMP (all processors running Win 2003)
McRT-BareM: McRT on the same 8P SMP (1P running Win 2003, 7P sequestered)
All speedups are relative to the execution time for 1P on the native (OpenMP) runtime
A K-processor McRT-BareMetal configuration has K-1 sequestered processors and 1 Win 2003 processor
[Chart: Equake speedup on the McRT-OS, McRT-BareMetal & native (OpenMP) runtimes — speedup relative to 1P native execution time vs. number of processors.]
Conclusions
Provide a scalable many-core software environment
Support multiple parallel programming models
Abstract away the execution platform
Good performance on SMP, sequestered system, and simulation
Enhance many-core reliability and programmability
Transactional memory
Software virtualized transactional memory
Transactional data structures and algorithms
Speculative and implicit parallelism
Collaborators
Platform Architecture Research (PAR/MTL): McPLS simulator
Architecture Research Lab (ARL/MTL): RMS workloads
PDSD (SSG): OpenMP library
Doug Carmean, Eric Sprangle, Anwar Rohillah: TA simulator
Streaming Media Lab (SMAL/MTL): sequestered core system
Network Architecture Lab (NAL/CTL): packet-processing applications
Backup
Nehalem Bonnell Comparison
Nehalem (NHM) vs. Bonnell (BNL)
[Chart: XviD 480p IPC over instructions executed (millions), for NHM 1 thr/1 core and BNL 4 thr/1 core, 8 thr/2 cores, 16 thr/4 cores.]
• Nehalem simulated with Skeleton
• Bonnell simulated with TA
• Instruction counts & execution phases line up nicely