Performance Optimizations for NUMA-Multicore Systems
Zoltán Majó
Department of Computer Science, ETH Zurich, Switzerland

About me
ETH Zurich: research assistant. Research: performance optimizations; assistant for lectures.
TUCN: network engineer at the Student Communications Center; assistant at the Department of Computer Science.
Performance optimizations
One goal: make programs run fast.
Idea: pick a good algorithm, reducing the number of operations executed. Example: sorting.
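A minimal sketch (not from the talk) of the two algorithms compared on the charts that follow; bubble sort performs on the order of n^2 comparisons, while the C library's qsort, typically a quicksort variant, needs about n*log(n):

#include <stdlib.h>

/* O(n^2): repeatedly swap adjacent out-of-order elements. */
void bubble_sort(int *a, int n) {
    for (int i = 0; i < n - 1; i++)
        for (int j = 0; j < n - 1 - i; j++)
            if (a[j] > a[j + 1]) {
                int tmp = a[j]; a[j] = a[j + 1]; a[j + 1] = tmp;
            }
}

/* O(n*log(n)) on average: delegate to the library's quicksort-style sort. */
static int cmp_int(const void *x, const void *y) {
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

void quick_sort(int *a, int n) {
    qsort(a, n, sizeof(int), cmp_int);
}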
Sorting
[Chart, built up over several slides: number of operations (execution time [T]) vs. input size n. The polynomial n^2 curve is bubble sort; the n*log(n) curve is quicksort. The chart is annotated: quicksort is 11X faster.]
Sorting
We picked a good algorithm; work done. Are we really done?
We must also make sure our algorithm runs fast: operations take time. So far we assumed 1 operation = 1 time unit (T).
Quicksort performance
[Chart, built up over several slides: execution time [T] vs. input size n for quicksort as the cost of one operation grows: 1 op = 1 T, 2 T, 4 T, 8 T. The chart is annotated: bubble sort at 1 op = 1 T is 32% faster than quicksort at 1 op = 8 T.]
Latency of operations
The best algorithm is not enough: operations are executed on hardware, and the hardware must be used efficiently.
[Diagram: a CPU executing an operation in three stages. Stage 1: Dispatch operation. Stage 2: Execute operation. Stage 3: Retire operation.]
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
Memory accesses
[Diagram: CPU connected to RAM with 230 cycles access latency; the program performs 16 memory accesses.]
Total access latency = 16 x 230 cycles = 3680 cycles
Hits and misses
[Diagram: CPU with a cache (30 cycles access latency) and RAM (200 cycles access latency).]
Cache miss (data not in cache) = 230 cycles; cache hit (data in cache) = 30 cycles.
Total access latency
[Diagram: CPU with a cache (30 cycles access latency) and RAM (200 cycles access latency).]
Total access latency = 4 misses + 12 hits = 4 x 230 cycles + 12 x 30 cycles = 1280 cycles
Benefits of caching
Comparison: architecture without cache: T = 230 cycles per access; architecture with cache: Tavg = 80 cycles per access, a ~2.9X improvement.
Do caches always help? Can you think of an access pattern with bad cache usage?
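Worked out from the example above (16 accesses: 4 misses, 12 hits):

\[
T_{\mathrm{avg}} = \frac{4 \times 230 + 12 \times 30}{16} = \frac{1280}{16} = 80 \ \text{cycles},
\qquad
\frac{230}{80} \approx 2.9\times
\]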
Cache-aware programming
Today's example: matrix-matrix multiplication (MMM). Number of operations: n^3.
We compare a naïve and an optimized implementation with the same number of operations.
MMM: naïve implementation

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        sum = 0.0;
        for (k = 0; k < N; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }

[Diagram: C = A x B; element C[i][j] is computed from row i of A and column j of B.]
MMM: cache behavior
[Animation over several slides: the CPU accesses elements of A[][] and B[][] through the cache (30 cycles) backed by RAM (200 cycles). For four consecutive accesses along a row of A[][], the first access misses and the next three hit. For four accesses down a column of B[][], every access misses.]
MMM: cache performance
Hit rate: accesses to A[][]: 3/4 = 75%; accesses to B[][]: 0/4 = 0%; all accesses: 38%.
Can we do better?
Cache-friendly MMM

Cache-unfriendly MMM (ijk):

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        sum = 0.0;
        for (k = 0; k < N; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }

Cache-friendly MMM (ikj):

for (i = 0; i < N; i++)
    for (k = 0; k < N; k++) {
        r = A[i][k];
        for (j = 0; j < N; j++)
            C[i][j] += r * B[k][j];
    }

[Diagram: C = A x B in ikj order; A[i][k] multiplies row k of B and accumulates into row i of C, so B and C are both traversed row-wise.]
[Animation: cache behavior of the ikj version. Accesses to C[][]: 3 hits out of 4; accesses to B[][]: 3 hits out of 4.]
Cache-friendly MMM
Cache-unfriendly MMM (ijk): A[][]: 3/4 = 75% hit rate; B[][]: 0/4 = 0% hit rate; all accesses: 38% hit rate.
Cache-friendly MMM (ikj): C[][]: 3/4 = 75% hit rate; B[][]: 3/4 = 75% hit rate; all accesses: 75% hit rate.
Better performance due to cache-friendliness?
Performance of MMM
[Chart: execution time [s] on a log scale (0.01 to 10000) vs. matrix size (512 to 8192) for ijk (cache-unfriendly) and ikj (cache-friendly). The ikj version is 20X faster.]
Cache-aware programming
Two versions of MMM, ijk and ikj, with the same number of operations (~n^3), yet ikj is 20X faster than ijk.
Good performance depends on two aspects: a good algorithm and an implementation that takes the hardware into account.
Hardware offers many possibilities for inefficiencies; we consider only the memory system in this lecture.
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
Cache-based architecture
[Diagram: a CPU core with an L1 cache (10 cycles access latency) and an L2 cache (20 cycles access latency), connected through a bus controller and memory controller to RAM (200 cycles access latency).]
Multi-core multiprocessor
[Diagram: two processor packages, each with four cores; every core has a private L1 cache, each pair of cores shares an L2 cache, and each package has its own bus controller. Both packages reach a single RAM through one memory controller.]
Experiment
Performance of a well-optimized program: soplex from SPEC CPU 2006.
Multicore-multiprocessor systems are parallel: multiple programs run on the system simultaneously. Contender program: milc from SPEC CPU 2006.
We examine 4 execution scenarios.
Execution scenarios
[Diagrams: the four scenarios place soplex and milc on different pairs of cores: sharing an L2 cache on the same processor, on the same processor without a shared L2 cache, or on separate processors.]
Resource sharing
Significant slowdowns due to resource sharing.
Why is resource sharing so bad? Example: cache sharing.
Resource sharing
Does resource sharing affect all programs? So far we considered the performance of soplex under contention. Let us consider a different program: namd.
Resource sharing
Significant slowdown for some programs: soplex is affected significantly, namd less so.
What do we do about it? Scheduling can help. Example workload: four instances of soplex and four instances of namd.
Execution scenarios
[Diagrams: two placements of the 4x soplex + 4x namd workload on the two processors. Placement 1: all soplex instances on one processor, all namd instances on the other. Placement 2: two soplex and two namd instances on each processor.]
Challenges for a scheduler
Programs have different behaviors (soplex vs. namd).
Behavior is not known ahead of time.
Behavior changes over time.
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
Hardware performance counters
Special registers, programmable to monitor a given hardware event (e.g., cache misses). They provide low-level information about the hardware-software interaction, with low overhead due to the hardware implementation.
In the past: an undocumented feature. Since the Intel Pentium: publicly available description. Debugging tools: Intel VTune, Intel PTU, AMD CodeAnalyst.
Programming performance counters
Model-specific registers. Access via the RDMSR, WRMSR, and RDPMC instructions: ring 0 instructions, available only in kernel mode.
perf_events interface: the standard Linux interface since Linux 2.6.31. UNIX philosophy: performance counters are files. Simple API: set up counters with perf_event_open(), then read the counters as files.
Example: monitoring cache misses

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    int pid = fork();
    if (pid == 0) {
        /* child: start the program to be monitored */
        execl("./my_program", "./my_program", NULL);
        exit(1);
    } else {
        int status; uint64_t value;
        int fd = perf_event_open(...);   /* set up the counter (see next slide) */
        waitpid(pid, &status, 0);
        read(fd, &value, sizeof(uint64_t));
        printf("Cache misses: %" PRIu64 "\n", value);
    }
}
perf_event_open()
Looks simple:

int sys_perf_event_open(
    struct perf_event_attr *hw_event_uptr,
    pid_t pid,
    int cpu,
    int group_fd,
    unsigned long flags
);

...but the perf_event_attr structure is complex:

struct perf_event_attr {
    __u32 type;
    __u32 size;
    __u64 config;
    union {
        __u64 sample_period;
        __u64 sample_freq;
    };
    __u64 sample_type;
    __u64 read_format;
    __u64 inherit;
    __u64 pinned;
    __u64 exclusive;
    __u64 exclude_user;
    __u64 exclude_kernel;
    __u64 exclude_hv;
    __u64 exclude_idle;
    __u64 mmap;
    /* ... many more fields ... */
libpfm
Open-source helper library.
[Diagram: (1) the user program passes an event name to libpfm; (2) libpfm sets up the perf_event_attr structure; (3) the program calls perf_event_open(); (4) the program reads the results from perf_events.]
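A minimal sketch of the four steps with libpfm4 (error handling omitted; the event name is the Nehalem event used later in this talk, and the raw syscall is spelled out because glibc provides no wrapper):

#include <stdio.h>
#include <string.h>
#include <inttypes.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
#include <perfmon/pfmlib_perf_event.h>   /* libpfm4 */

int main(void) {
    struct perf_event_attr attr;
    pfm_perf_encode_arg_t arg;
    uint64_t value;

    pfm_initialize();                                   /* initialize libpfm */

    memset(&attr, 0, sizeof(attr));
    memset(&arg, 0, sizeof(arg));
    arg.attr = &attr;
    arg.size = sizeof(arg);
    /* (1)+(2): libpfm translates the event name into a perf_event_attr */
    pfm_get_os_event_encoding("OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM",
                              PFM_PLM3, PFM_OS_PERF_EVENT, &arg);

    /* (3): open the counter for this process, on any CPU */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);

    /* ... code to be measured ... */

    read(fd, &value, sizeof(value));                    /* (4): read result */
    printf("Event count: %" PRIu64 "\n", value);
    return 0;
}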
Example: measure cache misses for MMM
Determine the microarchitecture: Intel Xeon E5520 → Nehalem microarchitecture.
Look up the event needed. Source: Intel Architectures Software Developer's Manual. Event name: OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM
MMM cache misses
[Chart: number of cache misses on a log scale vs. matrix size (512 to 8192) for ijk (cache-unfriendly) and ikj (cache-friendly); ijk causes 30X more cache misses.]
Membus: multicore scheduler
1. Dynamically determine program behavior: measure the number of loads/stores that cause memory traffic, using hardware performance counters in sampling mode (see the sketch below).
2. Determine the optimal placement based on the measurements.
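A sketch of what "sampling mode" means with Linux perf_events (an illustration only, not Membus's actual implementation; the choice of last-level-cache read misses as a stand-in for "memory traffic" is an assumption):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open a counter in sampling mode: instead of only counting, the kernel
 * records a sample every `period` events into an mmap ring buffer
 * (reading the ring buffer is not shown here). */
static int open_sampling_counter(pid_t pid, uint64_t period) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.config = PERF_COUNT_HW_CACHE_LL                     /* last-level cache */
                  | (PERF_COUNT_HW_CACHE_OP_READ << 8)       /* read accesses */
                  | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16); /* that miss */
    attr.sample_period = period;      /* one sample every `period` misses */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
    return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}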
Evaluation
Workload with 8 processes: lbm, soplex, gromacs, and hmmer from SPEC CPU 2006, two instances of each program.
Experimental results:
[Chart, annotated across several slides: execution time relative to solo execution (0.0 to 3.0) for lbm, soplex, gromacs, hmmer, and the average, under the default Linux scheduler and under Membus. The annotations highlight improvements of 16% and 8%.]
Summary: multicore processors
Resource sharing critical for performance
Membus: a scheduler that reduces resource sharing
Question: why wasn’t Membus able to improve more?
Memory controller sharing
[Diagram: even with the mixed placement (two soplex and two namd instances per processor), all memory accesses go through a single bus and memory controller to one RAM: the memory controller remains shared by all eight programs.]
Non-uniform memory architecture
[Diagram: the single memory controller is replaced by one memory controller per processor, each with its own local RAM; the two processors are linked by an interconnect, so each processor can also reach the other's RAM remotely.]
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
Non-uniform memory architecture
[Diagram: Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with a memory controller (MC) attached to local DRAM and an interconnect (IC) to the other processor. A thread T accesses data in DRAM.]
Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles.
Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles.
Key to good performance: data locality.
All data based on experimental evaluation of the Intel Xeon 5500 (Hackenberg [MICRO '09], Molka [PACT '09]).
Data locality in multithreaded programs
[Chart: remote memory references as a fraction of total memory references (0% to 60%) for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, mg.C.]
Automatic page placement
First-touch page placement: often a high number of remote accesses.
Alternative: data address profiling and profile-based page placement, supported by hardware performance counters on many architectures (see the sketch below).
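On Linux, the placement step can be expressed with the move_pages() system call; a minimal sketch (error handling and the profiling itself omitted; place_page is a hypothetical helper, not the authors' code):

#include <stdio.h>
#include <numaif.h>   /* move_pages(); link with -lnuma */

/* Move one page (page_addr must be page-aligned) to the NUMA node
 * whose threads access it most, as determined by a profiling run. */
static void place_page(void *page_addr, int preferred_node) {
    void *pages[1]  = { page_addr };
    int   nodes[1]  = { preferred_node };
    int   status[1];
    if (move_pages(0 /* this process */, 1, pages, nodes, status,
                   MPOL_MF_MOVE) != 0)
        perror("move_pages");
}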
Profile-based page placement (based on the work of Marathe et al. [JPDC 2010, PPoPP 2006])
[Diagram: thread T0 runs on Processor 0, thread T1 on Processor 1. The profile records that page P0 is accessed 1000 times by T0 and page P1 is accessed 3000 times by T1, so P0 is placed in Processor 0's DRAM and P1 in Processor 1's DRAM.]
Automatic page placement
Compare first-touch and profile-based page placement. Machine: 2-processor, 8-core Intel Xeon E5520. Benchmarks: the subset of NAS PB programs with a high fraction of remote accesses, run with 8 threads and a fixed thread-to-core mapping.
Profile-based page placement
[Chart: performance improvement over first-touch (0% to 25%) for cg.B, lu.C, bt.B, ft.B, sp.B.]
Profile-based page placement
Performance improvement over first-touch in some cases; no performance improvement in many cases.
Why?
Inter-processor data sharing
[Diagram: as before, page P0 is accessed 1000 times by T0 and page P1 3000 times by T1, but page P2 is accessed 4000 times by T0 and 5000 times by T1. P2 is inter-processor shared: wherever it is placed, one of the processors accesses it remotely.]
Inter-processor data sharing
[Chart: inter-processor shared heap relative to total heap (0% to 60%) for cg.B, lu.C, bt.B, ft.B, sp.B, overlaid on later slides with the performance improvement over first-touch (0% to 30%). Benchmarks with a large shared heap are the ones profile-based placement fails to improve.]
Automatic page placement
Profile-based page placement is often ineffective. Reason: inter-processor data sharing, which is a program property.
We propose program transformations. No time for details now; see the results below.
Evaluation
[Chart: performance improvement over first-touch (0% to 25%) for cg.B, lu.C, bt.B, ft.B, sp.B, comparing profile-based allocation against program transformations.]
Conclusions
Performance optimizations: a good algorithm plus hardware-awareness. Example: cache-aware matrix multiplication.
Hardware awareness: resource sharing in multicore processors; data placement in non-uniform memory architectures.
A lot remains to be done...
...and you can be part of it!
ETH scholarship for masters students...
...to work on their master thesis in the Laboratory of Software Technology.
Prof. Thomas R. Gross: PhD from Stanford University (MIPS project, supervisor John L. Hennessy); at Carnegie Mellon: Warp, iWarp, Fx projects.
ETH offers you: a monthly scholarship of CHF 1500-1700 (EUR 1200-1400), assistance with finding housing, and a thesis topic.
Possible Topics
Michael Pradel: Automatic bug finding
Luca Della Toffola: Performance optimizations for Java
Me: Hardware-aware performance optimizations
• Linked list traversal: looking for the youngest/oldest person
[Diagram: a linked list of Person objects (fields: next, name, surname, age; terminated by null). With whole objects in the cache, each cache line holds few nodes even though the traversal touches only next and age. Profiling the number of field accesses and splitting the class into hot fields (next, age) and a cold class (Class$Cold with name, surname) lets many more hot nodes fit in the cache. A second diagram shows a class A with fields a1..a5 and their access counts (a1: 10, a2: 100, a3: 1000, a4: 30, a5: 2000); after profiling and splitting, the hot fields a3 and a5 stay in A while a1, a2, and a4 move to A$Cold.]
• Jikes RVM
• Splitting strategies
• Garbage collection optimizations
• Allocation optimizations
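The splitting idea expressed in C (a hedged sketch; the slide's setting is Java on Jikes RVM, and these names are illustrative):

#include <stddef.h>

/* Cold part: fields the traversal never touches. */
struct person_cold {
    char name[32];
    char surname[32];
};

/* Hot part: only the fields the traversal reads, so far more
 * list nodes fit into each cache line. */
struct person {
    struct person *next;
    int age;
    struct person_cold *cold;   /* fetched only when name/surname is needed */
};

/* Looking for the youngest person now streams through small hot nodes. */
int youngest_age(const struct person *head) {
    int min_age = -1;
    for (const struct person *p = head; p != NULL; p = p->next)
        if (min_age < 0 || p->age < min_age)
            min_age = p->age;
    return min_age;
}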
If interested and motivated
Apply with Prof. Rodica Potolea by August 2012.
Come to Zurich: start in February 2013 and work 4-6 months on the thesis.
If you have questions: send e-mail to me (zoltan.majo@inf.ethz.ch) or talk to Prof. Rodica Potolea.