Embedded System Lab.
72151691 김해천 [email protected]
Thread and Memory Placement on NUMA Systems:
Asymmetry Matters
김 해 천
Embedded System Lab.
Index
- Introduction: NUMA, thread load balancing in modern OSes, asymmetric architectures
- The Impact of Interconnect Asymmetry on Performance
- New thread and memory placement algorithm
- Evaluation
NUMA (Non-Uniform Memory Access)
The latency of a data access depends on where the data is located, so the placement of threads and memory plays a crucial role in performance.
This motivates NUMA-aware placement algorithms in the OS.
<Diagram: four processors, each with a local memory, connected by an interconnect>
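The latency dependence described above can be sketched with a toy model. The hop counts, latencies, and the `access_latency` helper are illustrative assumptions, not measured values from any real machine.

```python
# Sketch of why placement matters on NUMA: accessing memory on a
# remote node costs extra interconnect hops (all numbers made up).

# Hypothetical node-to-node hop counts for a 4-node machine.
HOPS = [
    [0, 1, 1, 2],
    [1, 0, 2, 1],
    [1, 2, 0, 1],
    [2, 1, 1, 0],
]

LOCAL_NS = 100       # assumed latency of a local access (ns)
HOP_PENALTY_NS = 50  # assumed extra latency per hop (ns)

def access_latency(thread_node: int, memory_node: int) -> int:
    """Latency of one memory access as a function of placement."""
    return LOCAL_NS + HOP_PENALTY_NS * HOPS[thread_node][memory_node]

print(access_latency(0, 0))  # local access
print(access_latency(0, 3))  # two hops away
```

With these assumed numbers, the same access is twice as expensive when the data sits two hops away, which is exactly why thread and memory placement matters.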
Modern OS
Modern OSes aim to reduce the number of hops for thread-to-thread and thread-to-memory communication.
These techniques assume that the interconnect between nodes is symmetric: same bandwidth and same latency on every link.
<Diagram: Linux's load balancing — load is balanced first within the same node, then across nodes that are more hops apart>
Asymmetric architecture
The AMD Bulldozer NUMA machine has eight nodes (each hosting eight cores) and an asymmetric interconnect:
- Links have different widths: some are 16-bit wide, some are 8-bit wide
- Some links can send data faster in one direction than in the other
- Links are shared differently
- Some links are unidirectional
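Such link asymmetry can be pictured as a directed graph with per-direction bandwidths. The sketch below uses invented numbers and is not the actual Bulldozer topology.

```python
# Sketch of an asymmetric interconnect modeled as a directed graph.
# Bandwidths are illustrative assumptions (GB/s): a 16-bit link in
# one direction may be paired with an 8-bit link in the other.
bandwidth = {
    (0, 1): 6.0,  # assumed 16-bit link
    (1, 0): 3.0,  # assumed 8-bit link back
    (0, 2): 6.0,
    (2, 0): 6.0,  # this pair happens to be symmetric
}

def link_bw(src: int, dst: int) -> float:
    """Bandwidth of a direct link; 0 if no such link exists."""
    return bandwidth.get((src, dst), 0.0)

# The asymmetry: node 0 -> 1 is twice as fast as node 1 -> 0.
print(link_bw(0, 1), link_bw(1, 0))
```

Any placement algorithm that assumes `link_bw(a, b) == link_bw(b, a)` will mispredict performance on such a machine.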
Asymmetry
The Impact of Interconnect Asymmetry on Performance
Test of asymmetry: each application runs with 24 threads on three nodes, over all 336 possible node placements.
Depending on the choice of nodes, performance differs dramatically.
These performance differences are caused by the asymmetry of the interconnect between the nodes.
<Figure 2. Performance difference between the best and worst thread placement>
The Impact of Interconnect Asymmetry on Performance
To explain the performance reported in Figure 2, Figure 3 shows the memory latency measured when each application runs on all 336 possible placements.
The applications with the largest performance differences in Figure 2 also show the largest differences in memory latency in Figure 3.
<Figure 3. Difference in latency of memory accesses between the best and worst thread placement>
The Impact of Interconnect Asymmetry on Performance
To further understand the cause of the very high latencies in the "bad" configurations, streamcluster was run with 16 threads on two nodes:
- Performance is correlated with the latency of memory accesses
- The latency of memory accesses is not correlated with the number of hops
- The latency of memory accesses is actually correlated with the bandwidth between the nodes
<Table 1. Performance of streamcluster executing with 16 threads on 2 nodes>
New thread and memory placement
Challenges:
- Efficient online measurement of communication patterns is challenging
- Changing the placement of threads and memory may incur high overhead
- Accommodating multiple applications simultaneously is challenging
- Selecting the best placement is combinatorially difficult
Solution: Algorithm
AsymSched relies on 3 components:
- Measurement component: computes the salient metrics
- Decision component: periodically computes the best thread placement
- Migration component: migrates threads and memory
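The three components above form a periodic loop. A minimal sketch with stubbed component functions; the names and return shapes are my own, not AsymSched's actual API.

```python
# Skeleton of the measure -> decide -> migrate loop (all stubs).

def measure():
    """Measurement component: gather CPU-to-CPU and CPU-to-memory
    communication metrics (stubbed with one fake cluster)."""
    return {"clusters": [{"threads": [0, 1], "weight": 10.0}]}

def decide(metrics):
    """Decision component: compute the best thread placement
    (stubbed: pin the only cluster to node 0)."""
    return {0: [0]}  # cluster index -> list of nodes

def migrate(placement):
    """Migration component: move threads and memory (stubbed)."""
    return placement

def scheduler_tick():
    """One iteration of the periodic placement loop."""
    metrics = measure()
    placement = decide(metrics)
    return migrate(placement)

print(scheduler_tick())
```

In the real system this loop would run periodically in the background; the sketch only shows how the three components feed one another.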
Algorithm: Measurement
AsymSched continuously gathers metrics characterizing the volume of CPU-to-CPU and CPU-to-memory communication, to detect which threads share data:
- CPU-to-CPU: accesses to cached data
- CPU-to-memory: accesses to data located in RAM
<Diagram: counters monitoring CPU-to-CPU and CPU-to-memory accesses>
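One way to picture this measurement step: aggregating sampled per-thread access counts into a communication matrix. The sample counts below are made up for illustration, and the symmetrization is my own simplification.

```python
# Sketch: folding sampled access counts into a communication matrix.
from collections import defaultdict

# (source thread, target thread, sampled count of accesses to data
# cached by the target) -- illustrative numbers only.
samples = [
    (0, 1, 120),
    (1, 0, 95),
    (2, 3, 80),
]

comm = defaultdict(int)
for src, dst, count in samples:
    # Symmetrize: sharing between two threads is mutual.
    pair = tuple(sorted((src, dst)))
    comm[pair] += count

print(dict(comm))  # {(0, 1): 215, (2, 3): 80}
```

The decision step then only needs per-pair totals, not the raw samples.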
Algorithm: Decision
Threads that share data are grouped into clusters, and each cluster is assigned a weight. Clusters with the highest weights are scheduled on the nodes with the best connectivity.
<Diagram: threads sharing data A form cluster A, threads sharing data B form cluster B; each cluster is assigned a weight>
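Grouping threads into weighted clusters can be sketched with a union-find over the communication matrix. The pair counts are illustrative and the helper names are my own, not AsymSched's.

```python
# Sketch: cluster threads that share data; weight each cluster by
# its total communication volume (counts are made-up).

comm = {(0, 1): 215, (1, 2): 40, (3, 4): 80}  # thread-pair counts

parent = list(range(5))  # union-find over 5 threads

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Any pair that communicates belongs to the same cluster.
for (a, b), count in comm.items():
    union(a, b)

clusters = {}
for t in range(5):
    clusters.setdefault(find(t), []).append(t)

# Weight of a cluster = total communication among its threads.
weight = {root: sum(c for (a, b), c in comm.items() if find(a) == root)
          for root in clusters}
print(clusters, weight)
```

Here threads 0-2 end up in one cluster (weight 255) and threads 3-4 in another (weight 80), so the first cluster gets the best-connected nodes.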
Algorithm: Decision
AsymSched computes possible placements for all the clusters. A placement P is an array mapping each cluster to a set of nodes (Node 1 … Node n).
The number of possible placements is very large, so it is important that AsymSched not test all of them:
- When an application uses two nodes, only node pairs connected by a 16-bit link are considered
- Configurations of nodes with the same bandwidth hash to the same value, so only one representative per hash is evaluated
For each candidate, AsymSched computes P_wbw, the weighted bandwidth of P:
P_wbw = Σ_c weight(c) × bandwidth(nodes assigned to c in P)
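A toy version of this placement search, under stated assumptions: an invented 4-node machine, two clusters of two nodes each, and the same-bandwidth hashing trick to skip equivalent configurations.

```python
# Sketch: pick the placement maximizing weighted bandwidth,
# P_wbw = sum over clusters of weight(c) * bandwidth(nodes of c).
# Topology, weights, and bandwidths are illustrative.

cluster_weights = {"A": 255.0, "B": 80.0}

# Assumed bandwidth between node pairs (GB/s) on an asymmetric
# machine: some pairs are better connected than others.
pair_bw = {(0, 1): 6.0, (2, 3): 3.0, (0, 2): 3.0, (1, 3): 6.0}

def bw(nodes):
    a, b = sorted(nodes)
    return pair_bw.get((a, b), 0.0)

def weighted_bandwidth(placement):
    """placement: cluster name -> pair of nodes."""
    return sum(cluster_weights[c] * bw(nodes)
               for c, nodes in placement.items())

best = None
seen = set()  # placements with identical bandwidths score alike
candidates = [(0, 1), (2, 3), (0, 2), (1, 3)]
for pair_a in candidates:
    for pair_b in candidates:
        if set(pair_a) & set(pair_b):
            continue  # clusters cannot share nodes
        key = (bw(pair_a), bw(pair_b))
        if key in seen:
            continue  # hash trick: skip equivalent-bandwidth configs
        seen.add(key)
        p = {"A": pair_a, "B": pair_b}
        if best is None or weighted_bandwidth(p) > weighted_bandwidth(best):
            best = p
print(best)
```

As expected, the heaviest cluster (A) ends up on the fastest link, and the dedup set keeps equivalent placements from being re-scored.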
Algorithm: Migration
AsymSched migrates threads using a system call, and migrates memory either fully or dynamically (migrating only the subset of pages that are accessed):
- If, 2 seconds after the thread migration, the application still performs more than 90% of its memory accesses on the old nodes (A_old / A > 90%), full memory migration is used
- Otherwise (A_old / A < 90%), dynamic migration is used
<Diagram: threads migrated from Node 1 to Node n; the memory migration strategy is chosen after 2 seconds>
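The threshold decision above can be sketched directly. The 90% value comes from the slide; the function name and the zero-access fallback are my assumptions.

```python
# Sketch of the memory-migration policy: after a 2 s observation
# window, choose full vs. dynamic migration based on the fraction
# of accesses still hitting the old nodes (A_old / A).

FULL_MIGRATION_THRESHOLD = 0.90  # from the slide: A_old / A > 90%

def choose_migration(old_node_accesses: int, total_accesses: int) -> str:
    """Return the memory-migration strategy for one application."""
    if total_accesses == 0:
        return "dynamic"  # assumption: no accesses yet, migrate lazily
    ratio = old_node_accesses / total_accesses
    return "full" if ratio > FULL_MIGRATION_THRESHOLD else "dynamic"

print(choose_migration(95, 100))  # full
print(choose_migration(40, 100))  # dynamic
```

The intuition: if almost all accesses still go to the old nodes, it is cheaper to move everything at once than to migrate pages one by one as they are touched.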
Evaluation
Single-application workloads:
- AsymSched always performs close to the best static thread placement
- Thread placement without migration is not sufficient to achieve the best performance: performance stays close to the average, with a high standard deviation
<Figure 4. Performance difference>
Evaluation
Multi-application workloads:
- AsymSched achieves performance close to or better than the best static thread placement
- It produces a very low standard deviation
<Figure 4. Performance difference>
Conclusion
- Asymmetry of the interconnect drastically impacts performance: the bandwidth between nodes is more important than the distance
- AsymSched is a new thread and memory placement algorithm that maximizes the bandwidth available to communicating threads
- As the number of nodes in NUMA systems increases, the interconnect is less likely to remain symmetric, so AsymSched's design principles will be of growing importance in the future