Embedded System Lab.
72151691 김해천 [email protected]
Thread and Memory Placement on NUMA Systems:
Asymmetry Matters
김 해 천
Embedded System Lab.
Index
- Introduction: NUMA, thread load balancing in modern OSes, asymmetric architectures
- The Impact of Interconnect Asymmetry on Performance
- New thread and memory placement algorithm
- Evaluation
NUMA (Non-Uniform Memory Access)
The latency of a data access depends on where the data is located, so the placement of threads and memory plays a crucial role in performance.
This motivates NUMA-aware placement algorithms in the OS.
<Diagram: four processors, each with a local memory, connected by an interconnect>
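The latency dependence described above can be sketched with a toy model. The hop counts, latencies, and the `access_latency` helper are illustrative assumptions, not measured values from any real machine.

```python
# Sketch of why placement matters on NUMA: accessing memory on a
# remote node costs extra interconnect hops (all numbers made up).

# Hypothetical node-to-node hop counts for a 4-node machine.
HOPS = [
    [0, 1, 1, 2],
    [1, 0, 2, 1],
    [1, 2, 0, 1],
    [2, 1, 1, 0],
]

LOCAL_NS = 100       # assumed latency of a local access (ns)
HOP_PENALTY_NS = 50  # assumed extra latency per hop (ns)

def access_latency(thread_node: int, memory_node: int) -> int:
    """Latency of one memory access as a function of placement."""
    return LOCAL_NS + HOP_PENALTY_NS * HOPS[thread_node][memory_node]

print(access_latency(0, 0))  # local access
print(access_latency(0, 3))  # two hops away
```

With these assumed numbers, the same access is twice as expensive when the data sits two hops away, which is exactly why thread and memory placement matters.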
Modern OS
Modern OSes aim to reduce the number of hops for thread-to-thread and thread-to-memory communication.
These techniques assume that the interconnect between nodes is symmetric: same bandwidth and same latency on every link.
<Diagram: Linux's load balancing — load is balanced first within the same node, then across nodes that are more hops apart>
Asymmetric architecture
The AMD Bulldozer NUMA machine has eight nodes (each hosting eight cores) and an asymmetric interconnect:
- Links have different widths: some are 16-bit wide, some are 8-bit wide
- Some links can send data faster in one direction than in the other
- Links are shared differently
- Some links are unidirectional
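Such link asymmetry can be pictured as a directed graph with per-direction bandwidths. The sketch below uses invented numbers and is not the actual Bulldozer topology.

```python
# Sketch of an asymmetric interconnect modeled as a directed graph.
# Bandwidths are illustrative assumptions (GB/s): a 16-bit link in
# one direction may be paired with an 8-bit link in the other.
bandwidth = {
    (0, 1): 6.0,  # assumed 16-bit link
    (1, 0): 3.0,  # assumed 8-bit link back
    (0, 2): 6.0,
    (2, 0): 6.0,  # this pair happens to be symmetric
}

def link_bw(src: int, dst: int) -> float:
    """Bandwidth of a direct link; 0 if no such link exists."""
    return bandwidth.get((src, dst), 0.0)

# The asymmetry: node 0 -> 1 is twice as fast as node 1 -> 0.
print(link_bw(0, 1), link_bw(1, 0))
```

Any placement algorithm that assumes `link_bw(a, b) == link_bw(b, a)` will mispredict performance on such a machine.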
Asymmetry
The Impact of Interconnect Asymmetry on Performance
Test of asymmetry: each application runs with 24 threads on three nodes, over all 336 possible node placements.
Depending on the choice of nodes, performance differs dramatically.
These performance differences are caused by the asymmetry of the interconnect between the nodes.
<Figure 2. Performance difference between the best and worst thread placement>
The Impact of Interconnect Asymmetry on Performance
To explain the performance reported in Figure 2, Figure 3 shows the memory latency measured when each application runs on all 336 possible placements.
The applications with the largest performance differences in Figure 2 also show the largest differences in memory latency in Figure 3.
<Figure 3. Difference in latency of memory accesses between the best and worst thread placement>
The Impact of Interconnect Asymmetry on Performance
To further understand the cause of the very high latencies in the "bad" configurations, streamcluster was run with 16 threads on two nodes:
- Performance is correlated with the latency of memory accesses
- The latency of memory accesses is not correlated with the number of hops
- The latency of memory accesses is actually correlated with the bandwidth between the nodes
<Table 1. Performance of streamcluster executing with 16 threads on 2 nodes>
New thread and memory placement
Challenges:
- Efficient online measurement of communication patterns is challenging
- Changing the placement of threads and memory may incur high overhead
- Accommodating multiple applications simultaneously is challenging
- Selecting the best placement is combinatorially difficult
Solution: Algorithm
AsymSched relies on 3 components:
- Measurement component: computes the salient metrics
- Decision component: periodically computes the best thread placement
- Migration component: migrates threads and memory
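The three components above form a periodic loop. A minimal sketch with stubbed component functions; the names and return shapes are my own, not AsymSched's actual API.

```python
# Skeleton of the measure -> decide -> migrate loop (all stubs).

def measure():
    """Measurement component: gather CPU-to-CPU and CPU-to-memory
    communication metrics (stubbed with one fake cluster)."""
    return {"clusters": [{"threads": [0, 1], "weight": 10.0}]}

def decide(metrics):
    """Decision component: compute the best thread placement
    (stubbed: pin the only cluster to node 0)."""
    return {0: [0]}  # cluster index -> list of nodes

def migrate(placement):
    """Migration component: move threads and memory (stubbed)."""
    return placement

def scheduler_tick():
    """One iteration of the periodic placement loop."""
    metrics = measure()
    placement = decide(metrics)
    return migrate(placement)

print(scheduler_tick())
```

In the real system this loop would run periodically in the background; the sketch only shows how the three components feed one another.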
Algorithm: Measurement
AsymSched continuously gathers metrics characterizing the volume of CPU-to-CPU and CPU-to-memory communication, to detect which threads share data:
- CPU-to-CPU: accesses to cached data
- CPU-to-memory: accesses to data located in RAM
<Diagram: counters monitoring CPU-to-CPU and CPU-to-memory accesses>
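One way to picture this measurement step: aggregating sampled per-thread access counts into a communication matrix. The sample counts below are made up for illustration, and the symmetrization is my own simplification.

```python
# Sketch: folding sampled access counts into a communication matrix.
from collections import defaultdict

# (source thread, target thread, sampled count of accesses to data
# cached by the target) -- illustrative numbers only.
samples = [
    (0, 1, 120),
    (1, 0, 95),
    (2, 3, 80),
]

comm = defaultdict(int)
for src, dst, count in samples:
    # Symmetrize: sharing between two threads is mutual.
    pair = tuple(sorted((src, dst)))
    comm[pair] += count

print(dict(comm))  # {(0, 1): 215, (2, 3): 80}
```

The decision step then only needs per-pair totals, not the raw samples.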
Algorithm: Decision
Threads that share data are grouped into clusters, and each cluster is assigned a weight. Clusters with the highest weights are scheduled on the nodes with the best connectivity.
<Diagram: threads sharing data A form cluster A, threads sharing data B form cluster B; each cluster is assigned a weight>
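Grouping threads into weighted clusters can be sketched with a union-find over the communication matrix. The pair counts are illustrative and the helper names are my own, not AsymSched's.

```python
# Sketch: cluster threads that share data; weight each cluster by
# its total communication volume (counts are made-up).

comm = {(0, 1): 215, (1, 2): 40, (3, 4): 80}  # thread-pair counts

parent = list(range(5))  # union-find over 5 threads

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Any pair that communicates belongs to the same cluster.
for (a, b), count in comm.items():
    union(a, b)

clusters = {}
for t in range(5):
    clusters.setdefault(find(t), []).append(t)

# Weight of a cluster = total communication among its threads.
weight = {root: sum(c for (a, b), c in comm.items() if find(a) == root)
          for root in clusters}
print(clusters, weight)
```

Here threads 0-2 end up in one cluster (weight 255) and threads 3-4 in another (weight 80), so the first cluster gets the best-connected nodes.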
Algorithm: Decision
AsymSched computes possible placements for all the clusters. A placement P is an array mapping each cluster to a set of nodes (Node 1 … Node n).
The number of possible placements is very large, so it is important that AsymSched not test all of them:
- When an application uses two nodes, only node pairs connected by a 16-bit link are considered
- Configurations of nodes with the same bandwidth hash to the same value, so only one representative per hash is evaluated
For each candidate, AsymSched computes P_wbw, the weighted bandwidth of P:
P_wbw = Σ_c weight(c) × bandwidth(nodes assigned to c in P)
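A toy version of this placement search, under stated assumptions: an invented 4-node machine, two clusters of two nodes each, and the same-bandwidth hashing trick to skip equivalent configurations.

```python
# Sketch: pick the placement maximizing weighted bandwidth,
# P_wbw = sum over clusters of weight(c) * bandwidth(nodes of c).
# Topology, weights, and bandwidths are illustrative.

cluster_weights = {"A": 255.0, "B": 80.0}

# Assumed bandwidth between node pairs (GB/s) on an asymmetric
# machine: some pairs are better connected than others.
pair_bw = {(0, 1): 6.0, (2, 3): 3.0, (0, 2): 3.0, (1, 3): 6.0}

def bw(nodes):
    a, b = sorted(nodes)
    return pair_bw.get((a, b), 0.0)

def weighted_bandwidth(placement):
    """placement: cluster name -> pair of nodes."""
    return sum(cluster_weights[c] * bw(nodes)
               for c, nodes in placement.items())

best = None
seen = set()  # placements with identical bandwidths score alike
candidates = [(0, 1), (2, 3), (0, 2), (1, 3)]
for pair_a in candidates:
    for pair_b in candidates:
        if set(pair_a) & set(pair_b):
            continue  # clusters cannot share nodes
        key = (bw(pair_a), bw(pair_b))
        if key in seen:
            continue  # hash trick: skip equivalent-bandwidth configs
        seen.add(key)
        p = {"A": pair_a, "B": pair_b}
        if best is None or weighted_bandwidth(p) > weighted_bandwidth(best):
            best = p
print(best)
```

As expected, the heaviest cluster (A) ends up on the fastest link, and the dedup set keeps equivalent placements from being re-scored.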
Algorithm: Migration
AsymSched migrates threads using a system call, and migrates memory either fully or dynamically (migrating only the subset of pages that are accessed):
- If, 2 seconds after the thread migration, the application still performs more than 90% of its memory accesses on the old nodes (A_old / A > 90%), full memory migration is used
- Otherwise (A_old / A < 90%), dynamic migration is used
<Diagram: threads migrated from Node 1 to Node n; the memory migration strategy is chosen after 2 seconds>
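The threshold decision above can be sketched directly. The 90% value comes from the slide; the function name and the zero-access fallback are my assumptions.

```python
# Sketch of the memory-migration policy: after a 2 s observation
# window, choose full vs. dynamic migration based on the fraction
# of accesses still hitting the old nodes (A_old / A).

FULL_MIGRATION_THRESHOLD = 0.90  # from the slide: A_old / A > 90%

def choose_migration(old_node_accesses: int, total_accesses: int) -> str:
    """Return the memory-migration strategy for one application."""
    if total_accesses == 0:
        return "dynamic"  # assumption: no accesses yet, migrate lazily
    ratio = old_node_accesses / total_accesses
    return "full" if ratio > FULL_MIGRATION_THRESHOLD else "dynamic"

print(choose_migration(95, 100))  # full
print(choose_migration(40, 100))  # dynamic
```

The intuition: if almost all accesses still go to the old nodes, it is cheaper to move everything at once than to migrate pages one by one as they are touched.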
Evaluation
Single-application workloads:
- AsymSched always performs close to the best static thread placement
- Thread placement without migration is not sufficient to achieve the best performance: performance stays close to the average, with a high standard deviation
<Figure 4. Performance difference>
Evaluation
Multi-application workloads:
- AsymSched achieves performance close to or better than the best static thread placement
- It produces a very low standard deviation
<Figure 4. Performance difference>
Conclusion
- Asymmetry of the interconnect drastically impacts performance: the bandwidth between nodes is more important than the distance
- AsymSched is a new thread and memory placement algorithm that maximizes the bandwidth available to communicating threads
- As the number of nodes in NUMA systems increases, the interconnect is less likely to remain symmetric, so AsymSched's design principles will be of growing importance in the future