Kernel-Level Support for Scalable Intra-Node Collective Communications
Hyun-Wook Jin and Joong-Yeon Cho
System Software Laboratory
Dept. of Computer Science and Engineering, Konkuk University
[email protected]
1
Contents
• MPI intra-node communication
• Intra-node collective communication
  – MPI_Bcast()
– MPI_Gather()
• Conclusions and future work
2
Multi/Many-Core Processors
• Xeon 5100 Series (Woodcrest): 2 cores
• Xeon 5500 Series (Gainestown): 4 cores
• Xeon 5600 Series (Westmere-EP): 6 cores
• Xeon E5-2600 Series (Sandy Bridge-EP): 8 cores
• Xeon E5-2600 v2 Series (Ivy Bridge-EP): 12 cores
• Xeon E5-2600 v3 Series (Haswell-EP): 18 cores
• Xeon E7 v4 Family (Broadwell): 24 cores
• Xeon Platinum Series (Skylake): 28 cores
• Xeon Phi X100 Series (Knights Corner): 61 cores
• Xeon Phi 7200 Series (Knights Landing): 72 cores
3
MPI Intra-Node Communication
• Loopback
  – The NIC provides a loopback path
  – Two DMA operations
• Shared memory
  – Communicate through a memory area shared between MPI processes
  – Two data copies
[Diagrams: loopback, in which Process A's send buffer is moved to Process B's receive buffer by two DMA operations through the NIC; and shared memory, in which the data is copied into and then out of a shared-memory region (two copies)]
4
MPI Intra-Node Communication
• Memory mapping
  – Directly moves a message from the source buffer to the destination buffer by means of kernel-level support
  – Single data copy
    • Beneficial for large messages
[Diagram: Process A's send buffer is copied directly into Process B's receive buffer (single copy)]
5
Kernel-Level Support for MMapping
• LiMIC2
  – Opened the era of one-copy intra-node communication
    • H.-W. Jin, S. Sur, L. Chai, and D. K. Panda, “LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster,” In Proc. of ICPP 2005, Jun. 2005.
    • H.-W. Jin, S. Sur, L. Chai, and D. K. Panda, “Lightweight Kernel-Level Primitives for High-Performance MPI Intra-Node Communication over Multi-Core Systems,” In Proc. of IEEE Cluster 2007, Sep. 2007.
– LiMIC2-0.5 was publicly released with MVAPICH2-1.4RC1 (Jun. 2009)
– LiMIC2-0.5.6 is being released with the latest MVAPICH2
  • mvapich2-src]$ ./configure --with-limic2 [omit other configure options]
  • mvapich2-src]$ mpirun_rsh -np 4 -hostfile ~/hosts MV2_SMP_USE_LIMIC2=1 [path to application]
6
Kernel-Level Support for MMapping
• CMA
  – In-kernel implementation + new system calls (see the sketch below)
    • J. Vienne, “Benefits of Cross Memory Attach for MPI Libraries on HPC Clusters,” In Proc. of XSEDE ’14, Jul. 2014.
  – Default intra-node communication channel for large messages in MVAPICH2
• XPMEM
  – Supports memory mapping to the user-level address space
    • B. Kocoloski and J. Lange, “XEMEM: Efficient Shared Memory for Composed Applications on Multi-OS/R Exascale Systems,” In Proc. of HPDC 2015, 2015.
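CMA is exposed to user space through the process_vm_readv() and process_vm_writev() system calls (Linux 3.2 and later). Below is a minimal sketch of the single-copy read they enable; it assumes the peer's PID and the virtual address of its send buffer have already been exchanged out of band, as an MPI library would do through shared memory.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>   /* process_vm_readv() */

/* Read len bytes starting at remote_addr in the peer process's address
 * space directly into local_buf: one copy, no shared-memory bounce buffer. */
static ssize_t cma_read(pid_t peer, void *local_buf, void *remote_addr, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    return process_vm_readv(peer, &local, 1, &remote, 1, 0);
}
```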
7
Intra-Node Collective Communication
• MPI_Bcast()
  – Broadcasts a message from the root to all other processes of the communicator
    • One-to-Many: root -> other processes
  – MVAPICH2 (version 2.3) uses the collective-aware shared memory
• MPI_Gather()
  – Gathers together values from a group of processes
    • Many-to-One: all processes -> root
  – MVAPICH2 (version 2.3) uses the kernel-level support (either CMA or LiMIC2) for large messages (a usage sketch of both calls follows)
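Both collectives have the standard MPI signatures; a minimal usage sketch, with the message size and root rank chosen arbitrarily for illustration:

```c
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int count = 256 * 1024;            /* 256KB payload, arbitrary */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One-to-Many: rank 0 (the root) broadcasts its buffer to every rank. */
    char *msg = malloc(count);
    MPI_Bcast(msg, count, MPI_CHAR, 0, MPI_COMM_WORLD);

    /* Many-to-One: every rank contributes its buffer, gathered at rank 0. */
    char *gathered = (rank == 0) ? malloc((size_t)size * count) : NULL;
    MPI_Gather(msg, count, MPI_CHAR,
               gathered, count, MPI_CHAR, 0, MPI_COMM_WORLD);

    free(gathered);
    free(msg);
    MPI_Finalize();
    return 0;
}
```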
8
MPI_BCAST
9
MPI_Bcast() in MVAPICH2 (v.2.3)
Collective-aware shared memory between the root process's source buffer and the destination buffers of the other processes (B, C, …, N); a minimal sketch of the pipelining follows below:
1. The root process copies 8KB data blocks from its source buffer into the shared memory
2. The other processes copy the 8KB data blocks from the shared memory into their destination buffers
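The sketch below illustrates the pipelining idea behind this scheme. It is not MVAPICH2's actual code: the structure layout, the ring depth, and the per-slot sequence flag used for synchronization are illustrative assumptions, and flow control for slot reuse is omitted.

```c
#include <string.h>
#include <stdatomic.h>

#define BLOCK_SIZE (8 * 1024)   /* 8KB blocks, as in the slide */
#define NUM_SLOTS  4            /* illustrative ring depth */

/* Region mapped by the root and all other processes (e.g., via mmap()). */
struct bcast_shm {
    char        data[NUM_SLOTS][BLOCK_SIZE];
    atomic_long seq[NUM_SLOTS];  /* id of the block currently held by each slot */
};

/* Root: copy the source buffer into the ring, one 8KB block at a time.
 * A real implementation must also wait for all readers before reusing a slot. */
void root_bcast(struct bcast_shm *shm, const char *src, size_t len)
{
    for (long blk = 0; (size_t)blk * BLOCK_SIZE < len; blk++) {
        size_t off = (size_t)blk * BLOCK_SIZE;
        size_t n = (len - off < BLOCK_SIZE) ? len - off : BLOCK_SIZE;
        memcpy(shm->data[blk % NUM_SLOTS], src + off, n);
        atomic_store(&shm->seq[blk % NUM_SLOTS], blk + 1);  /* publish block blk */
    }
}

/* Non-root: copy each block out as soon as the root publishes it, so the
 * root's copy of block k+1 overlaps with the readers' copies of block k. */
void nonroot_bcast(struct bcast_shm *shm, char *dst, size_t len)
{
    for (long blk = 0; (size_t)blk * BLOCK_SIZE < len; blk++) {
        while (atomic_load(&shm->seq[blk % NUM_SLOTS]) < blk + 1)
            ;  /* spin until block blk is available */
        size_t off = (size_t)blk * BLOCK_SIZE;
        size_t n = (len - off < BLOCK_SIZE) ? len - off : BLOCK_SIZE;
        memcpy(dst + off, shm->data[blk % NUM_SLOTS], n);
    }
}
```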
10
How bad is LiMIC2 for MPI_Bcast()?
• Experimentally applied LiMIC2 instead of shared memory
  – Up to 548% higher latency
11
Why not to use LiMIC2 in MPI_Bcast()?
• What we expected…
[Timeline: after P0 (the root) sends a descriptor, P1, P2, and P3 each perform memory mapping, data copy, and memory unmapping in parallel and then return from MPI_Bcast()]
12
Why not to use LiMIC2 in MPI_Bcast()?
• What actually happened…
[Timeline: after P0 (the root) sends a descriptor, the memory mappings (get_user_pages()), data copies, and memory unmappings performed by P1, P2, and P3 are largely serialized before each returns from MPI_Bcast()]
13
MPI_Bcast() with LiMIC2-overlap
• The root performs memory mapping and the others reuse (share) the mapped area
[Timeline: P0 (the root) sends a descriptor and performs the memory mapping once; P1, P2, and P3 copy the data from the shared mapping in parallel; the mapping is unmapped once before the processes return from MPI_Bcast()]
14
Preliminary Measurement Results
• 20-core system (Intel Xeon Haswell, deca-core x 2)
  – LiMIC2-overlap reduces the latency by up to 68%
• 120-core system (Intel Xeon Ivy Bridge, 15-core x 8)
  – LiMIC2-overlap reduces the latency by up to 84%
15
What’s going on in MVAPICH2 Bcast?
• Collective-aware shared memory broadcast (same mechanism as shown earlier): the root process copies 8KB data blocks from its source buffer into the shared memory, and the other processes copy them from the shared memory into their destination buffers
16
What’s going on in MVAPICH2 Bcast?
• Collective-aware shared memory
• LiMIC2-overlap
* Message size: 256KB * Some profiling overheads are included
17
What’s going on in MVAPICH2 Bcast?
• Data copy operations are not overlapped as much as expected
18
[Plot: copy timing of data blocks 1 and 2 for each process (Block ID vs. time)]
MPI_GATHER
19
MPI_Gather() in MVAPICH2 (v.2.3)
• The root process (P0) has a destination buffer and an intermediate buffer; processes P1 … P(N-1) each have a source buffer (see the sketch below)
1. Allocates an intermediate buffer
2. Moves messages to the intermediate buffer via point-to-point communication
3. Copies the gathered messages to the destination buffer
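A minimal sketch of these three steps on the root side, using plain point-to-point receives into an intermediate buffer. The real MVAPICH2 path, buffer sizing, and tags differ; this only illustrates the extra copy through the intermediate buffer, assuming each non-root rank has sent `count` bytes with a matching MPI_Send.

```c
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* Root side of an intermediate-buffer gather. */
void gather_root(char *dst, const char *own_src, int count, MPI_Comm comm)
{
    int size, root_rank;
    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &root_rank);

    /* 1. Allocate an intermediate buffer for the other ranks' messages. */
    char *tmp = malloc((size_t)size * count);

    /* 2. Move the messages into the intermediate buffer via point-to-point. */
    for (int r = 0; r < size; r++) {
        if (r == root_rank) continue;
        MPI_Recv(tmp + (size_t)r * count, count, MPI_CHAR, r, 0, comm,
                 MPI_STATUS_IGNORE);
    }

    /* 3. Copy the gathered messages (and the root's own data) into the
     *    destination buffer, ordered by rank. */
    for (int r = 0; r < size; r++) {
        const char *src = (r == root_rank) ? own_src : tmp + (size_t)r * count;
        memcpy(dst + (size_t)r * count, src, count);
    }
    free(tmp);
}
```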
20
Is it OK to use LiMIC2 in MPI_Gather()?
* Message size: 256KB * Some profiling overheads are included
21
Is it OK to use LiMIC2 in MPI_Gather()?
22
Is it OK to use LiMIC2 in MPI_Gather()?
23
Why not to use LiMIC2 in MPI_Gather()?
[Timeline: P1, P2, and P3 each send a descriptor; P0 (the root) then performs memory mapping, data copy, and memory unmapping sequentially for P1, then for P2, then for P3 before returning from MPI_Gather()]
24
MPI_Gather() with LiMIC2-overlap
[Timeline: the descriptor and memory mapping are handled once, and the data copies (with the final unmapping) proceed in parallel across the processes before MPI_Gather() returns]
25
Preliminary Measurement Results
• 20-core system
  – LiMIC2-overlap reduces the latency by up to 88%
• 120-core system
  – LiMIC2-overlap reduces the latency by up to 50%
  – Different algorithms matter (e.g., the binomial tree algorithm; a sketch of the pattern follows)
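The binomial tree pattern mentioned above spreads the work over log2(P) rounds instead of funneling everything through the root at once. A minimal sketch of the textbook pattern is shown below for a broadcast (a binomial gather mirrors it, with the data flowing toward the root); this is not MVAPICH2's implementation.

```c
#include <mpi.h>

/* Textbook binomial-tree broadcast: in round k, every rank that already holds
 * the data forwards it to the rank 2^k positions away (in root-relative order). */
void binomial_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int vrank = (rank - root + size) % size;   /* rank relative to the root */

    for (int mask = 1; mask < size; mask <<= 1) {
        if (vrank < mask) {
            /* This rank already holds the data: forward it down the tree. */
            int dst = vrank + mask;
            if (dst < size)
                MPI_Send(buf, count, type, (dst + root) % size, 0, comm);
        } else if (vrank < 2 * mask) {
            /* First round in which this rank's parent reaches it. */
            MPI_Recv(buf, count, type, (vrank - mask + root) % size, 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }
}
```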
26
CONCLUSIONS
27
Concluding Remark
• Intra-node collective communication
  – MPI_Bcast()
    • One-to-Many communication
    • Implemented using collective-aware shared memory
  – MPI_Gather()
    • Many-to-One communication
    • Implemented using point-to-point communication
• LiMIC2-overlap
  – New interfaces
    • Memory mapping reuse
    • Flexibility of who can perform the data copy
  – Up to 84% improvement for MPI_Bcast()
  – Up to 88% improvement for MPI_Gather()
28
Ongoing Work
• Other collectives
  – MPI_Scatter()
    • LiMIC2-overlap reduces the latency by up to 78% on the 20-core system
  – MPI_Allgather()
    • LiMIC2-overlap reduces the latency by up to 38% on the 20-core system
29
Ongoing Work
• Overlapping collective communication with computation (an MPI-level sketch follows the timelines below)
30
[Timelines: without overlap, each process completes MPI_Bcast() before starting its computation; with overlap, MPI_Bcast() returns early, the computation proceeds, and a later sync point guarantees the broadcast data has arrived]
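At the MPI interface level, the same kind of overlap can be expressed with the standard nonblocking collective MPI_Ibcast() (MPI-3): start the broadcast, compute on independent data, and synchronize only when the broadcast result is needed. A minimal sketch, independent of the LiMIC2-based mechanism discussed here:

```c
#include <mpi.h>

/* Overlap a broadcast (root = rank 0) with computation on independent data. */
void bcast_with_overlap(char *buf, int count, double *x, int n, MPI_Comm comm)
{
    MPI_Request req;

    /* Start the broadcast but do not wait for it to complete. */
    MPI_Ibcast(buf, count, MPI_CHAR, 0, comm, &req);

    /* Computation that does not touch buf overlaps with the transfer. */
    for (int i = 0; i < n; i++)
        x[i] = x[i] * 2.0 + 1.0;

    /* buf is only guaranteed to hold the broadcast data after the wait. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```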
Future Work
LiMIC3
31
ParaMo 2019
• The 1st International Workshop on Parallel Programming Models in High-Performance Cloud
  – Co-located with Euro-Par 2019
  – Date: August 26, 2019
  – Venue: Göttingen, Germany
32
Thank You!
33