
MPI Communication Benchmarking on Intel Xeon Dual Quad-Core Processor Cluster

Roswan Ismail 1, 2, Nor Asila Wati Abdul Hamid 2, Mohamed Othman 2, Rohaya Latip 2, Mohd Azizi Sanwani 3

1 Faculty of Art, Computing & Creative Industry, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, Malaysia
[email protected]

2 Faculty of Computer Science & Information Technology, Universiti Putra Malaysia, Selangor Darul Ehsan, Malaysia
{asila, mothman, rohaya}@fsktm.upm.edu.my

3 Centre for Diploma Programme, Multimedia University, Cyberjaya, Selangor, Malaysia
[email protected]

Abstract—This paper reports measurements of MPI communication benchmarks on the Khaldun cluster, which ran Linux-based IBM Blade HS21 servers with dual quad-core Intel Xeon processors and a Gigabit Ethernet interconnect. The measurements were taken with the SKaMPI and IMB benchmark programs and are the first results obtained with these tools for the Open MPI implementation on the Khaldun cluster. The point-to-point and collective communication results from the two benchmark programs are compared and analyzed, showing that different MPI benchmark programs yield different results because they use different measurement techniques. The results are then compared with those from a cluster with dual quad-core Opteron processors and a Gigabit Ethernet interconnect; this analysis indicates that the machine architecture also affects the results.

Keywords-MPI benchmarks; parallel computer; IBM Blade HS21 Server; multi-core; SKaMPI; IMB

I. INTRODUCTION

Nowadays, the requirement for powerful and fast computers to solve numerical problems, especially in areas requiring great computational speed, is crucial. Examples of such problems include numerical simulations of scientific and engineering problems. These problems typically require repetitive calculations on large amounts of data and must be completed within a reasonable time.

For many years, parallel computing has been regarded as a way to increase computational speed. Unlike sequential computers, parallel computers allow more than one processor to run concurrently on a large problem, so the problem is solved considerably faster. In addition, recent technological breakthroughs have enabled the production of multi-core processors that further increase processing capability. These features are commonly found in clusters used in industry, research facilities, and academia.

Most parallel programs that run on these clusters use the Message Passing Interface (MPI) to communicate data between nodes. Consequently, analysis and evaluation of the MPI routines on these clusters are vital. This paper discusses the MPI communication performance results on the Khaldun cluster obtained from SKaMPI and IMB. The results from both applications are compared and analyzed for verification. They should be useful for users of BIRUNI GRID and for researchers interested in further performance analysis of MPI implementations on multi-core architectures and of the Open MPI library.

II. RELATED WORKS

Many research studies have focused on the performance evaluation, analysis, and optimization of newly developed High Performance Computing (HPC) clusters. Previous works provided performance analyses of different types of machines such as the AlphaServer SC [1] and the Cray T3E/900 and IBM RS 6000 SP [2]. All of those studies were done on multi-processor (single-core) nodes, whereas this study was done on more advanced quad-core processor technology.

Other related works evaluated performance on clusters with ccNUMA nodes [3, 4, 5] and on multi-core architectures such as dual-core Opteron nodes [6, 7, 8, 9], quad-core Opteron nodes [10], and the quad-core Cray XT platform [11]. Comparing the performance of MPI routines in those studies with the present work is useful because of the different architectures involved (Opteron versus Xeon).

There were also studies on the performance of MPI communication on commodity Linux clusters using Fast Ethernet and Myrinet networks [12, 13]. Additionally, [13] demonstrated customizing the MPICH code so that the change-over points used to select among collective communication algorithms in MPICH could be set to their optimum values.

Unlike the previous related works, the work presented in this article measures MPI communication performance on a cluster with dual quad-core Intel Xeon nodes and a Gigabit Ethernet interconnect. Moreover, since most parallel programs used on HPC systems rely on MPI for inter-node communication, it is imperative that the performance of MPI routines on the HPC clusters of BIRUNI GRID be documented, evaluated, and analyzed for future improvement. In this study, the MPI implementation used was Open MPI version 1.3.3.

III. EXPERIMENTS ON KHALDUN

A. MPI Benchmark Programs Used

Several benchmark programs can be used to measure MPI performance on parallel supercomputers. The most commonly used are SKaMPI [14], Mpptest [15], IMB [16], MPBench [17] and, most recently developed, MPIBench [18]. However, this paper discusses only the MPI communication results obtained on the Khaldun cluster with SKaMPI and IMB, chosen for the excellent documentation available for both applications [14, 16].

B. Khaldun Cluster Architecture

The experiments were conducted on the Khaldun cluster, one of the three HPC clusters of BIRUNI GRID. BIRUNI GRID is a project commissioned by UPM with the goal of making it part of the HPC clusters for A-Grid [19]. The project, which started in 2008 and was funded by EuAsiaGrid, was developed and managed by the Infocomm Development Centre (iDEC) of UPM and was fully configured and deployed by the UPM Grid Team. The only parts done by the supplier were the hardware racking and the initial power-up.

Figure 1 shows the deployment scheme of the Khaldun cluster, which consisted of six worker nodes. Each node had two 2 GHz Intel Xeon E5405 quad-core processors and 8 GB of RAM. All nodes were connected through a switch in a star topology. The inter-node interconnect for Khaldun was Gigabit Ethernet with a maximum data transfer rate of 1 Gb/s (full duplex). The detailed configuration of Khaldun is listed in Table 1, while Figures 2 and 3 show the block diagram of the Intel Xeon E5405 quad-core processor.

Figure 1. Khaldun Cluster Deployment [19]

TABLE I. KHALDUN CONFIGURATIONS [19]

Number of nodes       6
Machine               IBM Blade HS21 Servers
CPU                   2 x Intel Xeon Quad-Core 2 GHz processors (8 cores per node)
RAM                   8 GB
Storage capacity      Each node has 2 x 147 GB (only 1 x 147 GB opened; the rest is reserved for future use (multilayer grid))
O.S                   Scientific Linux 5.4 64-bit
Compiler              gcc compiler
Interconnect          Gigabit Ethernet switch
MPI Implementation    Open MPI 1.3.3

Figure 2. Block Diagram of Intel® Xeon® Processor E5405 (I) [20]

Figure 3. Block Diagram of Intel® Xeon® Processor E5405 (II) [20]

IV. METHODOLOGY

The experiments involved installation and functionality tests of the SKaMPI and IMB applications on the Khaldun cluster. Common procedures, such as the same data sizes, the same MPI routines, and an identical number of iterations, were applied to all tests in order to standardize the experiments. All tests for both applications were run multiple times to ensure that the results were consistent. Any abnormalities observed were scrutinized and the experiments rerun to rule out any external factors that might have affected the results.

Before measurements were taken, the data sizes were set from 4 bytes up to 4 MB. The number of repetitions per MPI operation was set to 1000, the default setting of IMB. The MPI routines selected to be measured and reported in this article were MPI_Send/MPI_Recv, MPI_Sendrecv, MPI_Bcast, MPI_Alltoall, MPI_Scatter and MPI_Gather.

All measurements were run with exclusive access to the corresponding nodes, so no external process could affect the results. Up to four nodes were used, since the measurements were taken on 2, 4, 8, 16 and 32 of the 48 available cores; the two remaining nodes were available to other processes. The data obtained from the experiments were then recorded and analyzed.

A. Communication Method

SKaMPI and IMB use different communication patterns [3, 12, 14]. Figures 4 and 5 illustrate the point-to-point communication method for intra-node and inter-node communication, respectively.

The green line in each figure represents the core selected by IMB as the communication partner of core 0, while the red line shows the core selected when the measurements were done using SKaMPI. The blue lines show the probe measurements taken by IMB and SKaMPI to determine the fastest and slowest cores as candidate partners for the sender core. SKaMPI performs a short test on all cores to find the core with the slowest communication to the sender, whereas IMB does the opposite and finds the fastest.

Because IMB's default point-to-point pattern is to find the fastest core to communicate with, accurate inter-node measurements were a problem: send and receive operations in IMB would always occur intra-node. Therefore, to measure communication time between cores on different nodes, the location of the cores had to be specified so that the sender core was forced to communicate with a core on another node. This was done for point-to-point communication on 16 and 32 cores by using the PBS command file options, as sketched below.
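The paper does not reproduce the job script, so the following is only a sketch of how such placement might be expressed in a PBS command file with Open MPI, assuming a Torque/PBS environment; the job name and walltime are arbitrary, and --bynode is the Open MPI 1.3 option that assigns consecutive ranks to different nodes.

    #!/bin/bash
    #PBS -N imb_internode            # job name (hypothetical)
    #PBS -l nodes=2:ppn=8            # two full eight-core blades, exclusive access
    #PBS -l walltime=01:00:00
    cd $PBS_O_WORKDIR

    # Map ranks round-robin over the nodes so that rank 0 and rank 1 land on
    # different blades; their point-to-point traffic then crosses Gigabit Ethernet.
    mpirun -np 16 --hostfile $PBS_NODEFILE --bynode ./IMB-MPI1 PingPong Sendrecv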

Figure 4. Intra-Node Communication Method on 8 cores

Figure 5. Inter-Node Communication Method on 16 cores

V. RESULTS

A. Point to Point Communication

SKaMPI uses the Pingpong_Send_Recv and Pingpong_SendRecv functions to measure point-to-point communication [14]. They return the average time needed for one full message round trip, i.e. the time for one 'ping' plus the time for one 'pong'; IMB, in contrast, reports only half of that time. Accordingly, the SKaMPI latency results were halved before being compared with IMB.
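For illustration, a minimal ping-pong timing loop in C follows the sketch below. This is not the SKaMPI or IMB source, only an assumption about the common pattern; the message size is an arbitrary example, and the half round-trip time is reported so that it matches the IMB convention.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define MSG_SIZE 1024      /* example message size in bytes */
    #define REPS     1000      /* repetitions, matching the IMB default */

    int main(int argc, char **argv)
    {
        int rank;
        char buf[MSG_SIZE];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 0, MSG_SIZE);

        MPI_Barrier(MPI_COMM_WORLD);            /* start both ranks together */
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {                    /* ping ... */
                MPI_Send(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {             /* ... pong */
                MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double roundtrip = (MPI_Wtime() - t0) / REPS;
        if (rank == 0)
            printf("avg half round-trip: %f us\n", roundtrip / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }

The loop is run with at least two ranks (e.g. mpirun -np 2 ./pingpong), placed on different nodes when an inter-node measurement is wanted.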

• Send/Receive

Figure 6 shows the average time for point-to-point communication on 16 cores of the Khaldun cluster. The result reflects the Gigabit Ethernet latency with some additional overhead. IMB obtained lower results than SKaMPI because it only measured communication time between cores on the same node.

SKaMPI, by default, measured communication between different nodes because it paired the root processor with the node having the slowest communication time. However, once the locations of the communicating nodes were specified, the IMB results changed and became nearly identical to the SKaMPI results.

Figure 6. Comparison between SKaMPI and IMB for point to point (send/receive) communication on 16 cores on Khaldun


Figure 7 shows the SKaMPI latency results for different numbers of cores on Khaldun. It indicates that the communication time for the send/receive operation increased consistently with message length. The SKaMPI results for 2, 4 and 8 cores were the lowest because they measured only intra-node communication, which incurred less overhead than the 16- and 32-core cases involving inter-node communication.

Figure 7. Communication Performance results from SKaMPI for MPI_Send/MPI_Recv on different number of cores on Khaldun

These results were then compared with results from experiments performed on a cluster with 2.3 GHz quad-core Opteron (Barcelona) processors [10]. Each node had dual quad-core processors and 16 GB of memory and was configured with a Gigabit Ethernet network connected to a Cisco switch.

Table 2 compares the latency and bandwidth of point-to-point communication for the two processor architectures. While the results were comparable, the latency of send/receive communication on Khaldun (Xeon) was marginally higher than on Barcelona (Opteron), about 46.54 µs versus 46.52 µs. Correspondingly, the bandwidth obtained on Khaldun was slightly lower than on Barcelona, about 111.57 MB/s versus 112.5 MB/s.

Table 2. Latency and Bandwidth of Point-to-Point Communication on 16 cores for different processors with the same interconnect type

Interconnect                                Bandwidth (MB/s)   Latency (µs)
Gigabit Ethernet (Xeon processor)           111.57             46.54
Gigabit Ethernet (Opteron processor) [10]   112.5              46.52

The relatively higher latency and lower bandwidth on the Khaldun cluster compared with Barcelona might be due to the fact that the Khaldun machines have only 2 GHz processors and 8 GB of RAM, whereas Barcelona has slightly faster 2.3 GHz processors and larger 16 GB memory. Therefore, the MPI point-to-point message passing performance on Barcelona was marginally better than on Khaldun because of the processor speed and memory size.

• Combined Send and Receive

Figure 8 shows the MPI_Sendrecv measurements for SKaMPI and IMB, which use a slightly different communication technique compared with MPI_Send/MPI_Recv. The measurements show that Khaldun supports bidirectional bandwidth, i.e. messages can be sent simultaneously in both directions.
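As a rough illustration of the routine being timed, a minimal MPI_Sendrecv exchange between two ranks might look like the sketch below; the buffer size and tag are arbitrary assumptions.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define MSG_SIZE 1024   /* example message size */

    int main(int argc, char **argv)
    {
        int rank, size;
        char sendbuf[MSG_SIZE], recvbuf[MSG_SIZE];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        memset(sendbuf, rank, MSG_SIZE);

        if (size >= 2 && rank < 2) {
            int peer = 1 - rank;                /* rank 0 pairs with rank 1 */
            /* Both directions are in flight at once, which is what exercises
               the bidirectional bandwidth of the link. */
            MPI_Sendrecv(sendbuf, MSG_SIZE, MPI_BYTE, peer, 0,
                         recvbuf, MSG_SIZE, MPI_BYTE, peer, 0,
                         MPI_COMM_WORLD, &status);
            printf("rank %d exchanged %d bytes with rank %d\n", rank, MSG_SIZE, peer);
        }

        MPI_Finalize();
        return 0;
    }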

Nevertheless, the IMB results after the communicating nodes were specified differed slightly from those in Figure 6: starting at the 65,536-byte threshold, IMB's latency was higher than SKaMPI's, as shown in Figure 8.

Figure 8. Comparison between SKaMPI and IMB for point to point (combined send and receive) communication on 16 cores on Khaldun

B. Collective Communication

• Broadcast

MPI_Bcast is one of the most commonly used collective routines. It enables the root process to broadcast the data in its buffer to all processes in the communicator [21]. By default, IMB places the data to be broadcast in cache memory before a measurement is taken, and it assigns a different root processor for each repetition [3].

However, SKaMPI by default ensures that the data to be broadcast are not in cache but are fetched directly from main memory. For broadcast synchronization, it uses MPI_Barrier as an additional operation before each repetition to avoid biased results, since the root node is the first to complete in SKaMPI [3].
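A barrier-synchronized broadcast timing loop in the style described above could be sketched as follows; this is an illustrative assumption rather than the SKaMPI code, and the slowest rank's time is taken as the completion time of the collective.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG_SIZE (1 << 20)   /* 1 MB broadcast payload (example value) */
    #define REPS     1000

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(MSG_SIZE);
        double local = 0.0;

        for (int i = 0; i < REPS; i++) {
            MPI_Barrier(MPI_COMM_WORLD);        /* synchronize before each repetition */
            double t0 = MPI_Wtime();
            MPI_Bcast(buf, MSG_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);
            local += MPI_Wtime() - t0;
        }

        /* The root finishes first, so report the slowest rank's time. */
        double slowest;
        MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("avg broadcast time: %f us\n", slowest / REPS * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }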

Figure 9 shows that SKaMPI and IMB gave rather similar results for the broadcast operation. However, for larger messages, IMB gave higher results than SKaMPI: the average time taken by IMB to complete the broadcast operation on 16 cores was longer than that obtained by SKaMPI.

IMB gave higher results because of the additional overhead of changing the root node between iterations. It was also noted that from 262,144 to 1,048,576 bytes the results showed a gap, as the change-over point of the algorithm used to broadcast the message took place; after that point, the results reverted to the previous trend.

Figure 9. Comparison between SKaMPI and IMB for MPI_Bcast on 16 cores on Khaldun

Figure 10 shows the average time for the MPI_Bcast operation on different numbers of cores on Khaldun, measured with SKaMPI. As anticipated, MPI_Bcast on 32 and 16 cores gave the highest results. These were followed by MPI_Bcast on 8 and 4 cores, with 2 cores yielding the lowest average time. The mix of intra-node and inter-node communication affected the results because of the overhead involved.

Nonetheless, the broadcast latency for 32 and 16 cores at 1,048,576 bytes decreased slightly and moved closer to the results for 8, 4 and 2 cores. This happened because of the change-over point of the algorithm used to broadcast messages from small to medium sizes. In this case, the change-over point affected the results of inter-node communication, which involved more than eight cores, but not intra-node communication, where the algorithm performed well for all message sizes.

Figure 10. MPI_Bcast Comparison from SKaMPI on different cores on Khaldun

• All-to-all

MPI_Alltoall sends distinct data from every process to every other process in the same group [21]; in effect, each process performs a scatter operation in turn. Figure 11 shows the SKaMPI results for MPI_Alltoall on different numbers of cores on the Khaldun cluster.
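For reference, a minimal MPI_Alltoall call in C moves one block from every rank to every other rank; the sketch below is illustrative only, with a one-integer block per destination as an arbitrary example.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank sends one int to every rank, including itself. */
        int *sendbuf = malloc(size * sizeof(int));
        int *recvbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++)
            sendbuf[i] = rank * 100 + i;        /* block i is addressed to rank i */

        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        /* recvbuf[j] now holds the block that rank j addressed to this rank. */
        printf("rank %d received %d from rank 0\n", rank, recvbuf[0]);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }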

As expected, the MPI_Alltoall operation on the largest number of cores gave the highest latency compared with smaller core counts. The results showed a consistently increasing trend for all core counts, which implies that the algorithm used performed well for all message sizes.

Figure 11. SKaMPI results for MPI_Alltoall on different cores

• Scatter and Gather

MPI_Scatter distributes distinct data from the root process to all processes in the group, including itself, while MPI_Gather performs the reverse operation by recombining the data from each process into a single large data set [21]. In the gather case, each process, including the root, sends the contents of its send buffer to the root process.
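A small round-trip example, given only as an illustration, scatters one block per rank from the root and then gathers the blocks back:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *full = NULL;
        if (rank == 0) {                        /* root prepares one int per rank */
            full = malloc(size * sizeof(int));
            for (int i = 0; i < size; i++)
                full[i] = i * i;
        }

        int piece;
        /* Root distributes one block to every rank, itself included. */
        MPI_Scatter(full, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);

        piece += rank;                          /* some local work on the block */

        /* Every rank returns its block; root reassembles the full data set. */
        MPI_Gather(&piece, 1, MPI_INT, full, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            printf("gathered %d blocks\n", size);
            free(full);
        }
        MPI_Finalize();
        return 0;
    }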

Figures 12 and 13 show the SKaMPI results for MPI_Scatter and MPI_Gather on different numbers of cores on Khaldun. The MPI_Scatter operation on 32 cores gave the highest result, followed by 16, 8, 4 and 2 cores. The same pattern was repeated for MPI_Gather, with 32 cores giving the highest result.

From these observations, it can be concluded that MPI_Scatter and MPI_Gather with 32 and 16 cores gave higher results than the other configurations because they involved inter-node communication; they took longer to complete since data had to be distributed to and gathered from more processors. Accordingly, MPI_Scatter and MPI_Gather on 8, 4 and 2 cores completed quickly because the communication occurred within the same node.

It was also noted that the trends of the MPI_Scatter and MPI_Gather results were similar to the trend of the MPI_Alltoall results, i.e. the algorithms used to scatter and gather messages performed very well for all message sizes.


Figure 12. SKaMPI results for MPI_Scatter on different cores

Figure 13. SKaMPI results for MPI_Gather on different cores

VI. CONCLUSIONS

The measured performance of MPI routines on a cluster depends on the measurement techniques applied by the MPI benchmark programs and on how the communication is synchronized. SKaMPI and IMB presented different results because they use different methods for selecting communicating cores and for synchronizing broadcasts. However, for point-to-point communication, the results became virtually identical once the locations of the communicating nodes were specified for IMB.

The comparison of point-to-point communication results between Khaldun and Barcelona also demonstrated that the processor type, clock frequency and memory size directly influence communication performance.

Lastly, the mode of communication, inter-node versus intra-node, also affected the results. Inter-node communication on Khaldun showed relatively higher latency and lower bandwidth, while intra-node communication showed the opposite, since it occurred within the same node and produced less overhead than inter-node communication.

ACKNOWLEDGMENT

This work was done on the Biruni GRID, UPM. Thanks to iDEC for access to the Biruni Grid. Special thanks to Muhammad Farhan Sjaugi for testing support and useful feedback.

REFERENCES

[1] D.A.Grove and P.D. Coddington. Performance Analysis of MPI Communications on the AlphaServer SC. Proc. of APAC’03, Gold Coast, 2003.

[2] T.Worsch, R. Reussner, and W. Augustin. On benchmarking collective mpi operations. In Proceedings of the 9th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 271–279, London, UK, 2002. Springer-Verlag.

[3] N.A.W Abdul Hamid and P.D. Coddington. Comparison of MPI Benchmark Programs on Shared Memory and Distributed Memory Machines (Point-to-Point Communication). In International Journal of High Performance Computing Applications, 7 November 2010.

[4] A. Kayi, E. Kornkven, T. El-Ghazawi and G. Newby. Application Performance Tuning for Clusters with ccNUMA Nodes. In 11th IEEE International Conference on Computational Science and Engineering, 2008.

[5] N.A.W Abdul Hamid, P.D. Coddington and F.A Vaughan. Performance Analysis of MPI Communications on the SGI Altix 3700, Proc. Australian Partnership for Advanced Computing Conference (APAC’05), Gold Coast, Australia, September 2005.

[6] Sadaf R. Alam, et al. Characterization of Scientific Workloads on Systems with Multi-Core Processors. In International Symposium on Workload Characterization, 2006.

[7] A. Kayi, Y. Yao, T. El-Ghazawi, and G. Newby, “Experimental Evaluation of Emerging Multi-core Architectures”, 21st IEEE International Parallel & Distributed Processing Symposium PMEO-PDS workshop proceedings, Long Beach, CA, March 2007.

[8] Milfeld, K.; Purkayastha, A.; Goto, K.; Guiang, C.; Schulz, K. (May 2007) "Effective Use of Multi-Core Commodity Systems in HPC." Submitted to the 8th LCI International Conference on High Performance Clustered Computing. Lake Tahoe, CA.

[9] Richard F. Barret, Sadaf R. Alam and Jeffrey S. Vetter. Performance Evaluation of the Cray XT3 Configured with Dual Core Opteron Processors. In SIGPLAN’05, June 2005.

[10] Swamy .N. Kandadai, and Xinghong He. Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet, IBM, 2007.

[11] R. Brightwell, “Exploiting direct access shared memory for mpi on multi-core processors,” Int. J. High Perform. Comput. Appl., vol. 24, no. 1, pp. 69–77, 2010.

[12] N.A.W Abdul Hamid and P.D. Coddington. Averages, Distributions and Scalability of MPI Communication Times for Ethernet and Myrinet Networks.

[13] N.A.W Abdul Hamid and P.D. Coddington. Analysis of Algorithm Selection for Optimizing Collective Communication with MPICH for Ethernet and Myrinet Networks.

[14] SKaMPI. http://liinwww.ira.uka.de/~skampi/

[15] Mpptest. http://www.mcs.anl.gov/research/projects/mpi/mpptest/

[16] Pallas MPI Benchmark. http://www.pallas.de/pages/pmbd.htm

[17] MPBench. http://icl.cs.utk.edu/projects/llcbench/mpbench.html

[18] MPIBench. http://www.dhpc.edelaide.edu.au/projects/MPIBench

[19] Biruni Grid. InfoComm Development Centre (iDEC) of University Putra Malaysia (UPM). http://biruni.upm.my/

[20] Intel. http://ark.intel.com/

[21] MPI: A Message Passing Interface Standard. http://www.mpi-forum.org/docs/mpi1-report.pdf
