
Implementing a Linux Cluster

A project report submitted in partial fulfilment of
the requirements for the award of the degree of
Bachelor of Technology
in
Computer Science and Engineering

by

Deepak Lukose
Roll No: 16, Group No: 8
S8 CSE

Department of Computer Engineering
National Institute of Technology, Calicut
Kerala - 673601

2006


Certificate

This is to certify that the project entitled “Implementing a Linux Cluster” is a bonafide record of the project presented by Deepak Lukose (Y2.025) under our supervision and guidance. The project report has been submitted to the Department of Computer Engineering of the National Institute of Technology, Calicut in partial fulfilment of the requirements for the award of the Degree of Bachelor of Technology in Computer Science and Engineering.

Dr. M.P. Sebastian
Professor and Head
Dept. of Computer Engineering
NIT Calicut

Mr. Vinod Pathari
Lecturer
Dept. of Computer Engineering
NIT Calicut


Abstract

A computer cluster is a group of loosely coupled computers that work together closely, so that in many respects they can be viewed as a single computer. Clusters are commonly connected through fast local area networks and are usually deployed to improve speed and/or reliability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or reliability.

Clusters built from open source software, particularly those based on the GNU/Linux operating system, are increasingly popular. Their success is not hard to explain, because they can cheaply solve an ever-widening range of number-crunching applications. A wealth of open source or free software has emerged to make it easy to set up, administer, and program these clusters. This work aims at an implementation of a free and open source cluster for performing scientific computations at a faster pace.


Acknowledgements

I express my sincere thanks to Mr. Vinod Pathari for his constant backing and support. I would like to extend my gratitude to the entire faculty and staff of the CSE Department of NITC, who stood by me in all the difficulties I had to face during the completion of this project. Last but not least, I thank God Almighty for being the guiding light all throughout.

Deepak Lukose


Contents

Chapter

1 Introduction
  1.1 Problem Definition
  1.2 Cluster history
  1.3 High-performance (HPC) clusters
  1.4 Cluster technologies

2 Motivation

3 Design
  3.1 Design decisions
    3.1.1 Mission
    3.1.2 Operating System
    3.1.3 Cluster Software
    3.1.4 Hardware

4 Cluster Installation
  4.1 OSCAR
  4.2 Packages
  4.3 Installation Strategy

5 Parallel Programming
  5.1 Hello World
    5.1.1 MPI_Init
    5.1.2 MPI_Finalize
    5.1.3 MPI_Comm_size
    5.1.4 MPI_Comm_rank
    5.1.5 MPI_Get_processor_name
  5.2 Numerical Integration

6 Molecular Dynamics Modeling of Thermal Conductivity of Engineering Fluids
  and its Enhancement due to Nanoparticle Inclusion
  6.1 Algorithm
  6.2 Profiling

7 Testing and Benchmarking

8 Conclusion


Chapter 1

Introduction

1.1 Problem Definition

To set up a cluster that can increase the performance of CPU-intensive tasks such as scientific computations, graphics rendering, etc. In short, to create a high-performance cluster for general-purpose computing, with special regard to the modeling and simulation of molecular systems.

1.2 Cluster history

The first commodity clustering product was ARCnet, developed by Datapoint in 1977. ARCnet was not a commercial success, and clustering did not really take off until DEC released their VAXcluster product in the 1980s for the VAX/VMS operating system. The ARCnet and VAXcluster products not only supported parallel computing, but also shared file systems and peripheral devices. They were supposed to give us the advantage of parallel processing while maintaining data reliability and uniqueness. VAXcluster, now VMScluster, is still available on OpenVMS systems from HP running on Alpha and Itanium systems.

The history of cluster computing is intimately tied up with the evolution of networking technology. As networking technology has become cheaper and faster, cluster computers have become significantly more attractive.


1.3 High-performance (HPC) clusters

High-performance clusters are implemented primarily to provide increased performance by splitting a computational task across many different nodes in the cluster, and are most commonly used in scientific computing. One of the more popular HPC implementations is a cluster with nodes running Linux as the OS and free software to implement the parallelism. This configuration is often referred to as a Beowulf cluster. Such clusters commonly run custom programs which have been designed to exploit the parallelism available on HPC clusters. Many such programs use libraries such as the Message Passing Interface (MPI), which are specially designed for writing scientific applications for HPC computers.

1.4 Cluster technologies

The Message Passing Interface (MPI) is a computer communications protocol. It is a de facto standard for communication among the nodes running a parallel program on a distributed memory system. MPI is a library of routines that can be called from Fortran, C, C++ and Ada programs. MPI's advantage over older message passing libraries is that it is both portable (because MPI has been implemented for almost every distributed memory architecture) and fast (because each implementation is optimized for the hardware it runs on).

The GNU/Linux world sports various cluster software, such as:

Specialized application clusters - Clusters built using specialized applications fall in this category, e.g. Beowulf, distcc, MPICH and others. distcc provides parallel compilation when using GCC.

Director-based clusters - These clusters allow incoming requests for services to be distributed across multiple cluster nodes, e.g. Linux Virtual Server and Linux-HA.

Full-blown clusters - This category includes clusters which integrate into the kernel a mechanism for automatic process migration among homogeneous nodes, e.g. Mosix, openMosix, Kerrighed and OpenSSI. OpenSSI and Kerrighed are single-system image implementations.

DragonFly BSD, a recent fork of FreeBSD 4.8, is being redesigned at its core to enable native clustering capabilities. It also aims to achieve single-system image capabilities.

MSCS is Microsoft's high-availability cluster service for Windows, based on technology developed by Digital Equipment Corporation. The current version supports up to eight nodes in a single cluster, typically connected to a SAN. A set of APIs supports cluster-aware applications, while generic templates provide support for non-cluster-aware applications.

Grid computing is a technology closely related to cluster computing. The key differences between grids and traditional clusters are that grids connect collections of computers which do not fully trust each other, and hence operate more like a computing utility than like a single computer. In addition, grids typically support more heterogeneous collections than are commonly supported in clusters.


Chapter 2

Motivation

Most of the time, the computer is idle. Start a program like xload or top that monitors system use, and one will probably find that the processor's load is not even hitting the 1.0 mark. If one has two or more computers, chances are that at any given time, at least one of them is doing nothing. Unfortunately, when we really do need CPU power - during a C++ compile, or while encoding Ogg Vorbis music files - we need a lot of it at once. The idea behind clustering is to spread these loads among all available computers, using the resources that are free on other machines.

The basic unit of a cluster is a single computer, also called a “node”. Clusters can grow in size - they “scale” - by adding more machines. The power of the cluster as a whole depends on the speed of the individual computers and of their network connections. In addition, the operating system of the cluster must make the best use of the available hardware in response to changing conditions. This becomes more of a challenge if the cluster is composed of different hardware types (a “heterogeneous” cluster), if the configuration of the cluster changes unpredictably (machines joining and leaving the cluster), and if the loads cannot be predicted ahead of time.


Chapter 3

Design

Designing a cluster entails four sets of design decisions:

(1) Determine the overall mission of the cluster.

(2) Select a general architecture for the cluster.

(3) Select the operating system, cluster software, and other system software that will be used.

(4) Select the hardware for the cluster.

While each of these tasks, in part, depends on the others, the first step is crucial. If at all possible, the cluster's mission should drive all other design decisions. At the very least, the other design decisions must be made in the context of the cluster's mission and be consistent with it.

Selecting the hardware should be the final step in the design, but often we won't have as much choice as we would like. A number of constraints may force us to select the hardware early in the design process, the most obvious being the budget.

Defining what we want to do with the cluster is really the first step in designing it. For many clusters, the mission will be clearly understood in advance. This is particularly true if the cluster has a single use or a few clearly defined uses.


But it should be noted that clusters have a way of evolving. What may be a reasonable assessment of needs today may not be tomorrow. Good design is often the art of balancing today's resources with tomorrow's needs.

3.1 Design decisions

3.1.1 Mission

To set up a cluster that can increase the performance of CPU-intensive tasks such as scientific computations, graphics rendering, etc. In short, to create a high-performance cluster for general-purpose computing, with special regard to the modeling and simulation of molecular systems.

3.1.2 Operating System

Red Hat Enterprise Linux 4 (RHEL4) was selected because it is one of the platforms most widely supported by cluster software.

3.1.3 Cluster Software

OSCAR (Open Source Cluster Application Resources) is a software package for building high-performance Linux clusters for parallel/distributed computing needs. OSCAR works well across Linux distributions. For greater control over how the cluster is configured, one will be happier with OSCAR in the long run. Typically, OSCAR provides better documentation than other cluster kits like Rocks. OSCAR was chosen over a traditional Beowulf setup due to its ease of installation as well as its comprehensive package set, which includes many compilers and application software. One of the main packages used with OSCAR is LAM/MPI, a popular implementation of the MPI parallel programming paradigm. LAM/MPI can be used for developing parallel programs and is one of the leading implementations available.


3.1.4 Hardware

The preferred hardware is P4 3.0 GHz machines, each node with 2 GB of DDR RAM and a 1 Gbps/100 Mbps NIC that supports PXE (Preboot Execution Environment) for network booting. The head node must have at least 7 GB of hard disk capacity, and all other client nodes need a minimum of 5 GB of hard disk capacity.


Chapter 4

Cluster Installation

One of the more important developments in the short life of high performance clusters has been the creation of cluster installation kits such as OSCAR (Open Source Cluster Application Resources) and Rocks. With software packages like these, it is possible to install everything one needs and very quickly have a fully functional cluster. A fully functional cluster will have a number of software packages, each addressing a different need, such as programming, management, and scheduling.

4.1 OSCAR

OSCAR is a software package that is designed to simplify cluster installation. A collection of open source cluster software, OSCAR includes everything that one is likely to need for a dedicated, high-performance cluster. OSCAR uses a best-in-category approach, selecting the best available software for each type of cluster-related task. One will often have several products to choose from for any given need.

The design goals for OSCAR include using the best-of-class software, eliminating the downloading, installation, and configuration of individual components, and moving toward the standardization of clusters. OSCAR, it is said, reduces the need for expertise in setting up a cluster because OSCAR takes us completely through the installation of a cluster. In practice, it might be more fitting to say that OSCAR delays the need for expertise and allows us to create a fully functional cluster before mastering all the skills one will eventually need. In the long run, one will want to master those packages in OSCAR. OSCAR makes it very easy to experiment with packages and dramatically lowers the barrier to getting started.

OSCAR is designed with high-performance computing in mind. So, unless one customizes the installation, the compute nodes are meant to be dedicated to the cluster.

4.2 Packages

OSCAR brings together a number of software packages for clustering. Most of the packages are also available as standalone packages. The main packages in OSCAR are:

Core: This is the core OSCAR package.

C3: The Cluster Command and Control tool suite provides a command-line administration interface. For example, “cexec mkdir /opt/c3-4” will create the directory on all clients. Similarly, we can use commands like cget, ckill, cpush, crm and cshutdown.

Environment Switcher: This is based on Modules, a Perl script that allows the user to make changes to the environment of future shells. For example, Switcher allows a user to change between MPICH and LAM/MPI. “switcher mpi = mpich-ch-p4-gcc-1.2.5.10” can be used to set the execution environment needed for the MPICH library.

SIS: The System Installation Suite is used to install the operating systems on the clients.

Monitoring systems: Clumon, a web-based performance-monitoring system, and Ganglia, a real-time monitoring system and execution environment, are the packages included in this category.

MAUI: This job scheduler is used with openPBS.


openPBS: The Portable Batch System is a workload management system.

PVFS: The Parallel Virtual File System is a high-performance, scalable, parallel virtual file system.

Any high-performance cluster would be incomplete without programming tools. The OSCAR distribution includes:

LAM/MPI: This is one implementation of the Message Passing Interface (MPI) libraries.

MPICH: This is another implementation of the Message Passing Interface (MPI) libraries.

PVM: This package provides the Parallel Virtual Machine system, another message passing library.

4.3 Installation Strategy

With OSCAR, one first installs Linux (but only on the head node) and then installs OSCAR. The installations of the two are separate. This makes the installation more involved, but it gives us more control over the configuration of the system, and it is somewhat easier to recover when we encounter installation problems. And because the OSCAR installation is separate from the Linux installation, we are not tied to a single Linux distribution.

OSCAR uses a system image cloning strategy to distribute the disk image to the compute nodes. With OSCAR it is best to use the same hardware throughout the cluster. OSCAR's thin client model is designed for diskless systems.


Chapter 5

Parallel Programming

5.1 Hello World

It is customary to start learning a new programming language with a “Hello world” program, so here too we will start with a “Hello World” program. The parallel version of the program (using LAM/MPI) is as follows:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int processId;    /* rank of process */
    int noProcesses;  /* number of processes */
    int nameSize;     /* length of name */
    char computerName[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
    MPI_Comm_rank(MPI_COMM_WORLD, &processId);
    MPI_Get_processor_name(computerName, &nameSize);

    fprintf(stderr, "Hello from process %d on %s\n", processId, computerName);

    MPI_Finalize();
    return 0;
}

This example introduces five MPI functions, declared in the header file for the MPI library, mpi.h, and made available when the MPI library is linked to the program. While this example uses C, similar libraries are available for C++ and FORTRAN.

Four of these functions, MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize, are seen in virtually every MPI program.

5.1.1 MPI_Init

MPI_Init is used to initialize an MPI session. All MPI programs must have a call to MPI_Init. MPI_Init is called once, typically at the start of a program. One can have lots of other code before this call, or one can even call MPI_Init from a subroutine, but one should call it before any other MPI functions are called. (There is an exception: the function MPI_Initialized can be called before MPI_Init. MPI_Initialized is used to see if MPI_Init has been previously called.) In C, MPI_Init can be called with the addresses of argc and argv, as shown in the example. This allows the program to take advantage of command-line arguments. Alternatively, these addresses can be replaced with NULL.
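For example, a minimal sketch (not part of the report's program) of an MPI program that ignores the command-line arguments could look like this:

#include "mpi.h"

int main(void)
{
    /* Initialize MPI without forwarding argc/argv; NULL is permitted here. */
    MPI_Init(NULL, NULL);

    /* ... other MPI calls would go here ... */

    MPI_Finalize();
    return 0;
}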

5.1.2 MPI_Finalize

MPI_Finalize is called to shut down MPI. MPI_Finalize should be the last MPI call made in a program. It is used to free memory, etc. It is the user's responsibility to ensure that all pending communications are complete before a process calls MPI_Finalize. Every process must call MPI_Finalize.

5.1.3 MPI_Comm_size

This routine is used to determine the total number of processes running in a communicator (the communications group for the processes being used). It takes the communicator as its first argument and the address of an integer variable used to return the number of processes. For example, if one is executing a program using five processes and the default communicator, the value returned by MPI_Comm_size will be five, the total number of processes being used. This is the number of processes, but not necessarily the number of machines being used.

In the hello world program, both MPI_Comm_size and MPI_Comm_rank used the default communicator, MPI_COMM_WORLD. This communicator includes all the processes available at initialization and is created automatically. Communicators are used to distinguish and group messages. As such, communicators provide a powerful encapsulation mechanism. While it is possible to create and manipulate one's own communicators, the default communicator will probably satisfy most needs.

5.1.4 MPI_Comm_rank

MPI_Comm_rank is used to determine the rank of the current process within the communicator. MPI_Comm_rank takes a communicator as its first argument and the address of an integer variable used to return the value of the rank.

Basically, each process is assigned a different process number, or rank, within a communicator. Ranks range from 0 to one less than the size returned by MPI_Comm_size. For example, if one is running a set of five processes, the individual processes will be numbered 0, 1, 2, 3, and 4. By examining its rank, a process can distinguish itself from other processes. The values returned by MPI_Comm_size and MPI_Comm_rank are often used to divide up a problem among processes. Each individual process can examine its rank to determine its role in the calculation. For example, the process with rank 0 might work on the first part of the problem, the process with rank 1 on the second part, and so on. One can also divide up the problem differently. For example, the process with rank 0 might collect all the results from the other processes for the final report rather than participate in the actual calculation. It is really up to the programmer to determine how to use this information.

5.1.5 MPI_Get_processor_name

MPI_Get_processor_name is used to retrieve the host name of the node on which the individual process is running. In the sample program, we used it to display host names. The first argument is an array to store the name and the second is used to return the actual length of the name.

Each of the C versions of these five functions returns an integer error code. With a few exceptions, the actual code is left up to the implementers. Error codes can be translated into meaningful messages using the MPI_Error_string function.
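As an illustration (this fragment is not taken from the report's code), the sketch below checks the value returned by MPI_Comm_size and prints the corresponding message. Note that the communicator's error handler must first be set to MPI_ERRORS_RETURN, because by default MPI aborts on error instead of returning a code:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rc, msgLen, noProcesses;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes rather than aborting the job. */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &msgLen);
        fprintf(stderr, "MPI_Comm_size failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}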

Here is an example of compiling and running the code:

[deepak@ssl1 ~]$ mpicc hello.c -o hello
[deepak@ssl1 ~]$ mpirun -np 5 hello
Hello from process 0 on ssl1
Hello from process 1 on ssl3.deltaforce
Hello from process 2 on ssl1
Hello from process 3 on ssl3.deltaforce
Hello from process 4 on ssl1


There is no apparent order in the output. The order will depend on the speed of the individual machines, the loads on the machines, and the speeds of the communications links. Unless one takes explicit measures to control the order of execution among processors, one should not make assumptions about the order of execution.

When running the program, the user specifies the number of processes on the command line. MPI_Comm_size provides a way to get that information back into the program. If one wants to use a different number of processes next time, one just changes the command line and the code takes care of the rest.

5.2 Numerical Integration

This is a fairly standard problem for introducing parallel calculations because it can be easily decomposed into parts that can be shared among the computers in a cluster. Although in most cases it can be solved quickly on a single processor, the parallel solution illustrates all the basics one needs to get started writing MPI code. The reason this area problem is both interesting and commonly used is that it is very straightforward to subdivide: we can let different computers calculate the areas of different rectangles. Basically, MPI_Comm_size and MPI_Comm_rank are used to divide the problem among processors. MPI_Send is used to send the intermediate results back to the process with rank 0, which collects the results with MPI_Recv and prints the final answer. Here is the program:

#include "mpi.h"
#include <stdio.h>

/* problem parameters */
#define f(x) ((x) * (x))
#define numberRects 50
#define lowerLimit 2.0
#define upperLimit 5.0

int main(int argc, char *argv[])
{
    /* MPI variables */
    int dest, noProcesses, processId, src, tag;
    MPI_Status status;

    /* problem variables */
    int i;
    double area, at, height, lower, width, total, range;

    /* MPI setup */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
    MPI_Comm_rank(MPI_COMM_WORLD, &processId);

    /* adjust problem size for subproblem */
    range = (upperLimit - lowerLimit) / noProcesses;
    width = range / numberRects;
    lower = lowerLimit + range * processId;

    /* calculate area for subproblem */
    area = 0.0;
    for (i = 0; i < numberRects; i++)
    {
        at = lower + i * width + width / 2.0;
        height = f(at);
        area = area + width * height;
    }

    /* collect information and print results */
    tag = 0;
    if (processId == 0)        /* if rank is 0, collect results */
    {
        total = area;
        for (src = 1; src < noProcesses; src++)
        {
            MPI_Recv(&area, 1, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &status);
            total = total + area;
        }
        fprintf(stderr, "The area from %f to %f is: %f\n",
                lowerLimit, upperLimit, total);
    }
    else                       /* all other processes only send */
    {
        dest = 0;
        MPI_Send(&area, 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    }

    /* finish */
    MPI_Finalize();
    return 0;
}


In this example, we are calculating the area under the curve y = x^2 between x = 2 and x = 5. Since each process only needs to do part of this calculation, we need to divide the problem among the processes so that each process gets a different part and all the parts are accounted for. MPI_Comm_size is used to determine the number of parts the problem will be broken into, noProcesses. That is, we divide the total range (2 to 5) equally among the processes and adjust the start of the range for an individual process based on its rank. In the next section of code, each process calculates the area for its part of the problem. Then we need to collect and combine all the individual results. One process will act as a collector to which the remaining processes will send their results. Using the process with rank 0 as the receiver is the logical choice; the remaining processes act as senders (a collective alternative is sketched at the end of this section). A fair amount of MPI code development can be done on a single processor system and then moved to a multiprocessor environment.
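As an aside, MPI also provides collective operations that perform this kind of reduction in a single call. The fragment below is an alternative sketch, not the version used in this report; it could replace the explicit send/receive loop in the "collect information and print results" section of the program above:

/* Sum every process's "area" into "total" on rank 0 with one collective call. */
MPI_Reduce(&area, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (processId == 0)
    fprintf(stderr, "The area from %f to %f is: %f\n",
            lowerLimit, upperLimit, total);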


Chapter 6

Molecular Dynamics Modeling of Thermal Conductivity of Engineering Fluids and its Enhancement due to Nanoparticle Inclusion

In the present study, an attempt is made to estimate the enhancement of the thermal conductivity of water by the suspension of nanoparticles, using the method of Molecular Dynamics simulation. This involves generating the atomic trajectories of a system of a finite number of particles by direct integration of the classical Newton's equations of motion, with appropriate specification of interatomic potentials and application of suitable initial and boundary conditions. Initially, a general simulation procedure is developed for the thermal conductivity of a liquid and the procedure is validated against standard values. A few of the plausible theoretical models for the thermal conductivity of nanofluids have been selected and studied. Algorithms are made for simulating them, abiding by the procedural steps of the Molecular Dynamics method. The thermal conductivity enhancements in the base fluid due to the suspension of nanoparticles, estimated using simulations of the models considered, are compared among themselves and with existing experimental results, and are further investigated to select the most appropriate model, the one which matches best with a practical case of interest (metal oxide - water system). Parametric studies are conducted to study the variation of thermal conductivity enhancement with temperature, and the optimal dosing levels of nanofluids are also investigated. Further, an optimization of the simulation procedure and algorithms is attempted to bring out an efficient computation strategy.

6.1 Algorithm

main()
{
    initializePositions();
    initializeVelocities();
    for (time = 0; time < totalTimeSteps; time++)
    {
        velocityVerlet(dt);
        instantaneousTemperature();
        if (time % 200 == 0)
            rescaleVelocities();
        if (time > equilibrationTime)
            jt();
    }
    thermalConductivity();
}

velocityVerlet(dt)
{
    computeAccelerations();   // at time 't'
    updatePositions();        // function of a[t]
    computeAccelerations();   // at time 't+dt'
    updateVelocities();       // function of a[t] and a[t+dt]
}
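The velocityVerlet() step above corresponds to the standard velocity Verlet update. A minimal one-dimensional C sketch is given below; the particle count N, the arrays x, v and a, and the computeAccelerations() routine are illustrative names and are not taken from the report's actual code:

#define N 256                      /* illustrative particle count */

double x[N], v[N], a[N];           /* positions, velocities, accelerations */
void computeAccelerations(void);   /* fills a[] from the current x[] */

void velocityVerlet(double dt)
{
    int i;
    double aOld[N];

    computeAccelerations();                          /* a(t) */
    for (i = 0; i < N; i++) {
        x[i] += v[i] * dt + 0.5 * a[i] * dt * dt;    /* x(t+dt) */
        aOld[i] = a[i];                              /* remember a(t) */
    }
    computeAccelerations();                          /* a(t+dt), from new positions */
    for (i = 0; i < N; i++)
        v[i] += 0.5 * (aOld[i] + a[i]) * dt;         /* v(t+dt) */
}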


6.2 Profiling

It is generally said that a typical program will spend over 90% of its execution time in less than 10% of the actual code. This is just a rule of thumb, or heuristic, and as such will be wildly inaccurate or totally irrelevant for some programs. But for many, if not most, programs it is a reasonable observation. The actual numbers don't matter, since they will change from program to program. It is the idea that is important: for most programs, most of the execution time is spent in a very small portion of the code.

If the application spends 95% of its time in 5% of the code, there is little to be gained by optimizing the other 95% of the code. Even if one could completely eliminate it, one would only see a 5% improvement. But if one can manage a 10% improvement in the critical 5% of the code, for example, we will see a 9.5% overall improvement in the program. Thus, the key to improving the code's performance is to identify that crucial 5%. That is the region where one should spend one's time optimizing the code.
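To see where the 9.5% figure comes from: if the critical region accounts for 0.95T of a total running time T and is made 10% faster, the new running time is 0.05T + 0.9 × 0.95T = 0.905T, i.e. a 9.5% overall reduction.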

There is a point of diminishing returns when optimizing code. We will need to balance the amount of time we spend optimizing code against the amount of improvement we actually get. There is a point where the code is good enough. The goals of profiling are two-fold: to decide how much optimization is worth doing, and to identify which parts of the code should be optimized.

For serial algorithms, one can often make reasonable estimates of how time is being spent simply by examining and analyzing the algorithm. The standard approach characterizes performance using some measurement of the problem size. Since the problem size often provides a bound for algorithmic performance, this approach is sometimes called asymptotic analysis.

Asymptotic analysis can be problematic with parallel programs for several reasons. First, it may be difficult to estimate the cost of the communications required by a parallel solution. This can be further complicated by the need for additional code to coordinate communications among the processors. Second, there is often a less than perfect overlap among the communicating processes. A processor may be idle while it waits for its next task. In particular, it may be difficult to predict when a processor will be idle and what effect this will have on overall performance. For these and other reasons, an empirical approach to estimating performance is often preferred for parallel programs. That is, we directly measure the performance of existing programs.

Thus, with parallel programs, the most appropriate strategy is to select the best algorithm one can and then empirically verify its actual performance.

The profile of the unoptimized code is as follows:

[delta16@athena project]$ time -p ./green_without_prof
Thermal Conductivity : 0.000922671
real 378.63
user 378.61
sys 0.00

  %    cumulative    self                self     total
 time    seconds    seconds     calls   ms/call  ms/call  name
49.31     398.03     398.03     40000      9.95     9.95  Accelerations()
34.33     675.12     277.09   1620000      0.17     0.17  lj(int)
15.88     803.34     128.21     15000      8.55    27.09  jt()
 0.47     807.09       3.75     20000      0.19    20.09  Verlet(double)
 0.13     808.10       1.01   1620000      0.00     0.17  ei(int)
 0.00     808.14       0.04       201      0.20     0.20  rescaleVelocities()
<some output removed for brevity>

From this profile we can find that there is scope for improvement in functions like Accelerations, lj and jt, because they account for more than 99% of the execution time. It can be noticed that a single call to the lj function takes very little time to execute, but since it is executed a large number of times, even a slight improvement in that code will heavily affect the execution time. It can be seen from the code that the functions jt and Accelerations both have a complexity of O(N^2). So the work done by this part of the code can be split among the individual nodes.
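The report does not reproduce the parallelized source, but the general idea can be sketched as follows. Assuming Accelerations() loops over all pairs of particles, each MPI rank can compute the contributions for a strided subset of particles, and the partial results can then be combined with MPI_Allreduce. All names in this sketch (N, x, a, pairForce) are hypothetical:

#include "mpi.h"

#define N 256                              /* illustrative particle count */

double x[N];                               /* particle positions (1-D for brevity) */
double a[N];                               /* accelerations, rebuilt every step */

static double pairForce(int i, int j)      /* placeholder pair interaction */
{
    return x[j] - x[i];                    /* stand-in for a real force law */
}

void Accelerations(int processId, int noProcesses)
{
    int i, j;
    double aLocal[N] = {0.0};

    /* Each rank handles particles i = rank, rank + P, rank + 2P, ... */
    for (i = processId; i < N; i += noProcesses)
        for (j = 0; j < N; j++)
            if (j != i)
                aLocal[i] += pairForce(i, j);

    /* Combine the partial arrays so every rank ends up with the full a[]. */
    MPI_Allreduce(aLocal, a, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}

Each rank then holds the complete acceleration array, so the remainder of the time step can proceed unchanged.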

The profile of the optimized code is as follows:

[delta16@athena project]$ mpirun -np 4 time -p ./mpi_green_pg
Thermal Conductivity : 0.000922359
real 155.74
user 134.28
sys 1.70
<some output removed for brevity>

  %    cumulative    self                self     total
 time    seconds    seconds     calls   ms/call  ms/call  name
54.35     182.83     182.83     40000      4.57     4.57  Accelerations()
22.37     258.11      75.27    405000      0.19     0.19  lj(int)
19.61     324.08      65.97     15000      4.40     9.45  jt()
 2.39     332.12       8.04     20000      0.40     9.54  Verlet(double)
<some output removed for brevity>

The comparison between the unoptimized and the optimized code is shown below.

[Figure: bar charts of Total Time, self ms/call and No of Calls, before and after optimization, for the profiled functions Accelerations(), lj(int) and jt().]


With this optimized code we get a speedup of 2.43 on a single-node (4 CPU) cluster. The error in the computation, of the order of 10^-7, is due to floating point arithmetic. Maximum speedup was achieved on dual processor machines by using shared memory instead of message passing. When the program was run on a cluster with two nodes (4 CPUs), the speedup achieved was in the range of 1.7 to 1.9.


Chapter 7

Testing and Benchmarking

Once the cluster is running, one needs to run a benchmark or two just to see how well it performs.

There are three main reasons for running benchmarks. First, a benchmark provides a baseline: if we make changes to the cluster, or if we suspect problems with it, we can rerun the benchmark to see if performance has really changed. Second, benchmarks are useful when comparing systems or cluster configurations; they can provide a reasonable basis for selecting between alternatives. Finally, benchmarks can be helpful with planning: if we can run several benchmarks with differently sized clusters, we should be able to make better estimates of the impact of scaling the cluster.

High Performance Linpack, Hierarchical Integration (HINT) and the NAS Parallel Benchmarks are some common benchmarks available for clusters.

As a yardstick of performance we are using the ‘best’ performance as measured by the LINPACK Benchmark. LINPACK was chosen because it is widely used and performance numbers are available for almost all relevant systems.

LINPACK is a software library for performing numerical linear algebra on digital computers. LINPACK makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations. The LINPACK Benchmark is based on LINPACK and represents a measure of a system's floating point computing power. It measures how fast a computer solves a dense n by n system of linear equations Ax = b, a common task in engineering. The solution is based on Gaussian elimination with partial pivoting, with (2/3)n^3 + 2n^2 floating point operations. The result is reported in millions of floating point operations per second (Mflop/s).

For large scale distributed memory systems, the performance of a portable implementation of the High-Performance Linpack Benchmark is used as a performance measure for ranking supercomputers in the TOP500 list of the world's fastest computers. This performance does not reflect the overall performance of a given system, as no single number ever can. It does, however, reflect the performance of a dedicated system for solving a dense system of linear equations. Since the problem is very regular, the performance achieved is quite high, and the performance numbers give a good indication of peak performance. When the High Performance Linpack benchmark was run on the cluster, it gave a peak performance of 1.380e-01 GFLOPS.
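As a rough illustration of what this number means: for a hypothetical problem of order n = 1000, the benchmark would perform about (2/3)(1000)^3 + 2(1000)^2 ≈ 6.69e8 floating point operations, which at 1.380e-01 GFLOPS (1.38e8 floating point operations per second) takes roughly 4.8 seconds.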

The following figure shows the relationship between the execution time and the number of available processors.

[Figure: execution time versus the number of available processors.]

Chapter 8

Conclusion

The basic objective of this project was to set up a high performance computational cluster with special concern for molecular modelling. By using a cluster kit such as OSCAR, the first phase of the project, setting up a high performance cluster, could be completed. With the help of message passing libraries like LAM/MPI, the second phase of the project, improving the performance of a molecular modelling problem, was completed. The molecular modelling problem, which is considered to be a standard benchmark, gave a remarkable speedup of the order of 2.0 when tested on a two node cluster. This increased efficiency came at the expense of higher code complexity. Finally, the testing phase was completed using the High Performance Linpack benchmark, which showed a peak performance of 1.380e-01 GFLOPS. Future work includes the formal analysis of existing code for automated conversion to a parallel version for cluster implementation.

