
Design and Implementation of High

Performance Computing Cluster for

Educational Purpose

Dissertation

submitted in partial fulfillment of the requirements

for the degree of

Master of Technology, Computer Engineering

by

SURAJ CHAVAN

Roll No: 121022015

Under the guidance of

PROF. S. U. GHUMBRE

Department of Computer Engineering and Information Technology

College of Engineering, Pune

Pune - 411005.

June 2012


Dedicated to

My Mother

Smt. Kanta Chavan


DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY,

COLLEGE OF ENGINEERING, PUNE

CERTIFICATE

This is to certify that the dissertation titled

Design and Implementation of High Performance Computing Cluster for Educational Purpose

has been successfully completed

By

SURAJ CHAVAN

(121022015)

and is approved for the degree of

Master of Technology, Computer Engineering.

PROF. S. U. GHUMBRE,
Guide,
Department of Computer Engineering
and Information Technology,
College of Engineering, Pune,
Shivaji Nagar, Pune-411005.

DR. JIBI ABRAHAM,
Head,
Department of Computer Engineering
and Information Technology,
College of Engineering, Pune,
Shivaji Nagar, Pune-411005.

Date :


Abstract

This project work confronts the issue of bringing high performance computing (HPC) education to those who do not have access to a dedicated clustering environment, in an easy, fully functional and inexpensive manner, through the use of ordinary old PCs, Fast Ethernet, and free and open source software such as Linux, MPICH, Torque and Maui. Many undergraduate institutions in India do not have the facilities, time, or money to purchase hardware, maintain user accounts, configure software components, and keep ahead of the latest security advisories for a dedicated clustering environment. The project's primary goal is to provide an instantaneous, distributed computing environment. A consequence of providing such an environment is the ability to promote the education of high performance computing issues at the undergraduate level, by turning ordinary off-the-shelf networked computers into a non-invasive, fully functional cluster. The cluster is used to solve problems which require a high degree of computation, such as the satisfiability problem for Boolean circuits, the Radix-2 FFT algorithm, the one-dimensional time dependent heat equation, and others. The cluster is also benchmarked using High Performance Linpack and the HPCC benchmark suite. This cluster can be used for research on data mining applications with large data sets, object-oriented parallel languages, recursive matrix algorithms, network protocol optimization, graphical rendering, Fast Fourier transforms, building the college's private cloud, and so on. Using this cluster, students and faculty will receive extensive experience in the configuration, troubleshooting, utilization, debugging and administration issues uniquely associated with parallel computing on such a cluster. Several students and faculty can use it for their project and research work in the near future.


Acknowledgments

It is a great pleasure for me to acknowledge the assistance and contribution of a number of individuals who helped me in my project titled Design and Implementation of HPCC for Educational Purpose.

First and foremost, I would like to express my deepest gratitude to my guide, Prof. S. U. Ghumbre, who has encouraged, supported and guided me during every step of the project. Without his invaluable advice, completion of this project would not have been possible. I take this opportunity to thank our Head of Department, Prof. Dr. Jibi Abraham, for her able guidance and for providing all the necessary facilities, which were indispensable in the completion of this project. I am also thankful to the staff of the Computer Engineering Department for their invaluable suggestions and advice. I thank the college for providing the required magazines, books and access to the Internet for collecting information related to the project.

I am thankful to Dr. P. K. Sinha, Senior Director HPC, C-DAC, Pune, for granting me permission to study C-DAC's PARAM Yuva facility. I am also thankful to Dr. Sandeep Joshi, Mr. Rishi Pathak and Mr. Vaibhav Pol of the PARAM Yuva Supercomputing facility, C-DAC, Pune, for their continuous encouragement and support throughout the course of this project.

Last, but not the least, I am also grateful to my friends for their valuable comments and suggestions.


Contents

Abstract iii

Acknowledgments iv

List of Figures vi

1 Introduction 1

1.1 High Performance Computing . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 Types of HPC architectures . . . . . . . . . . . . . . . . . . 2

1.1.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Characteristics and features of clusters . . . . . . . . . . . . . . . . 4

1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 5

1.3.2 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Literature Survey 6

2.1 HPC opportunities in Indian Market . . . . . . . . . . . . . . . . . 6

2.2 HPC at Indian Educational Institutes . . . . . . . . . . . . . . . . . 6

2.3 C-DAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.3.1 C-DAC and HPC . . . . . . . . . . . . . . . . . . . . . . . . 7

2.4 PARAM Yuva . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.5 Grid Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.5.1 GARUDA: The National Grid Computing Initiative of India 10

2.5.2 Garuda: Objectives . . . . . . . . . . . . . . . . . . . . . . . 11

2.6 Flynn’s Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.7 Single Program, Multiple Data (SPMD) . . . . . . . . . . . . . . . 13

2.8 Message Passing and Parallel Programming Protocols . . . . . . . . 14

2.8.1 Message Passing Models . . . . . . . . . . . . . . . . . . . . 14

2.9 Speedup and Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . 18


2.9.1 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.9.2 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.9.3 Factors affecting performance . . . . . . . . . . . . . . . . . 19

2.9.4 Amdahl’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.10 Maths Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.11 HPL Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.11.1 Description of the HPL.dat File . . . . . . . . . . . . . . . . 25

2.11.2 Guidelines for HPL.dat configuration . . . . . . . . . . . . . 30

2.12 HPCC Challenge Benchmark . . . . . . . . . . . . . . . . . . . . . 32

3 Design and Implementation 35

3.1 Beowulf Clusters: A Low cost alternative . . . . . . . . . . . . . . . 35

3.2 Logical View of proposed Cluster . . . . . . . . . . . . . . . . . . . 36

3.3 Hardware Configuration . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3.1 Master Node . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3.2 Compute Nodes . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.3.3 Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.4 Softwares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.4.1 MPICH2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.4.2 HYDRA: Process Manager . . . . . . . . . . . . . . . . . . . 44

3.4.3 TORQUE: Resource Manager . . . . . . . . . . . . . . . . . 44

3.4.4 MAUI: Cluster Scheduler . . . . . . . . . . . . . . . . . . . . 45

3.5 System Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 46

4 Experiments 48

4.1 Finding Prime Numbers . . . . . . . . . . . . . . . . . . . . . . . . 48

4.2 PI Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.3 Circuit Satisfiability Problem . . . . . . . . . . . . . . . . . . . . . 50

4.4 1D Time Dependent Heat Equation . . . . . . . . . . . . . . . . . . 51

4.4.1 The finite difference discretization . . . . . . . . . . . . . . . 51

4.4.2 Using MPI to compute the solution . . . . . . . . . . . . . . 53

4.5 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.5.1 Radix-2 FFT algorithm . . . . . . . . . . . . . . . . . . . . 54

4.6 Theoretical Peak Performance . . . . . . . . . . . . . . . . . . . . . 55

4.7 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.8 HPL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.8.1 HPL Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.8.2 Run HPL on cluster . . . . . . . . . . . . . . . . . . . . . . 58


4.8.3 HPL results . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.9 Run HPCC on cluster . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.9.1 HPCC Results . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5 Results and Applications 63

5.1 Discussion on Results . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.1.1 Observations about Small Tasks . . . . . . . . . . . . . . . . 63

5.1.2 Observations about Larger Tasks . . . . . . . . . . . . . . . 63

5.2 Factors affecting Cluster performance . . . . . . . . . . . . . . . . . 64

5.3 Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.4 Challenges of parallel computing . . . . . . . . . . . . . . . . . . . . 65

5.5 Common applications of high-performance computing clusters . . . 67

6 Conclusion and Future Work 69

6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

Bibliography 71

Appendix A PuTTy 74

A.1 How to use PuTTY to connect to a remote computer . . . . . . . . 74

A.2 PSCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

A.2.1 Starting PSCP . . . . . . . . . . . . . . . . . . . . . . . . . 76

A.2.2 PSCP Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


List of Figures

1.1 Basic Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 Evolution of PARAM Supercomputers & HPC Roadmap . . . . . . 8

2.2 Block Diagram of PARAM Yuva . . . . . . . . . . . . . . . . . . . . 9

2.3 Single Instruction, Single Data stream (SISD) . . . . . . . . . . . . 12

2.4 Single Instruction, Multiple Data streams (SIMD) . . . . . . . . . . 12

2.5 Multiple Instruction, Single Data stream (MISD) . . . . . . . . . . 13

2.6 Multiple Instruction, Multiple Data streams (MIMD) . . . . . . . . 13

2.7 General MPI Program Structure . . . . . . . . . . . . . . . . . . . . 17

2.8 Speedup of a program using multiple processors . . . . . . . . . . . 21

3.1 The Schematic structure of proposed cluster . . . . . . . . . . . . . 35

3.2 Logical view of proposed cluster . . . . . . . . . . . . . . . . . . . . 36

3.3 The Network interconnection . . . . . . . . . . . . . . . . . . . . . . 38

4.1 Graph showing performance for Finding Primes . . . . . . . . . . . 49

4.2 Graph showing performance for Calculating π . . . . . . . . . . . . 50

4.3 Graph showing performance for solving C-SAT Problem . . . . . . . 51

4.4 Graph showing performance for solving 1D Time Dependent Heat Equation . . . . . . . . . . . 52

4.5 Symbolic relation between four nodes . . . . . . . . . . . . . . . . . 52

4.6 Graph showing performance of Radix-2 FFT algorithm . . . . . . . . 54

4.7 8-point Radix-2 FFT: Decimation in frequency form . . . . . . . . . 55

4.8 Graph showing High Performance Linpack (HPL) Results . . . . . . 60

5.1 Application Perspective of Grand Challenges . . . . . . . . . . . . . 67

A.1 Putty GUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

A.2 Putty Security Alert . . . . . . . . . . . . . . . . . . . . . . . . . . 75

A.3 Putty Remote Login Screen . . . . . . . . . . . . . . . . . . . . . . 76


Chapter 1

Introduction

HPC is a collection or cluster of connected, independent computers that work in unison to solve a problem. In general, the machines are tightly coupled at one site, connected by InfiniBand or some other high-speed interconnect technology. With HPC, the primary goal is to crunch numbers, not to sort data. It demands specialized program optimizations to get the most from a system in terms of input/output, computation, and data movement. And the machines all have to trust each other because they are shipping information back and forth.

The development of new materials and production processes based on high technologies requires the solution of increasingly complex computational problems. However, even as computer power, data storage, and communication speed continue to improve exponentially, available computational resources are often failing to keep up with what users demand of them. Therefore, high-performance computing (HPC) infrastructure becomes a critical resource for research and development as well as for many business applications. Traditionally, HPC applications were oriented towards the use of high-end computer systems, so-called "supercomputers".

1.1 High Performance Computing

High Performance Computing (HPC) allows scientists and engineers to deal with very complex problems using fast computer hardware and specialized software. Since these problems often require hundreds or even thousands of processor hours to complete, an approach based on the use of supercomputers has traditionally been adopted. The recent tremendous increase in the speed of PC-type computers opens a relatively cheap and scalable solution for HPC using cluster technologies.

Linux clustering is popular in many industries these days. With the advent of clustering technology and the growing acceptance of open source software, supercomputers can now be created for a fraction of the cost of traditional high-performance machines.

Cluster operating systems divide the tasks amongst the available systems. Clusters of systems or workstations, on the other hand, connect a group of systems together to jointly share a critically demanding computational task. Theoretically, a cluster operating system should provide seamless optimization in every case.

At the present time, cluster server and workstation systems are mostly used in high availability applications and in scientific applications such as numerical computations.

1.1.1 Types of HPC architectures

Most HPC systems use the concept of parallelism. Many software platforms are

oriented for HPC, but first let’s look at the hardware aspects. HPC hardware falls

into three categories:

• Symmetric multiprocessors (SMP)

• Vector processors

• Clusters

Symmetric multiprocessors (SMP)

SMP is a type of HPC architecture in which multiple processors share the same memory. (In clusters, also known as massively parallel processors (MPPs), the processors do not share the same memory.) SMPs are generally more expensive and less scalable than MPPs.

Vector processors

In vector processors, the CPU is optimized to perform well with arrays or vectors;

hence the name. Vector processor systems deliver high performance and were

the dominant HPC architecture in the 1980s and early 1990s, but clusters have

become far more popular in recent years.

Clusters

Clusters are the predominant type of HPC hardware these days; a cluster is a set of MPPs. A processor in a cluster is commonly referred to as a node and has its own CPU, memory, operating system, and I/O subsystem, and is capable of communicating with other nodes. These days it is common to use a commodity workstation running Linux and other open source software as a node in a cluster.

Clustering is the use of multiple computers, typically PCs or UNIX workstations, multiple storage devices, and redundant interconnections, to form what appears to users as a single, highly available system. Cluster computing can be used for load balancing and high performance computing as well as for high availability. It is used as a relatively low-cost form of parallel processing machine for scientific and other applications that lend themselves to parallel operations. Figure 1.1 illustrates a basic cluster.

Figure 1.1: Basic Cluster

Computer cluster technology puts clusters of systems together to provide better system reliability and performance. Cluster server systems connect a group of systems together in order to jointly provide processing service for the clients in the network.

1.1.2 Clustering

The term ”cluster” can take different meanings in different contexts. This section

focuses on three types of clusters:

• Fail-over clusters

• Load-balancing clusters

• High-performance clusters


Fail-over clusters

The simplest fail-over cluster has two nodes: one stays active and the other stays

on stand-by but constantly monitors the active one. In case the active node goes

down, the stand-by node takes over, allowing a mission-critical system to continue

functioning.

Load-balancing clusters

Load-balancing clusters are commonly used for busy Web sites where several nodes

host the same site, and each new request for a Web page is dynamically routed to

a node with a lower load.

High-performance clusters

These clusters are used to run parallel programs for time-intensive computations and are of special interest to the scientific community. They commonly run simulations and other CPU-intensive programs that would take an inordinate amount of time to run on regular hardware.

1.2 Characteristics and features of clusters

1. Very high performance-price ratio.

2. Recycling possibilities of the hardware components.

3. Guarantee of usability/upgradeability in the future.

4. Clusters are built using commodity hardware and cost a fraction of the price of vector processors. In many cases, the price is lower by more than an order of magnitude.

5. Clusters use a message-passing paradigm for communication, and programs

have to be explicitly coded to make use of distributed hardware.

6. Open source software components and Linux lead to lower software costs.

7. Clusters have a much lower maintenance cost (they take up less space, take

less power, and need less cooling).


1.3 Motivation

1.3.1 Problem Definition

A computer cluster is a group of linked computers working together closely, thus in many respects forming a single computer. High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems. The benefits of HPCC (High-Performance Computing Clusters) are availability, scalability and, to a lesser extent, investment protection and simple administration.

A portable and extensible parallel computing system has been built, with capability approaching that of a commercial high performance supercomputer, using general-purpose PCs, network facilities, and open source software such as Linux and MPI. To check the cluster's performance, the popular HPL (High Performance Linpack) benchmark and the HPCC benchmark suite are used.

1.3.2 Scope

Computing clusters provide a reasonably inexpensive method to aggregate computing power and dramatically cut the time needed to find answers in research that requires the analysis of vast amounts of data.

This HPCC can be used for research on object-oriented parallel languages, recursive matrix algorithms, network protocol optimization, graphical rendering, etc. It can also be used to create the college's own cloud and deploy cloud applications on it, which can be accessed from anywhere in the outside world with just the help of a web browser.

1.3.3 Objectives

The project's primary goal is to support an instantaneous, easily available distributed computing environment. A consequence of providing such an environment is the ability to promote the education of high performance computing issues at the undergraduate level, by turning ordinary off-the-shelf networked computers into a non-invasive, fully functional cluster. Using this cluster, students and teachers will be able to gain insight into configuration, utilization, troubleshooting, debugging, and administration issues uniquely associated with parallel computing in a live, easy to use clustering environment. Availability of such a system will encourage more and more students and faculty to use it for their project and research work.


Chapter 2

Literature Survey

2.1 HPC opportunities in Indian Market

While sectors such as education, R&D, biotechnology, and weather forecasting have taken a good lead, industries such as oil & gas are likely to catch up soon.

But challenges remain, largely on the application side; for instance, there is a need for more homegrown applications. Today, the bulk of the code is serial, running multiple instances of the same program. There is a genuine need to focus on code parallelization to leverage the true power of HPC. Also, the trend in HPC is toward packing more and more power into less and less footprint, and at the lowest possible price.

Getting people from diverse domains to share and collaborate on one platform is the other challenge facing HPC deployment.

2.2 HPC at Indian Educational Institutes

India has the potential to be a global technology leader. Indian industry is competing globally in various sectors of science and engineering. A critical issue for the future success of the state and Indian industry is the growth of engineering and research education in India. High performance computing power is key to scientific and engineering leadership, industrial competitiveness, and national security. Right now, the hardware and expertise needed for such systems is available with a few top-notch colleges like IISc, the IITs and a few other renowned institutes. But if we want to harness the true power of HPC, we have to make sure that such systems are available to each and every engineering college.


2.3 C-DAC

C-DAC was set up in 1988 with the explicit purpose of demonstrating India's HPC capability after the US government denied the import of technology for weather forecasting purposes. Since then, C-DAC's developments have mirrored the progress of HPC computing worldwide.

During the second mission, C-DAC introduced the Open Frame Architecture for cluster computing, culminating in the PARAM 10000 in 1998 and the 1 TF PARAM Padma in 2002.

Along with 60 installations worldwide, C-DAC now has two HPC facilities of its own: the 100 GF (GigaFlop) PARAM 10000 at the National Param Supercomputing Facility (NPSF) at Pune, and the 1 TF (TeraFlop) PARAM Padma at C-DAC's Terascale Supercomputing Facility (CTSF) at Bangalore. The indigenously built PARAM Padma debuted on the Top500 list of supercomputers at rank 171 in May 2003.

After the completion of PARAM Padma (1 TF peak computing power, subsequently upgraded by another 1 TF peak) in December 2002 and its dedication to the nation in June 2003, it was used extensively as a third party facility (CTSF) by a wide spectrum of users from academia, research labs and end-user agencies. In addition, C-DAC has been actively working since then to build its next generation HPC system (Param NG) and associated technology components. C-DAC commissioned the system called PARAM Yuva in November 2008. This system, with an Rmax (sustained performance) of 37.80 TF and an Rpeak (peak performance) of 54.01 TF, was ranked one hundred and ninth (109th) among the TOP500 systems, as per the analysis released in June 2009. The system is an intermediate milestone of C-DAC's HPC roadmap towards petaflop computing by 2012.

C-DAC has made significant contributions to the Indian HPC arena in terms of awareness (by means of training programmes), consultancy, skilled manpower and technology development, as well as through the deployment of systems and solutions for use by the scientific, engineering and business community.

2.3.1 C-DAC and HPC

C-DAC has taken the initiative in conducting national awareness programs in high performance computing for the scientific and engineering community, and welcomes the establishment of High Performance Computing Labs in all universities and colleges. This shall help in capacity building and act as computational research centres for scientific and academic programs, which will address and catalyse the impact of high quality engineering education and high-end computational work for the research community in the eastern region. It will also promote research and teaching by integrating leading edge, high performance computing and visualization for the faculties, students, graduates and post graduates of the institute, and will provide solutions to many of our most pressing national challenges.

Figure 2.1: Evolution of PARAM Supercomputers & HPC Roadmap

2.4 PARAM Yuva

The latest in the series is called PARAM Yuva, which was developed last year and was ranked 68th in the TOP500 list released in November 2008 at the Supercomputing Conference in Austin, Texas, United States. The system, according to C-DAC scientists, is an intermediate milestone of C-DAC's HPC road map towards achieving petaflops (million billion flops) computing speed by 2012.

As part of this, C-DAC has also set up a National PARAM Supercomputing Facility (NPSF) in Pune, where C-DAC is headquartered, to allow researchers access to HPC systems to address their compute-intensive problems. C-DAC's efforts in this strategically and economically important area have thus put India on the supercomputing map of the world along with select developed nations. As of 2008, 52 PARAM systems have been deployed in the country and abroad, eight of them at locations in Russia, Singapore, Germany and Canada.

The PARAM series of cluster computing systems is based on what is called the OpenFrame Architecture. PARAM Yuva, in particular, uses a high-speed 10 gigabits per second (Gbps) system area network called PARAM Net-3, developed indigenously by C-DAC over the last three years, as the primary interconnect. This HPC cluster system is built with nodes designed around a state-of-the-art x86 architecture based on quad-core processors. In all, PARAM Yuva, in its complete configuration, has 4,608 cores of Intel Xeon 73XX processors, called Tigerton, with a clock speed of 2.93 gigahertz (GHz). The system has a sustained performance of 37.8 Tflops and a peak speed of 54 Tflops.

Figure 2.2: Block Diagram of PARAM Yuva

A novel feature of PARAM Yuva is its reconfigurable computing (RC) capability, which is an innovative way of speeding up HPC applications by dynamically configuring hardware to a suite of algorithms or applications run on PARAM Yuva for the first time. The RC hardware essentially uses acceleration cards as external add-ons to boost speed significantly while saving on power and space. C-DAC is one of the first organisations to bring the concept of reconfigurable hardware resources to the country. C-DAC has not only implemented the latest RC hardware, it has also developed system software and hardware libraries to achieve appropriate accelerations in performance.

As C-DAC has been scaling different milestones in HPC hardware, it has also been developing HPC application software, providing end-to-end solutions in an HPC environment to different end-users in mission mode. In early January, C-DAC set up a supercomputing facility around a scaled-down version of PARAM Yuva at North-Eastern Hill University (NEHU) in Shillong, complete with all allied C-DAC technology components and application software.


2.5 Grid Computing

Grid computing is a term referring to the federation of computer resources from multiple administrative domains to reach a common goal. The grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. What distinguishes grid computing from conventional high performance computing systems such as cluster computing is that grids tend to be more loosely coupled, heterogeneous, and geographically dispersed. Although a grid can be dedicated to a specialized application, it is more common that a single grid will be used for a variety of different purposes. Grids are often constructed with the aid of general-purpose grid software libraries known as middleware.

Grid size can vary by a considerable amount. Grids are a form of distributed computing whereby a super virtual computer is composed of many networked, loosely coupled computers acting together to perform very large tasks. For certain applications, distributed or grid computing can be seen as a special type of parallel computing that relies on complete computers (with onboard CPUs, storage, power supplies, network interfaces, etc.) connected to a network (private, public or the Internet) by a conventional network interface, such as Ethernet. This is in contrast to the traditional notion of a supercomputer, which has many processors connected by a local high-speed computer bus.

2.5.1 GARUDA: The National Grid Computing Initiative of India

GARUDA is a collaboration of science researchers and experimenters on a nationwide grid of computational nodes, mass storage and scientific instruments that aims to provide the technological advances required to enable data and compute intensive science for the 21st century. One of GARUDA's most important challenges is to strike the right balance between research and the daunting task of deploying innovation into some of the most complex scientific and engineering endeavours being undertaken today.

Building a commanding position in grid computing is crucial for India. By allowing researchers to easily access supercomputer-level processing power and knowledge resources, grids will underpin progress in Indian science, engineering and business. The challenge facing India today is to turn technologies developed for researchers into industrial-strength business tools.

The Department of Information Technology (DIT), Government of India, has funded the Centre for Development of Advanced Computing (C-DAC) to deploy the nationwide computational grid 'GARUDA', which will connect 17 cities across the country in its Proof of Concept (PoC) phase, with an aim to bring "grid" networked computing to research labs and industry. GARUDA will accelerate India's drive to turn its substantial research investment into tangible economic benefits.

2.5.2 Garuda: Objectives

GARUDA aims at strengthening and advancing scientific and technological excellence in the area of Grid and Peer-to-Peer technologies. The strategic objectives of GARUDA are to:

• Create a test bed for the research and engineering of technologies, architectures, standards and applications in Grid Computing

• Bring together all potential research, development and user groups who can help develop a national initiative on Grid computing

• Create the foundation for the next generation grids by addressing long term research issues in the strategic areas of knowledge and data management, programming models, architectures, grid management and monitoring, problem solving environments, grid tools and services

The following key deliverables have been identified as important to achieving the GARUDA objectives:

• Grid tools and services to provide an integrated infrastructure to applications and higher-level layers

• A Pan-Indian communication fabric to provide seamless and high-speed access to resources

• Aggregation of resources including compute clusters, storage and scientific instruments

• Creation of a consortium to collaborate on grid computing and contribute towards the aggregation of resources

• Grid enablement and deployment of select applications of national importance requiring aggregation of distributed resources

To achieve the above objectives, GARUDA brings together a critical mass of well-established researchers from 45 research laboratories and academic institutions that have formulated an ambitious program of activities.

2.6 Flynn’s Taxonomy

The four classifications defined by Flynn are based upon the number of concurrent

instruction (or control) and data streams available in the architecture:

Single Instruction, Single Data stream (SISD)

A sequential computer which exploits no parallelism in either the instruction or data streams. A single control unit (CU) fetches a single instruction stream (IS) from memory. The CU then generates appropriate control signals to direct a single processing element (PE) to operate on a single data stream (DS), i.e. one operation at a time.

Figure 2.3: Single Instruction, Single Data stream (SISD)

Examples of SISD architecture are the traditional uniprocessor machines like a PC (currently manufactured PCs have multiple processors) or old mainframes.

Single Instruction, Multiple Data streams (SIMD)

Figure 2.4: Single Instruction, Multiple Data streams (SIMD)

A computer which exploits multiple data streams against a single instruction

stream to perform operations which may be naturally parallelized. For example,

an array processor or GPU.

Multiple Instruction, Single Data stream (MISD)

Multiple instructions operate on a single data stream. This is an uncommon architecture which is generally used for fault tolerance. Heterogeneous systems operate on the same data stream and must agree on the result.

Figure 2.5: Multiple Instruction, Single Data stream (MISD)

Examples include the Space Shuttle flight control computer.

Multiple Instruction, Multiple Data streams (MIMD)

Multiple autonomous processors simultaneously execute different instructions on different data. Distributed systems are generally recognized to be MIMD architectures, either exploiting a single shared memory space or a distributed memory space. A multi-core superscalar processor is an MIMD processor.

Figure 2.6: Multiple Instruction, Multiple Data streams (MIMD)

2.7 Single Program, Multiple Data (SPMD)

The proposed cluster mostly uses a variation of the MIMD category, namely SPMD. Multiple autonomous processors simultaneously execute the same program (but at independent points, rather than in the lockstep that SIMD imposes) on different data. SPMD is also sometimes expanded as "Single Process, Multiple Data", but the use of this terminology for SPMD is erroneous and should be avoided: SPMD is a parallel execution model and assumes multiple cooperating processes executing a program. SPMD is the most common style of parallel programming. The SPMD model and the term were proposed by Frederica Darema. A minimal sketch of the SPMD style is given below.
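As an illustration (a sketch written for this explanation, not code taken from the project), the following minimal C program shows the SPMD style: every process runs the same program text, and the rank reported by MPI decides which slice of the data each process handles. The problem size N and the simple sum are hypothetical.

    /* spmd_sketch.c - illustrative SPMD example.
       Every rank runs the same program but works on its own slice of the data. */
    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000   /* hypothetical total problem size */

    int main(int argc, char **argv)
    {
        int rank, size, i, lo, hi, chunk;
        double local_sum = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* The rank, not the program text, decides which data this process handles. */
        chunk = N / size;
        lo = rank * chunk;
        hi = (rank == size - 1) ? N : lo + chunk;

        for (i = lo; i < hi; i++)
            local_sum += (double)i;

        printf("rank %d of %d summed indices [%d, %d): %.0f\n",
               rank, size, lo, hi, local_sum);

        MPI_Finalize();
        return 0;
    }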

2.8 Message Passing and Parallel Programming Protocols

Message passing is a form of communication used in parallel computing, object-

oriented programming, and interprocess communication. In this model processes

or objects can send and receive messages (comprising zero or more bytes, complex

data structures, or even segments of code) to other processes. By waiting for

messages, processes can also synchronize.

Three protocols are presented here for parallel programming: one which has become the standard, one which used to be the standard, and one which some feel might be the next big thing. For a while, the parallel protocol war was being waged between PVM and MPI. By most everyone's account, MPI won. It is a highly efficient and easy to learn protocol that has been implemented on a wide variety of platforms. One criticism is that different implementations of MPI don't always talk to one another. However, most cluster install packages provide both of the two most common implementations (MPICH and LAM/MPI). If setting up a small cluster, choose freely between either; they both work well, and as long as the same version of MPI is on each machine, there is no need to rewrite any MPI code. MPI stands for Message Passing Interface. Basically, independent processes send messages to each other. Both LAM/MPI and MPICH simplify the process of starting large jobs on multiple machines. MPI is the most common and efficient parallel protocol in current use.

2.8.1 Message Passing Models

Message passing models for parallel computation have been widely adopted because of their similarity to the physical attributes of many multiprocessor architectures. Probably the most widely adopted message passing model is MPI. MPI, or Message Passing Interface, was released in 1994 after two years in the design phase. MPI's functionality is fairly straightforward. For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides more performance and flexibility. To ensure portability, it has a hierarchical structure based on which porting can be done at different levels. MPICH2 programs are written in C or Fortran and linked against the MPI libraries; C++ and Fortran 90 bindings are also supported. MPI applications run in a multiple-instruction multiple-data (MIMD) manner.

MPI

MPI provides a straightforward interface for writing software that can use multiple cores of a computer, and multiple computers in a cluster or nodes in a supercomputer. Using MPI, one can write code that uses all of the cores and all of the nodes in a multicore computer cluster, and that will run faster as more cores and more compute nodes become available.

MPI is a well-established, standard method of writing parallel programs. It was first released in 1992, and is currently on version 2.1.4.1. MPI is implemented as a library, which is available for nearly all computer platforms (e.g. Linux, Windows, OS X), and with interfaces for many popular languages (e.g. C, C++, Fortran, Python).

MPI stands for "Message Passing Interface", and it parallelizes computational work by providing tools that use a team of processes to solve the problem, and for the team to then share the solution by passing messages amongst one another. MPI can be used to parallelize programs that run locally, by having all processes in the team run locally, or it can be used to parallelize programs across a compute cluster, by running one or more processes per node. MPI can be combined with other parallel programming technologies, e.g. OpenMP.

Basic MPI Calls

It is often said that there are two views of MPI. One view is that MPI is a lightweight protocol with only six commands. The other view is that it is an in-depth protocol with hundreds of specialized commands.

The 6 Basic MPI Commands

• MPI_Init

• MPI_Comm_size

• MPI_Comm_rank

• MPI_Send

• MPI_Recv

• MPI_Finalize

In short, set up an MPI program, get the number of processes participating in

the program, determine which of those processes corresponds to the one calling the

command, send messages, receive messages, and stop participating in a parallel

program.

1. MPI_Init(int *argc, char ***argv): Takes the command line arguments to a program, checks for any MPI options, and passes the remaining command line arguments to the main program.

2. MPI_Comm_size(MPI_Comm comm, int *size): Determines the size of a given MPI communicator. A communicator is a set of processes that work together. For typical programs this is the default MPI_COMM_WORLD, which is the communicator for all processes available to an MPI program.

3. MPI_Comm_rank(MPI_Comm comm, int *rank): Determines the rank of the current process within a communicator. Typically, if an MPI program is being run on N processes, the communicator would be MPI_COMM_WORLD, and the rank would be an integer from 0 to N-1.

4. MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm): Sends the contents of buf, which contains count elements of type datatype, to the process of rank dest in the communicator comm, flagged with the message tag. Typically, the communicator is MPI_COMM_WORLD.

5. MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status): Reads into buf count values of type datatype from process source in communicator comm if a message is sent flagged with tag. Also receives information about the transfer into status.

6. MPI_Finalize(): Handles anything that the current MPI protocol will need to do before exiting a program. Typically it should be the final or near final MPI call of a program. A minimal program that uses all six calls is sketched below.
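To make the list above concrete, here is a minimal sketch (written for this explanation, not taken from the project's code) that uses only the six basic calls: rank 1 sends one integer to rank 0. With MPICH2 such a program would typically be compiled with mpicc and started on at least two processes with mpiexec.

    /* six_calls.c - minimal sketch using only the six basic MPI calls.
       Run with at least two processes, e.g. mpiexec -n 2 ./six_calls */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int size, rank, value = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);                    /* enter the MPI environment   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes   */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* rank of the calling process */

        if (rank == 1) {
            value = 42;                            /* arbitrary example payload   */
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
            printf("rank 0 of %d received %d from rank 1\n", size, value);
        }

        MPI_Finalize();                            /* leave the MPI environment   */
        return 0;
    }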

MPICH2: Message Passing Interface

The MPICH implementation of MPI is one of the most popular versions of MPI. Recently, MPICH was completely rewritten; the new version is called MPICH2 and includes all of MPI, both MPI-1 and MPI-2. This section describes how to obtain, build, and install MPICH2 on a Beowulf cluster. It then describes how to set up an MPICH2 environment in which MPI programs can be compiled, executed, and debugged. MPICH2 is recommended for all Beowulf clusters by many researchers. The original MPICH is still available but is no longer being developed.

Figure 2.7: General MPI Program Structure

PVM

PVM (Parallel Virtual Machine) is a freely available, portable, message-passing library generally implemented on top of sockets. PVM's daemon-based implementation makes it easy to start large jobs on multiple machines. PVM was the first standard for parallel computing to become widely accepted. As a result, there is a large amount of legacy code in PVM still available. PVM also allows for the ability to spawn multiple programs from within the original program, and can easily spawn other processes recursively. It is a simple implementation that works across different platforms. Nowadays, people who have legacy code in PVM that they do not want to modify continue to use it.

JavaSpaces

Java is a versatile computer language that is object oriented and is widely used

in computer science schools around the country. JavaSpaces is Java’s parallel

programming framework which operates by writing entries into a shared space.

Programs can access the space, and either add an entry, read an entry without

removing it, or take an entry.


Java is an interpreted language, and as such typical programs will not run at the same speed as compiled languages such as C/C++ and Fortran. However, much progress has been made in the area of Java efficiency, and many operating systems have what are known as just-in-time compilers. Current claims are that a well optimized Java platform can run Java code at about 90% of the speed of similar C/C++ code. Java has a versatile security policy that is extremely flexible, but it can also be difficult to learn.

JavaSpaces suffers from high latency and a lack of network optimization, but for embarrassingly parallel problems that do not require synchronization, the JavaSpaces model of putting jobs into a space, letting any "worker" take jobs out of the space, and having the workers put results into the space when done leads to very natural approaches to load balancing, and may be well suited to non-coupled, highly distributed computations such as SETI@Home. JavaSpaces does not have any simple mechanism for starting large jobs on multiple machines. JavaSpaces is a good choice if one needs to pass not just data, but also instructions on what to do with that data. It also provides an object-oriented parallel framework.

2.9 Speedup and Efficiency

2.9.1 Speedup

The speedup of a parallel code is how much faster it runs in parallel. If the time it takes to run a code on one processor is T1 and the time it takes to run the same code on N processors is TN, then the speedup is given by

S = T1 / TN

This can depend on many things, but primarily depends on the ratio of the amount of time the code spends communicating to the amount of time it spends computing.

2.9.2 Efficiency

Efficiency is a measure of how much of the available processing power is being used. The simplest way to think of it is as the speedup per processor. This is equivalent to defining efficiency as the ratio of the time to run N models on N processors to the time to run one model on one processor.

E = S / N = T1 / (N × TN)

This gives a more accurate measure of the true efficiency of a parallel program than CPU usage, as it takes into account redundant calculations as well as idle time.
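As a purely illustrative example (the timings are invented, not measured on the proposed cluster): if a code takes T1 = 120 s on one processor and T4 = 40 s on four processors, then

S = T1 / T4 = 120 / 40 = 3 and E = S / N = 3 / 4 = 0.75,

i.e. the speedup is 3 and the four processors are used at 75% efficiency; the remaining 25% is lost to communication, idle time and redundant work.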

2.9.3 Factors affecting performance

The factors which can affect an MPI application’s performance are numerous,

complex and interrelated. Because of this, generalizing about an application’s

performance is usually very difficult. Most of the important factors are briefly

described below.

Platform / Architecture Related

1. cpu - clock speed, number of cpus

2. Memory subsystem - memory and cache configuration, memory-cache-cpu

bandwidth, memory copy bandwidth

3. Network adapters - type, latency and bandwidth characteristics

4. Operating system characteristics - many

Network Related

1. Protocols - TCP/IP, UDP/IP, other

2. Configuration, routing, etc

3. Network tuning options (”no” command)

4. Network contention / saturation

Application Related

1. Algorithm efficiency and scalability

2. Communication to computation ratios

3. Load balance


4. Memory usage patterns

5. I/O

6. Message size used

7. Types of MPI routines used - blocking, non-blocking, point-to-point, collective communications

MPI Implementation Related

1. Message buffering

2. Message passing protocols - eager, rendezvous, other

3. Sender-Receiver synchronization - polling, interrupt

4. Routine internals - efficiency of algorithm used to implement a given routine

Network Contention

1. Network contention occurs when the volume of data being communicated

between MPI tasks saturates the bandwidth of the network.

2. Saturation of the network bandwidth results in an overall decrease of communications performance for all tasks.

Because of these challenges and complexities, performance analysis tools are essential to optimizing an application's performance. They can assist in understanding what a program is "really doing" and suggest how its performance should be improved.

The primary issue with speedup is the communication to computation ratio.

To get a higher speedup,

• Communicate less

• Compute more

• Make connections faster

• Communicate faster


The amount of time the computer requires to make a connection to another computer is referred to as its latency, and the rate at which data can be transferred is the bandwidth. Both can have an impact on the speedup of a parallel code.

Collective communication can also help speed up the code. As an example, imagine you are trying to tell a number of people about a party. One method would be to tell each person individually; another would be to tell people to "spread the word". Collective communication refers to improving communication speed by having any node that already has the information being sent participate in sending that information to other nodes. Not all protocols allow for collective communication, and even protocols which do may not require a vendor to implement it. An example is the broadcast routine in MPI. Many vendor-specific versions of MPI provide broadcast routines which use a "tree" method of communication. The more common implementations found on most clusters (openMPI, LAM-MPI and MPICH) simply have the sending machine contact each receiving machine in turn.
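As an illustrative sketch (written for this explanation, not code from the project), the broadcast routine mentioned above looks like this in C: rank 0 "spreads the word" and every process ends up with the same value, while the MPI implementation decides whether a tree or a sequential contact pattern is used underneath.

    /* bcast_sketch.c - illustrative collective communication with MPI_Bcast. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, message = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            message = 2012;   /* arbitrary value known only to the root at first */

        /* Every process calls MPI_Bcast; root 0 sends, all others receive. */
        MPI_Bcast(&message, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d now holds the value %d\n", rank, message);

        MPI_Finalize();
        return 0;
    }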

2.9.4 Amdahl’s Law

Amdahl's law, also known as Amdahl's argument, is named after computer architect Gene Amdahl, and is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.

Figure 2.8: Speedup of a program using multiple processors


OverallSpeedup = 1 / ((1 - f) + f/s)

where f is the fraction of the code that is parallelized and s is the speedup of the enhanced (parallel) portion.

The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized, while the remaining portion of 19 hours (95%) can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20x, as the diagram illustrates.
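Substituting the numbers from this example into the formula, with f = 0.95 and letting the speedup s of the parallel portion grow without bound:

OverallSpeedup = 1 / ((1 - 0.95) + 0.95/s) approaches 1 / 0.05 = 20 as s tends to infinity,

which is exactly the 20x limit stated above.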

2.10 Maths Libraries

For computer programmers, calling pre-written subroutines to do complex calculations dates back to early computing history. With minimal effort, any developer can write a function that multiplies two matrices, but these same developers would not want to re-write that function for every new program that requires it. Further, with good theory and practice, one can optimize practically any algorithm to run several times faster, though it would typically take several hours to days to match the performance of a highly optimized algorithm.

Scientific computing, and the use of math libraries, was traditionally limited to research labs and engineering disciplines. In recent decades, this niche computing market has blossomed across a variety of industries. While research institutes and universities are still the largest users of math libraries, especially in the High Performance Computing (HPC) arena, industries like financial services and biotechnology are increasingly turning to math libraries as well. Even the business analytics arena around business intelligence and data mining is starting to leverage the existing tools. From bond pricing and portfolio optimization to exotic instrument evaluations and exchange rate analysis, the financial services industry has a wide variety of requirements for complex mathematical algorithms. Similarly, the biology disciplines have aligned with statisticians to analyze experimental procedures which produce hundreds of thousands of results.

The core area of the math library market implements linear algebra algorithms.


More specialized functions, such as numerical optimization and time series fore-

casting, are often invoked explicitly by users. In contrast, linear algebra functions

are often used as key background components for solving a wide variety of prob-

lems. Eigen analysis, matrix inversion and other linear calculations are essential

components in nearly every statistical analysis in use today including regression,

factor analysis, discriminant analysis, etc. The most basic suite of such algorithms

is the BLAS (Basic Linear Algebra Subprograms) libraries for basic vector and

matrix operations.

BLAS

BLAS is the Basic Linear Algebra Subprograms. It is a set of routines used to

perform common low level matrix manipulations such as rotations, or dot prod-

ucts. BLAS should be optimized to run on given hardware. This can be done by

getting a vendor-supplied package (i.e., provided by Sun or Intel), or else by using

the ATLAS software.
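As a minimal sketch (assuming the CBLAS C interface shipped with ATLAS or a vendor BLAS is available under the usual header name cblas.h; the link flags vary between installations), a matrix-matrix multiply C = A*B can be delegated to the optimized DGEMM routine instead of being hand-written:

#include <stdio.h>
#include <cblas.h>                /* C interface provided by ATLAS / vendor BLAS */

int main(void)
{
    /* C = alpha*A*B + beta*C with 2x2 matrices stored in row-major order */
    double A[4] = { 1.0, 2.0,
                    3.0, 4.0 };
    double B[4] = { 5.0, 6.0,
                    7.0, 8.0 };
    double C[4] = { 0.0, 0.0,
                    0.0, 0.0 };

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,          /* M, N, K       */
                1.0, A, 2,        /* alpha, A, lda */
                B, 2,             /* B, ldb        */
                0.0, C, 2);       /* beta, C, ldc  */

    printf("C = [ %g %g ; %g %g ]\n", C[0], C[1], C[2], C[3]);
    return 0;
}

On a typical ATLAS installation this would be compiled with something like gcc example.c -lcblas -latlas, although the exact library names depend on how BLAS was installed.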

ATLAS

ATLAS is the Automatically Tuned Linear Algebra Software package. It is soft-

ware that attempts to tune the BLAS implementation that it provides to hardware.

ATLAS also provides a very minimal LAPACK implementation, so it is better to

install the complete LAPACK package separately.

LAPACK

LAPACK is the Linear Algebra Package. It extends BLAS to provide higher level

linear algebra routines such as computing eigenvalues, or finding the solutions to

a system of linear equations. LAPACK is a library of Fortran 77 subroutines for

solving the most commonly occurring problems in numerical linear algebra. It has

been designed to be efficient on a wide range of modern high-performance comput-

ers. The name LAPACK is an acronym for Linear Algebra PACKage. Previously

LINPACK was used for benchmarking. LINPACK is a collection of Fortran sub-

routines that analyse and solve linear equations and linear least-squares problems.

But now it is completely superseded by LAPACK.

Problems that LAPACK can Solve

LAPACK can solve systems of linear equations, linear least squares problems,

eigenvalue problems and singular value problems. LAPACK can also handle

many associated computations such as matrix factorizations or estimating condition numbers.

LAPACK contains driver routines for solving standard types of problems, com-

putational routines to perform a distinct computational task, and auxiliary rou-

tines to perform a certain subtask or common low-level computation. Each driver

routine typically calls a sequence of computational routines. Taken as a whole,

the computational routines can perform a wider range of tasks than are covered

by the driver routines. Many of the auxiliary routines may be of use to numerical

analysts or software developers, so the Fortran source for these routines is documented with the same level of detail used for the LAPACK routines and driver

routines.

Dense and band matrices are provided for, but not general sparse matrices. In

all areas, similar functionality is provided for real and complex matrices.
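Since LAPACK is a Fortran 77 library, its routines can be called from C by declaring the Fortran symbol directly. The sketch below (a minimal example, assuming the common trailing-underscore name-mangling convention and linking with something like -llapack -lblas) uses DGESV to solve a small system of linear equations Ax = b:

#include <stdio.h>

/* Fortran LAPACK routine: solves A*X = B by LU factorization with partial
 * pivoting.  The trailing underscore is the usual, but compiler-dependent,
 * Fortran name-mangling convention. */
extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void)
{
    /* Solve [3 1; 1 2] x = (9, 8); the expected solution is x = (2, 3).
     * LAPACK expects column-major storage. */
    double A[4] = { 3.0, 1.0,     /* first column  */
                    1.0, 2.0 };   /* second column */
    double b[2] = { 9.0, 8.0 };
    int n = 2, nrhs = 1, lda = 2, ldb = 2, ipiv[2], info;

    dgesv_(&n, &nrhs, A, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%g, %g)\n", b[0], b[1]);
    else
        printf("dgesv failed, info = %d\n", info);
    return 0;
}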

2.11 HPL Benchmark

HPL is a software package that solves a (random) dense linear system in double

precision (64 bits) arithmetic on distributed-memory computers. It can thus be

regarded as a portable as well as freely available implementation of the High

Performance Computing Linpack Benchmark.

The algorithm used by HPL can be summarized by the following keywords:

Two-dimensional block-cyclic data distribution - Right-looking variant of the LU

factorization with row partial pivoting featuring multiple look-ahead depths - Re-

cursive panel factorization with pivot search and column broadcast combined -

Various virtual panel broadcast topologies - bandwidth reducing swap-broadcast

algorithm - backward substitution with look-ahead of depth 1.

The HPL package provides a testing and timing program to quantify the ac-

curacy of the obtained solution as well as the time it took to compute it. The

best performance achievable by this software on a system depends on a large variety

of factors. Nonetheless, with some restrictive assumptions on the interconnection

network, the algorithm described here and its attached implementation are scal-

able in the sense that their parallel efficiency is maintained constant with respect

to the per processor memory usage.

The HPL software package requires the availability of an implementation of

the Message Passing Interface (MPI) on the system. An implementation of either the

Basic Linear Algebra Subprograms BLAS or the Vector Signal Image Processing

Library VSIPL is also needed. Machine-specific as well as generic implementations

of MPI, the BLAS and VSIPL are available for a large variety of systems.


2.11.1 Description of the HPL.dat File

Line 1: (unused) Typically one would use this line for informative purposes. For example,

it could be used to summarize the content of the input file. By default this line

reads:

HPL Linpack benchmark input file

Line 2: (unused) same as line 1. By default this line reads:

Innovative Computing Laboratory, University of Tennessee

Line 3: the user can choose where the output should be redirected to. In the

case of a file, a name is necessary, and this is the line where one wants to specify

it. Only the first name on this line is significant. By default, the line reads:

HPL.out output file name (if any)

This means that if one chooses to redirect the output to a file, the file will be called

"HPL.out". The rest of the line is unused, and this space can be used to put some informative

comment on the meaning of this line.

Line 4: This line specifies where the output should go. The line is formatted,

it must begin with a positive integer, the rest is insignificant. Three choices are possible for the positive integer: 6 means that the output will go to the standard output,

7 means that the output will go to the standard error. Any other integer means

that the output should be redirected to a file, whose name has been specified in

the line above. This line by default reads:

6 device out (6=stdout,7=stderr,file)

which means that the output generated by the executable should be redirected to

the standard output.

Line 5: This line specifies the number of problem sizes to be executed. This

number should be less than or equal to 20. The first integer is significant, the rest

is ignored. If the line reads:

3 # of problems sizes (N)

this means that the user is willing to run 3 problem sizes that will be specified in

the next line.

Line 6: This line specifies the problem sizes one wants to run. Assuming the

line above started with 3, the 3 first positive integers are significant, the rest is

ignored. For example:


3000 6000 10000 Ns

means that one wants xhpl to run 3 (specified in line 5) problem sizes, namely

3000, 6000 and 10000.

Line 7: This line specifies the number of block sizes to be run. This num-

ber should be less than or equal to 20. The first integer is significant, the rest is

ignored. If the line reads:

5 # of NBs

this means that the user is willing to use 5 block sizes that will be specified in the

next line.

Line 8: This line specifies the block sizes one wants to run. Assuming the line

above started with 5, the 5 first positive integers are significant, the rest is ignored.

For example:

80 100 120 140 160 NBs

means that one wants xhpl to use 5 (specified in line 7) block sizes, namely 80,

100, 120, 140 and 160.

Line 9: This line specifies how the MPI processes should be mapped onto the

nodes of the platform. There are currently two possible mappings, namely row- and

column-major. This feature is mainly useful when these nodes are themselves

multi-processor computers. A row-major mapping is recommended.

0 PMAP process mapping (0=Row-,1=Column-major)

Line 10: This line specifies the number of process grids to be run. This num-

ber should be less than or equal to 20. The first integer is significant, the rest is

ignored. If the line reads:

2 # of process grids (P x Q)

this means that it will try 2 process grid sizes that will be specified in the next line.

Line 11-12: These two lines specify the number of process rows and columns

of each grid to run on. Assuming the line above (10) started with 2, the 2 first

positive integers of those two lines are significant, the rest is ignored. For example:

1 2 Ps

6 8 Qs

means that one wants to run xhpl on 2 process grids (line 10), namely 1-by-6 and

2-by-8. Note: In this example, it is required then to start xhpl on at least 16


nodes (max of Pi-by-Qi). The runs on the two grids will be consecutive. If one

was starting xhpl on more than 16 nodes, say 52, only 6 would be used for the

first grid (1x6) and then 16 (2x8) would be used for the second grid. The fact

that you started the MPI job on 52 nodes, will not make HPL use all of them. In

this example, only 16 would be used. If one wants to run xhpl with 52 processes

one needs to specify a grid of 52 processes, for example the following lines would

do the job:

4 2 Ps

13 8 Qs

Line 13: This line specifies the threshold to which the residuals should be com-

pared. The residuals should be of order 1, but are in practice slightly less

than this, typically 0.001. This line is made of a real number, the rest is not

significant. For example:

16.0 threshold

In practice, a value of 16.0 will cover most cases. For various reasons, it is possible

that some of the residuals become slightly larger, say for example 35.6. xhpl will

flag those runs as failed, however they can be considered as correct. A run should

be considered as failed if the residual is a few orders of magnitude bigger than 1, for example 10^6 or more. Note: if one were to specify a threshold of 0.0, all tests

would be flagged as failed, even though the answer is likely to be correct. It is

allowed to specify a negative value for this threshold, in which case the checks will

be by-passed, no matter what the threshold value is, as soon as it is negative. This

feature allows one to save time when performing a lot of experiments, say for instance

during the tuning phase. Example:

-16.0 threshold

The remaining lines allow one to specify algorithmic features. xhpl will run all possible combinations of those for each problem size, block size and process grid combination. This is handy when one looks for an "optimal" set of parameters. To understand this a little better, let us first say a few words about the algorithm imple-

mented in HPL. Basically this is a right-looking version with row-partial pivoting.

The panel factorization is matrix-matrix operation based and recursive, dividing

the panel into NDIV subpanels at each step. This part of the panel factorization is

denoted below by ”recursive panel fact. (RFACT)”. The recursion stops when the

current panel is made of less than or equal to NBMIN columns. At that point, xhpl

uses a matrix-vector operation based factorization denoted below by ”PFACTs”.


Classic recursion would then use NDIV=2, NBMIN=1. There are essentially 3 nu-

merically equivalent LU factorization algorithm variants (left-looking, Crout and

right-looking). In HPL, one can choose every one of those for the RFACT, as well

as the PFACT. The following lines of HPL.dat allow one to set those parameters.

Lines 14-21: (Example 1)

3 # of panel fact

0 1 2 PFACTs (0=left, 1=Crout, 2=Right)

4 # of recursive stopping criterium

1 2 4 8 NBMINs (>= 1)

3 No. of panels in recursion

2 3 4 NDIVs

3 No. of recursive panel fact.

0 1 2 RFACTs (0=left, 1=Crout, 2=Right)

This example would try all variants of PFACT, 4 values for NBMIN, namely 1, 2,

4 and 8, 3 values for NDIV namely 2, 3 and 4, and all variants for RFACT.

Lines 14-21: (Example 2)

2 # of panel fact

2 0 PFACTs (0=left, 1=Crout, 2=Right)

2 # of recursive stopping criterium

4 8 NBMINs (>= 1)

1 # of panels in recursion

2 NDIVs

1 # of recursive panel fact.

2 RFACTs (0=left, 1=Crout, 2=Right)

This example would try 2 variants of PFACT namely right looking and left look-

ing, 2 values for NBMIN, namely 4 and 8, 1 value for NDIV namely 2, and one

variant for RFACT.

In the main loop of the algorithm, the current panel of columns is broadcast

in process rows using a virtual ring topology. HPL offers various choices and one

most likely wants to use the increasing ring modified topology, encoded as 1; 3 and 4 are also good choices.

Lines 22-23: (Example 1)

1 # of broadcast

1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

This will cause HPL to broadcast the current panel using the increasing ring mod-

ified topology.


Lines 22-23: (Example 2)

2 # of broadcast

0 4 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

This will cause HPL to broadcast the current panel using the increasing ring vir-

tual topology and the long message algorithm.

Lines 24-25 allow one to specify the look-ahead depth used by HPL. A depth of 0

means that the next panel is factorized after the update by the current panel is

completely finished. A depth of 1 means that the next panel is immediately fac-

torized after being updated. The update by the current panel is then finished. A

depth of k means that the k next panels are factorized immediately after being

updated. The update by the current panel is then finished. It turns out that a

depth of 1 seems to give the best results, but may need a large problem size before

one can see the performance gain. So use 1, if you do not know better, otherwise

you may want to try 0. Look-ahead of depths 3 and larger will probably not give

better results.

Lines 24-25: (Example 1):

1 No. of lookahead depth

1 DEPTHs (>= 0)

This will cause HPL to use a look-ahead of depth 1.

Lines 24-25: (Example 2):

2 No. of lookahead depth

0 1 DEPTHs (>= 0)

This will cause HPL to use a look-ahead of depths 0 and 1.

Lines 26-27 allow one to specify the swapping algorithm used by HPL for all tests.

There are currently two swapping algorithms available, one based on ”binary ex-

change” and the other one based on a ”spread-roll” procedure (also called ”long”

below). For large problem sizes, this last one is likely to be more efficient. The

user can also choose to mix both variants, that is ”binary-exchange” for a number

of columns less than a threshold value, and then the ”spread-roll” algorithm. This

threshold value is then specified on Line 27.

Lines 26-27: (Example 1):

1 SWAP (0=bin-exch,1=long,2=mix)

60 swapping threshold

This will cause HPL to use the ”long” or ”spread-roll” swapping algorithm. Note

that a threshold is specified in that example but not used by HPL.


Lines 26-27: (Example 2):

2 SWAP (0=bin-exch,1=long,2=mix)

60 swapping threshold

This will cause HPL to use the ”long” or ”spread-roll” swapping algorithm as

soon as there is more than 60 columns in the row panel. Otherwise, the ”binary-

exchange” algorithm will be used instead.

Line 28 allows one to specify whether the upper triangle of the panel of columns

should be stored in no-transposed or transposed form. Example:

0 L1 in (0=transposed,1=no-transposed) form

Line 29 allows one to specify whether the panel of rows U should be stored in no-transposed or transposed form. Example:

0 U in (0=transposed,1=no-transposed) form

Line 30 enables / disables the equilibration phase. This option will not be used

unless 1 or 2 are selected in Line 26. Example:

1 Equilibration (0=no,1=yes)

Line 31 allows one to specify the alignment in memory for the memory space allo-

cated by HPL. On modern machines, one probably wants to use 4, 8 or 16. This

may result in a tiny amount of memory wasted. Example:

8 memory alignment in double (> 0)

2.11.2 Guidelines for HPL.dat configuration

1. Figure out a good block size for the matrix multiply routine. The best

method is to try a few out. If the block size used by the matrix-matrix

multiply routine is known, a small multiple of that block size will do fine.

This particular topic is discussed in the FAQs section.

2. The process mapping should not matter if the nodes of platform are sin-

gle processor computers. If these nodes are multi-processors, a row-major

mapping is recommended.

3. HPL likes ”square” or slightly flat process grids. Unless very small process

grid is used, stay away from the 1-by-Q and P-by-1 process grids. This

particular topic is also discussed in the FAQs section.


4. Panel factorization parameters: a good start are the following for the lines

14-21:

1 No. of panel fact

1 PFACTs (0=left, 1=Crout, 2=Right)

2 No. of recursive stopping criterium

4 8 NBMINs (>= 1)

1 No. of panels in recursion

2 NDIVs

1 No. of recursive panel fact.

2 RFACTs (0=left, 1=Crout, 2=Right)

5. Broadcast parameters: at this time it is far from obvious to me what the

best setting is, so I would probably try them all. If I had to guess I would

probably start with the following for the lines 22-23:

2 No. of broadcast

1 3 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

The best broadcast depends on the problem size and hardware performance. Usually 4 or 5 may be competitive for machines featuring very fast nodes compared to the network.

6. Look-ahead depth: as mentioned above 0 or 1 are likely to be the best

choices. This also depends on the problem size and machine configuration,

so I would try ”no look-ahead (0)” and ”look-ahead of depth 1 (1)”. That

is for lines 24-25:

2 No. of lookahead depth

0 1 DEPTHs (>= 0)

7. Swapping: one can select only one of the three algorithms in the input file. Theoretically, mix (2) should win, however long (1) might just be good enough. The difference should be small between those two assuming a swapping threshold of the order of the block size (NB) selected. If this threshold is very large, HPL will use bin-exch (0) most of the time, and if it is very small (< NB) the long algorithm will be used. Assuming a threshold of the order of the block size, a reasonable choice for lines 26-27 is:

2 SWAP (0=bin-exch,1=long,2=mix)

60 swapping threshold

I would also try the long variant. For a very small number of processes in

every column of the process grid (say < 4), very little performance difference

should be observable.


8. Local storage: I do not think Line 28 matters. Pick 0 if in doubt. Line 29

is more important. It controls how the panel of rows should be stored.

No doubt 0 is better. The caveat is that in that case the matrix-multiply

function is called with ( Notrans, Trans, ... ), that is C := C - A*B^T. Unless

the computational kernel used has a very poor (with respect to performance)

implementation of that case, and is much more efficient with ( Notrans,

Notrans, ... ) just pick 0 as well. So, the choice:

0 L1 in (0=transposed,1=no-transposed) form

0 U in (0=transposed,1=no-transposed) form

9. Equilibration: It is hard to tell whether equilibration should always be per-

formed or not. Not knowing much about the random matrix generated and

because the overhead is so small compared to the possible gain, I turn it on

all the time.

1 Equilibration (0=no,1=yes)

10. For alignment, 4 should be plenty, but just to be safe, one may want to pick

8 instead.

8 memory alignment in double (> 0)

2.12 HPCC Challenge Benchmark

HPCC was developed to study future Petascale computing systems, and is in-

tended to provide a realistic measurement of modern computing workloads. HPCC

is made up of seven common computational kernels: STREAM, HPL, DGEMM

(matrix multiply), PTRANS (parallel matrix transpose), FFT, RandomAccess,

and b_eff (bandwidth/latency tests). The benchmarks attempt to measure high

and low spatial and temporal locality space. The tests are scalable, and can be

run on a wide range of platforms, from single processors to the largest parallel

supercomputers.

The HPCC benchmarks test three particular regimes: local or single processor,

embarrassingly parallel, and global, where all processors compute and exchange

data with each other. STREAM measures a processor’s memory bandwidth. HPL

is the LINPACK TPP (Toward Peak Performance) benchmark; RandomAccess

measures the rate of random updates of memory; PTRANS measures the rate of

transfer of very large arrays of data from memory; b_eff measures the latency and

bandwidth of increasingly complex communication patterns.

All of the benchmarks are run in two modes: base and optimized. The base


run allows no source modifications of any of the benchmarks, but allows gener-

ally available optimized libraries to be used. The optimized benchmark allows

significant changes to the source code. The optimizations can include alternative

programming languages and libraries that are specifically targeted for the platform

being tested.

The HPC Challenge benchmark consists at this time of 7 benchmarks: HPL,

STREAM, RandomAccess, PTRANS, FFTE, DGEMM and b_eff Latency/Bandwidth.

HPL ( system performance )

The Linpack TPP benchmark which measures the floating point rate of execu-

tion for solving a randomly generated dense linear system of equations in double

floating-point precision (IEEE 64-bit) arithmetic using MPI. The linear system

matrix is stored in a two-dimensional block-cyclic fashion and multiple variants

of code are provided for computational kernels and communication patterns. The

solution method is LU factorization through Gaussian elimination with partial

row pivoting followed by a backward substitution. Unit: Tera Flops per Second

PTRANS (A = A + B^T) ( system performance )

Implements a parallel matrix transpose for two-dimensional block-cyclic storage.

It is an important benchmark because it exercises the communications of the com-

puter heavily on a realistic problem where pairs of processors communicate with

each other simultaneously. It is a useful test of the total communications capacity

of the network. Unit: Giga Bytes per Second

RandomAccess ( system performance )

Global RandomAccess, also called GUPs, measures the rate at which the com-

puter can update pseudo-random locations of its memory - this rate is expressed

in billions (giga) of updates per second (GUP/s). Unit: Giga Updates per Second

FFTE ( system performance )

It measures the floating point rate of execution of double precision complex one-

dimensional Discrete Fourier Transform (DFT). Global FFTE performs the same

test as FFTE but across the entire system by distributing the input vector in block

fashion across all the processes. Unit: Giga Flops per Second

STREAM ( system performance - derived )

The Embarrassingly Parallel STREAM benchmark is a simple synthetic bench-


mark program that measures sustainable memory bandwidth and the correspond-

ing computation rate for simple numerical vector kernels. It is run in embarrass-

ingly parallel manner - all computational processes perform the benchmark at the

same time, the arithmetic average rate is multiplied by the number of processes

for this value. ( EP-STREAM Triad * MPI Processes ) Unit: Giga Bytes per

Second
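Conceptually, the heart of STREAM is a handful of simple vector kernels; the sketch below shows only the "triad" kernel and is not the official benchmark code, which also handles timing, repetitions and result verification:

#include <stdlib.h>

#define N 10000000L               /* arrays must be much larger than the caches */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double scalar = 3.0;

    if (!a || !b || !c)
        return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    /* STREAM "triad": two loads, one store and two flops per element,
     * so the sustained rate is dominated by memory bandwidth */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];

    /* the real benchmark times this loop and reports bytes moved per second */
    free(a); free(b); free(c);
    return 0;
}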

DGEMM ( per process )

The Embarrassingly Parallel DGEMM benchmark measures the floating-point ex-

ecution rate of double precision real matrix-matrix multiply performed by the

DGEMM subroutine from the BLAS (Basic Linear Algebra Subprograms). It is

run in embarrassingly parallel manner - all computational processes perform the

benchmark at the same time, the arithmetic average rate is reported. Unit: Giga

Flops per Second

Effective bandwidth benchmark (b_eff)

The effective bandwidth benchmark is a set of tests to measure latency and bandwidth

of a number of simultaneous communication patterns.

Random Ring Bandwidth ( per process )

Randomly Ordered Ring Bandwidth, reports bandwidth achieved in the ring com-

munication pattern. The communicating processes are ordered randomly in the

ring (with respect to the natural ordering of the MPI default communicator). The

result is averaged over various random assignments of processes in the ring. Unit:

Giga Bytes per second

Random Ring Latency ( per process )

Randomly-Ordered Ring Latency, reports latency in the ring communication pat-

tern. The communicating processes are ordered randomly in the ring (with respect

to the natural ordering of the MPI default communicator) in the ring. The re-

sult is averaged over various random assignments of processes in the ring. Unit:

micro-seconds

Giga-updates per second (GUPS) is a measure of computer performance. GUPS

is a measurement of how frequently a computer can issue updates to randomly

generated RAM locations. GUPS measurements stress the latency and especially

bandwidth capabilities of a machine.
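Conceptually, the RandomAccess kernel repeatedly XORs pseudo-random values into pseudo-random table locations; the simplified single-process sketch below (with an ordinary xorshift generator standing in for the official random stream, and no timing) illustrates the access pattern being measured:

#include <stdint.h>
#include <stdlib.h>

#define TABLE_SIZE (1UL << 24)          /* must be a power of two          */
#define NUPDATES   (4UL * TABLE_SIZE)   /* the benchmark does 4 per entry  */

int main(void)
{
    uint64_t *table = malloc(TABLE_SIZE * sizeof *table);
    uint64_t ran = 1;

    if (!table)
        return 1;

    for (uint64_t i = 0; i < TABLE_SIZE; i++)
        table[i] = i;

    for (uint64_t i = 0; i < NUPDATES; i++) {
        /* simple xorshift generator standing in for the official stream */
        ran ^= ran << 13;  ran ^= ran >> 7;  ran ^= ran << 17;
        table[ran & (TABLE_SIZE - 1)] ^= ran;   /* the random update itself */
    }

    /* GUP/s = number of updates / elapsed time / 1e9 */
    free(table);
    return 0;
}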


Chapter 3

Design and Implementation

3.1 Beowulf Clusters: A Low cost alternative

Beowulf is not a particular product. It is a concept for clustering varying numbers

of small, relatively inexpensive computers running the Linux operating system.

The goal of Beowulf clustering is to create a parallel-processing supercomputer

environment at a price well below that of conventional supercomputers.

Figure 3.1: The Schematic structure of proposed cluster

A Beowulf Cluster is a PC cluster that normally runs under Linux OS. Each

PC (node) is dedicated to the work of the cluster and connected through a net-

work with other nodes. Figure 3.1 schematically shows the structure of a proposed

cluster. In this cluster, a master node controls other worker nodes by communicat-

ing through the network using the Message Passing Interface (MPI). A Proposed


cluster will have better price/performance ratio and scalability than other parallel

computers due to the use of off-the-shelf components and Linux OS. It is easy and

economical to add more nodes as needed without changing software programs.

3.2 Logical View of proposed Cluster

The primary and most often used view is termed the logical view, and this is the view that anybody will generally be interacting with when using a cluster. In this view,

the physical components are categorized and displayed in a layered manner, that

is, here the primary concern is the parallel applications, message passing library,

OS and interconnect.

Figure 3.2: Logical view of proposed cluster

3.3 Hardware Configuration

As it has been previously indicated a cluster is comprised of computers intercon-

nected through a LAN. Let’s talk first about the requirements of this cluster in

terms of hardware and then about the software that will run on this system.

3.3.1 Master Node

The master server provides access to the primary network and ensures availability of the cluster. The server has a Fast Ethernet connection to the network in order to

better keep up with the high speed of the PCs. Any system from Intel, AMD or


other vendor can be used as server. Here PC with Intel i7-2600 processor and 4

GB RAM is used as server.

3.3.2 Compute Nodes

Building custom PCs from commodity off-the-shelf components requires a lot of work to assemble the cluster, but the nodes can be fine-tuned as per the need. One can buy generic PCs and shelves; keyboard switches may be wanted for smaller configurations. For larger configurations, a better solution would be to use the serial ports from each machine and connect them to a terminal server, or custom rack-mount nodes can even be used; these are more expensive but save space, and may complicate cooling due to closely packed components. This setup uses old unused PCs from the college; here, for testing purposes, similar PCs with an Intel i7-2600 processor and 4 GB RAM are used.

3.3.3 Network

As it has been previously indicated the computers in a cluster communicate us-

ing a network interconnection as can be seen in the Figure 3.3. The master and

the compute nodes have NICs and all the computers are connected to a switch

to perform the delivery of messages. The cost per port of an Ethernet Switch is

about four times larger than an Ethernet Hub but an Ethernet Switch will be used

due to the following reasons: An Ethernet Hub is a network device that acts as

a broadcast bus, where an input signal is amplified and distributed to all ports.

However only a couple of computers can communicate properly at once and if two

or more computers simultaneously send packets a collision will occur. Therefore,

the bandwidth of an Ethernet Hub is equivalent to the bandwidth of the communi-

cation link, 10Mb/s for standard Ethernet, 100Mb/s for Fast Ethernet and 1Gb/s

for Gigabit Ethernet. An Ethernet Switch provides more accumulated bandwidth

by allowing multiple simultaneous communications. If there are no conflicts in the

output ports, the Ethernet Switch can send multiple packets simultaneously. A

major disadvantage that clusters have compared to supercomputers is its latency.

The bandwidth of each computer could be increased using multiple NICs, which

is possible through what is known in Linux as Channel Bonding. It consists of the

simulation of a network interface linking multiple NICs so that applications will

only see a single interface. The access to the cluster is often made remotely, that

is the reason why the frontend will have two NICs, one to access the Internet and

another one to connect to other nodes in the cluster. The maximum bandwidth


provided by the college Ethernet is 100 Mb/s and the minimum latency for Fast Ethernet is 80 microseconds. All cluster machines are connected through the college's

Ethernet.

Figure 3.3: The Network interconnection

3.4 Softwares

The system that has been designed and implemented uses the Linux kernel with

GNU applications. These applications range from servers to compilers.

1. Operating System: The operating system used is Linux based CentOS 6.2.

It is an enterprise-quality operating system, because it is based on the source

code of Red Hat Enterprise Linux, which has been tested and stabilized ex-

tensively prior to release. On the other hand, CentOS(Community ENTer-

prise Operating System) is completely free, open source, and no cost, offering

all of the user support and features of a community-run Linux distribution.

The version 6 has been chosen because it is the latest stable version. The op-

erating system that runs in the frontend includes the standard applications

of the distribution in addition to others required for the construction of the

cluster. The specific applications included for the construction of the cluster

are message-passing libraries, compilers, servers and software for monitoring

the resources of the cluster.

2. Message-passing libraries: In the parallel computation in order to perform

task resolutions and intensive calculations one must divide and distribute

independent tasks to the different computers using the message-passing li-

braries. There are several libraries of this type, the most well-known being

MPI and PVM (Parallel Virtual Machine). The system integrates MPI.


The reason for this choice is that it is the most commonly used library by

the numerical analysis community for the passing of messages. Specifically

MPICH2 has been used in proposed system.

3. Compilers: Languages commonly used in parallel computing are C, C++,

Python and FORTRAN. For this reason the four programming languages

are supported within the system that has been developed integrating the

compilers gcc, g++ and gfortran.

4. Compute nodes: The operating system that runs in the nodes is basic Cen-

tOS 6.2 without GUI. It integrates the kernel and basic services which are

necessary for an adequate performance of the nodes. The unnecessary software which is not needed for this purpose has been discarded. Therefore, MPICH2 is included, as well as the compilers gcc, g++ and gfortran.

3.4.1 MPICH2

MPICH2 is architected so that a number of communication infrastructures can

be used. These are called ”devices.” The device that is most relevant for the

Beowulf environment is the channel device (also called ”ch3” because it is the

third version of the channel approach for implementing MPICH); this supports a

variety of communication methods and can be built to support the use of both

TCP over sockets and shared memory. In addition, MPICH2 uses a portable in-

terface to process management systems, providing access both to external process

managers (allowing the process managers direct control over starting and running

the MPI processes) and to the MPD scalable process manager that is included

with MPICH2. To run a first MPI program, carry out the following steps to install MPICH2:

1. Download mpich2-1.4.1p1.tar.gz from www.mcs.anl.gov/mpi/mpich and copy

at /home/beowulf/sw/

2. Extract the contents in /home/beowulf/sw/

$tar xvfz mpich2-1.4.1p1.tar.gz

3. Create folder for installation

$mkdir /opt/mpich2-1.4.1p1

4. Create build directory

$mkdir /tmp/mpich2-1.4.1p1
$cd /tmp/mpich2-1.4.1p1


5. Configure MPICH2, capturing the output in configure.log. Most users should specify a prefix for the installation path when configuring:

$/home/beowulf/sw/mpich2-1.4.1p1/configure --prefix=/opt/mpich2-1.4.1p1 2>&1 | tee configure.log

6. By default, this creates the channel device for communication with TCP

over sockets. Now build.

$make 2>&1 | tee make.log

7. Install MPICH2 commands

$make install 2>&1 | tee install.log

8. Add the '<prefix>/bin' directory to the PATH by adding the line below to the .bashrc file in the home directory:

$vi ~/.bashrc
export PATH=<prefix>/bin:$PATH

9. Test mpich2 installation

$which mpicc

SSH login without password

Public key authentication allows one to log in to a remote host via the SSH proto-

col without a password and is more secure than password-based authentication.

Try creating a passwordless connection from master to node1 using public-key

authentication.

Create key

Press ENTER at every prompt.

[root@master]#ssh-keygen

Generating public/private rsa key pair.

Enter file in which to save the key (/home/user/.ssh/id_rsa):

Enter passphrase (empty for no passphrase):

Enter same passphrase again:

Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.

The key fingerprint is:

b2:ad:a0:80:85:ad:6c:16:bd:1c:e7:63:4f:a0:00:15 user@host

The key’s randomart image is:


[root@master]#

For added security, the key itself should be protected using a strong passphrase.

If a passphrase is used to protect the key, ssh-agent can be used to cache the

passphrase.

Copy key to remote host

[root@master]# ssh-copy-id root@node1

root@node1’s password:

Now try logging into the machine, with ”ssh ’root@node1’”, and check in:

.ssh/authorized_keys

to make sure we haven’t added extra keys that you weren’t expecting.

[root@master]#

Login to remote host

Note that no password is required.

root@master# ssh root@node1

Last login: Tue May 18 12:47:53 2012 from 10.1.11.210

[root@node1]#

It is also necessary to disable the firewall on all cluster machines so that the cluster can work seamlessly. To achieve this, first log in as the root user, then enter the following three commands to disable the firewall.

#service iptables save

#service iptables stop

#chkconfig iptables off

Now MPI programs can be run on the cluster.

Running MPI Program

The following assumes that MPICH2 is installed on all cluster machines running

CentOS 6.2 and that every machine has access to every other via the mpiexec command. It is also assumed that the command line (a terminal window) is used to compile, copy, and run the code.

steps:


Compile

Assuming that the code to compile is ready (if only binary executables are available, proceed to step 2, copy), an executable needs to be created. This involves

compiling code with the appropriate compiler, linked against the MPI libraries.

It is possible to pass all options through a standard cc or f77 command, but

MPICH provides a ”wrapper” (mpicc for cc/gcc, mpicxx/mpic++ for c++/g++

on UNIX/Linux and mpif77 for f77) that appropriately links against the MPI li-

braries and sets the appropriate include and library paths.

Example: hello.c (Use the vi text editor to create the file hello.c)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    char name[80];
    int length;

    MPI_Init(&argc, &argv);   /* note that argc and argv are passed by address */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &length);
    printf("Hello MPI: processor %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

After saving the above example file, compile the program using the mpicc command.

$mpicc -o hello hello.c

The "-o" option provides an output file name; otherwise the executable would be saved as "a.out". Be careful to provide an executable name if the "-o" option is used: many programmers have deleted part of their source code by accidentally giving their source file as the output file name. If the file name is typed correctly

and there are no bugs in the code, it will successfully compile the code, and an

”ls” command should show that the output file ”hello” is created.

$ls
hello.c
$mpicc -o hello hello.c
$ls
hello.c hello


Copy

In order for program to run on each node, the executable must exist on each node.

There are as many ways to make sure that executable exists on all of the nodes as

there are ways to put the cluster together in the first place. One method is covered

below.

This method will assume that there exists a directory (/home/beowulf/testing)

on all the nodes, and authentication is being done via ssh, and that public keys

have been shared for the account to allow for login and remote execution without

a password.

One command that can be used to copy files between machines is ”scp”; scp is

a unix command that will securely copy files between remote machines, and in its

simplest use acts as a secure remote copy. It takes similar arguments to the unix

”cp” command.

Now, with the example saved in the directory /home/beowulf/testing (i.e. the file is saved as /home/beowulf/testing/hello), the following command will copy the file hello to

a remote node.

$scp hello root@node1:/home/beowulf/testing

This will need to be done for each host. To check whether copy is working

properly or not, ssh into each host, and check to see that the files are there using

the ”ls” command.

Execute

Once the code has been compiled and copied to all of the nodes, run it using the mpiexec command. Two of the more common arguments to the mpiexec command are the "-np" (or "-n") argument, which specifies how many processes to use, and the "-f" argument, which specifies exactly which nodes are available for use. An entry for the hosts file has already been made in .bashrc in the home directory, so there is no need to use this argument.

Change directory to the file where executable is located, and run hello com-

mand using 4 processes:

$mpiexec -n 4 ./hello

Hello MPI: processor 0 of 4 on master

Hello MPI: processor 3 of 4 on node3

Hello MPI: processor 2 of 4 on node2

Hello MPI: processor 1 of 4 on node1


3.4.2 HYDRA: Process Manager

Hydra is a process management system for starting parallel jobs. Hydra is de-

signed to natively work with multiple daemons such as ssh, rsh, pbs, slurm and

sge. Starting with MPICH2-1.3, Hydra is the default process manager, which is auto-

matically used with mpiexec.

As there is a bug with hydra-1.4 which comes with mpich2-1.4.1p1, hydra-

1.5b1 has been installed separately. Once built, the new Hydra executables are

in mpich2/bin, or the bin subdirectory of the install directory if an install has been done. Put this (bin) directory in the PATH in .bashrc for convenience:

Put in .bashrc: export PATH=/opt/mpich2-1.4.1p1/bin/bin:$PATH

HYDRA_HOST_FILE: This variable points to the default host file to use when the "-f" option is not provided to mpiexec. For bash:

export HYDRA_HOST_FILE=<path to host file>/hosts

3.4.3 TORQUE: Resource Manager

The TORQUE Resource Manager is an advanced, distributed resource manager providing control over batch jobs and distributed compute nodes. Its name stands for Terascale Open-Source Resource and QUEue

Manager. Cluster Resources, Inc. describes it as open-source and Debian classifies

it as non-free owing to issues with the license. It is a community effort based on

the original PBS project and, with more than 1,200 patches, has incorporated sig-

nificant advances in the areas of scalability, fault tolerance, and features extensions

contributed by NCSA, OSC, USC, the US DOE, Sandia, PNNL, UB, TeraGrid,

and many other leading-edge HPC organizations. TORQUE can integrate with

the non-commercial Maui Cluster Scheduler or the commercial Moab Workload

Manager to improve overall utilization, scheduling and administration on a clus-

ter. TORQUE is described by its developers as open-source software, using the

OpenPBS version 2.3 license and as non-free software in the Debian Free Software

Guidelines.

Feature Set

TORQUE provides enhancements over standard OpenPBS in the following areas:


Fault Tolerance

• Additional failure conditions checked/handled

• Node health check script support

Scheduling Interface

• Extended query interface providing the scheduler with additional and more

accurate information

• Extended control interface allowing the scheduler increased control over job

behavior and attributes

• Allows the collection of statistics for completed jobs

Scalability

• Significantly improved server to MOM (Machine Oriented Mini-server) communication model

• Ability to handle larger clusters (over 15 TF/2,500 processors)

• Ability to handle larger jobs (over 2000 processors)

• Ability to support larger server messages

Usability

• Extensive logging additions

• More human readable logging (i.e. no more ’error 15038 on command 42’)

3.4.4 MAUI: Cluster Scheduler

Maui Cluster Scheduler is an open-source job scheduler for use on clusters and supercomputers, initially developed by Cluster Resources, Inc. Maui is capable

of supporting multiple scheduling policies, dynamic priorities, reservations, and

fairshare capabilities. Maui satisfies some definitions of open-source software and

is not available for commercial usage. It improves the manageability and effi-

ciency of machines ranging from clusters of a few processors to multi-teraflops

supercomputers.


Job State

Jobs in Maui can be in one of three major states:

Running

A job that has been allotted its required resources and has started its computation is considered running until it finishes.

Queued (idle)

Jobs that are eligible to run. The priority is calculated here and the jobs are sorted

according to the calculated priority. Advance reservations are made starting with the job at the front of the queue.

Non-queued

Jobs that, for some reason, are not allowed to start. Jobs in this state do not gain any queue-time priority.

There is a limit on the number of jobs a group/user can have in the Queued state. This prohibits users from acquiring more queue-time priority than deserved by submitting a large number of jobs.

3.5 System Considerations

The following sections discuss system considerations and requirements:

Design/development Debug

There are a number of critical tools necessary for the implementation of a success-

ful HPCC cluster solution. The first is a compiler which can take advantage of the

architectural features of the processor. Next, a debugger such as gdb allows the

developer to debug the code and assists in finding the problem areas or sections

of code to be further tuned for performance. A profiler is also necessary to assist

in finding the performance bottlenecks in the overall system including the system

interconnect.

Job Control

Once an application has been developed or ported to a Beowulf cluster, the ap-

plication must be started and run on a portion of the cluster or on the entire cluster. It is important to understand the particular needs and requirements for system partitioning, how jobs are started and run, and how a queue of jobs can be set up to run automatically.

Checkpoint Restart

Many applications running on even very large HPC clusters will require many

hours, days, or weeks of execution time to run to completion. A failure in one

part of the system could corrupt a job execution run, forcing a restart. The solu-

tion is to periodically checkpoint the current state, writing the intermediate data

calculations available at the end of the interval to a disk subsystem. This usually

takes a small amount of time to write out the data with the compute functions

temporarily paused, the time dependent on the storage architecture. If there is a

system failure of one of the computing components, then the failing component

can be taken out of the cluster and the job restarted with the data available from

the previous period's checkpoint save.
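As an illustrative sketch only (the interval, file name and state structure are invented for the example), such periodic checkpointing can be as simple as writing the solver state to disk every fixed number of iterations:

#include <stdio.h>

#define CHECKPOINT_INTERVAL 100   /* hypothetical: write state every 100 steps */

/* hypothetical solver state, for illustration only */
struct state { int step; double data[1024]; };

static void checkpoint(const struct state *s)
{
    FILE *fp = fopen("checkpoint.dat", "wb");   /* hypothetical file name */
    if (fp) {
        fwrite(s, sizeof *s, 1, fp);
        fclose(fp);
    }
}

int main(void)
{
    struct state s = { 0, { 0.0 } };

    for (s.step = 0; s.step < 10000; s.step++) {
        /* ...one step of the real computation would go here... */

        if (s.step % CHECKPOINT_INTERVAL == 0)
            checkpoint(&s);   /* on failure, restart from the last saved state */
    }
    return 0;
}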

Performance Monitoring

Even if a considerable amount of time is spent during the debug phase to tune

the application for best performance, a performance monitoring function is still

necessary to watch the cluster performance over time. With potentially multiple

job streams running concurrently on the system, each taking differing amounts of

CPU or memory, there may be situations where the applications are not running

at the expected efficiency. The performance-monitoring tool can assist in detect-

ing these situations.

Benchmarking

An excellent collection of benchmarks is the HPCC Benchmarking Suite. It con-

sists of seven well-known public domain benchmarks. The latest version allows one to compare network performance with raw TCP, PVM, MPICH and LAM/MPI

among others. It is also worthwhile to use the latest version of HPL (High Per-

formance Linpack) benchmark. For parallel benchmarks, the above mentioned

benchmarks are a reasonable test (especially if running numerical computations

on the cluster). The above and other benchmarks are necessary to evaluate different architectures, motherboards and network cards.


Chapter 4

Experiments

To evaluate the usage, acceptability and performance of the cluster, a few parallel programs are implemented. The first one finds the prime numbers in a given range. The second calculates the value of π. Then one embarrassingly parallel program to solve the circuit satisfiability problem is tested. The 1D time dependent heat equation and the radix-2 FFT algorithm are implemented as real-life programs.

Two standard benchmarking experiments, which are also used to measure the performance of the Top500 supercomputers, were conducted as well. The first of them is the High Performance Linpack benchmark and the other one is HPCC, a complete suite of seven tests covering many performance factors.

The work of a global problem can be divided into a number of independent

tasks, which rarely need to synchronize. Monte Carlo simulations or numerical

integration are examples of this. So in the examples below, the code that can be parallelized is identified and then executed simultaneously on different cluster nodes with different data. If the parallelizable code does not depend on the output of other nodes, better performance is obtained. The essence is to divide the entire computation evenly among the collaborating processors: divide and conquer.

4.1 Finding Prime Numbers

This C program counts the number of primes between 1 and N, using MPI to carry

out the calculation in parallel. The algorithm is completely naive. For each integer

I, it simply checks whether any smaller J evenly divides it. The total amount of

work for a given N is thus roughly proportional to N^2/2. Figure 4.1 shows the

performance of cluster for finding various primes as compared to single machine.

This program is mainly a starting point for investigations into parallelization.


Figure 4.1: Graph showing performance for Finding Primes

Here the total range of numbers for which we want to find the primes is divided into equal parts and then distributed amongst the computing nodes. Every node has to carry out its task and send back its result to the master node. Finally it is the job of the master node to combine the results of all the nodes and give the final result.
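A minimal sketch of such a decomposition (not necessarily the exact program measured above; here the candidate integers are dealt out cyclically and the partial counts are combined on the master with MPI_Reduce) is:

#include <stdio.h>
#include <mpi.h>

/* naive primality test: check all smaller divisors */
static int is_prime(int n)
{
    if (n < 2) return 0;
    for (int j = 2; j < n; j++)
        if (n % j == 0) return 0;
    return 1;
}

int main(int argc, char **argv)
{
    int rank, size, n = 50000, local = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank tests every size-th candidate, starting at a different offset */
    for (int i = 2 + rank; i <= n; i += size)
        local += is_prime(i);

    /* the master (rank 0) combines the partial counts */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Primes between 1 and %d: %d\n", n, total);

    MPI_Finalize();
    return 0;
}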

4.2 PI Calculation

The number π is a mathematical constant that is the ratio of a circle’s circumfer-

ence to its diameter. The constant, sometimes written pi, is approximately equal

to 3.14159. The program calculates the value of π using

\[ \int_0^1 \frac{4}{1+x^2}\,dx = \pi . \]

The calculated value of π is then compared with the true value to find the accuracy of the output, and the time taken by the program to calculate it is also displayed. Figure 4.2 shows the time taken by different numbers of PCs to calculate π.

To parallelize the code, identify the part(s) of the sequential algorithm that can be executed in parallel (this is the difficult part), then distribute the global work and data among the cluster nodes. Here the iterations over the N rectangles of the numerical integration can be run in parallel, as sketched below.
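A minimal sketch of this parallel integration, in the spirit of the classic MPI cpi example (not necessarily the exact code measured above), is:

#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const double PI25DT = 3.141592653589793238462643;
    int rank, size, n = 1000000;          /* number of rectangles */
    double h, local = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    h = 1.0 / (double)n;
    /* midpoint rule: each rank sums every size-th rectangle of 4/(1+x^2) */
    for (int i = rank + 1; i <= n; i += size) {
        double x = h * ((double)i - 0.5);
        local += 4.0 / (1.0 + x * x);
    }
    local *= h;

    /* combine the partial sums on the master and report the error */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.16f, error %.16f\n",
               pi, fabs(pi - PI25DT));

    MPI_Finalize();
    return 0;
}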


Figure 4.2: Graph showing performance for Calculating π

4.3 Circuit Satisfiability Problem

CSAT is a C program which demonstrates, for a particular circuit, an exhaustive

search for solutions of the circuit satisfiability problem. This version of the program

uses MPI to carry out the solution in parallel. This problem assumes that a logical

circuit of AND, OR and NOT gates is given, with N binary inputs and a single

output. Determine all inputs which produce a 1 as the output.

The general problem is NP complete, so there is no known polynomial-time

algorithm to solve the general case. The natural way to search for solutions then

is exhaustive search. In an interesting way, this is a very extreme and discrete

version of the problem of maximizing a scalar function of multiple variables. The

difference is that here it is known that both the input and output only have the

values 0 and 1, rather than a continuous range of real values!

This problem was a natural candidate for parallel computation, since the in-

dividual evaluations of the circuit are completely independent. So the complete

problem domain is divided into equal parts and then respective nodes will perform

there work to get the final results
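A minimal sketch of this exhaustive search is given below. The three-input circuit is a made-up example rather than the circuit used in the benchmark, and each process tests a cyclic share of the 2^N input vectors.

#include <stdio.h>
#include <mpi.h>

#define N_INPUTS 3

/* Hypothetical example circuit: (x0 OR x1) AND (NOT x1 OR x2) AND x2 */
static int circuit(const int x[N_INPUTS])
{
    return (x[0] || x[1]) && (!x[1] || x[2]) && x[2];
}

int main(int argc, char *argv[])
{
    int rank, size, local_hits = 0, total_hits = 0;
    long n_vectors = 1L << N_INPUTS;      /* 2^N possible input vectors */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process evaluates every size-th input vector. */
    for (long v = rank; v < n_vectors; v += size) {
        int x[N_INPUTS];
        for (int b = 0; b < N_INPUTS; b++)
            x[b] = (int)((v >> b) & 1);   /* decode bit b of vector v  */
        if (circuit(x)) {
            printf("rank %d: input %ld satisfies the circuit\n", rank, v);
            local_hits++;
        }
    }

    MPI_Reduce(&local_hits, &total_hits, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d satisfying inputs found\n", total_hits);

    MPI_Finalize();
    return 0;
}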


Figure 4.3: Graph showing performance for solving C-SAT Problem

4.4 1D Time Dependent Heat Equation

The heat equation is an important partial differential equation which describes the distribution of heat (or variation in temperature) in a given region over time. This program solves

∂u/∂t − k ∂²u/∂x² = f(x, t)

over the interval [A, B] with boundary conditions

u(A, t) = uA(t),   u(B, t) = uB(t),

over the time interval [t0, t1] with initial condition

u(x, t0) = u0(x)

4.4.1 The finite difference discretization

To apply the finite difference method, define a grid of points x(1) through x(n),

and a grid of times t(1) through t(m). In the simplest case, both grids are evenly

spaced. The approximate solution at spatial point x(i) and time t(j) is denoted

by u(i,j).


Figure 4.4: Graph showing performance for solving 1D Time Dependent Heat

Equation

A second order finite difference can be used to approximate the second deriva-

tive in space, using the solution at three points equally separated in space.

A forward Euler approximation to the first derivative in time is used, which

relates the value of the solution to its value at a short interval in the future.

Thus, at the spatial point x(i) and time t(j), the discretized differential equa-

tion defines a relationship between u(i-1,j), u(i,j), u(i+1,j) and the ”future” value

u(i,j+1). This relationship can be drawn symbolically as a four node stencil:

Figure 4.5: Symbolic relation between four nodes
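Written out explicitly, a standard forward-Euler update consistent with this description (the thesis itself does not give the formula) is

u(i,j+1) = u(i,j) + dt * [ k * ( u(i-1,j) - 2*u(i,j) + u(i+1,j) ) / dx² + f(x(i), t(j)) ]

where dt = t(j+1) - t(j) and dx = x(i+1) - x(i) are the (uniform) time and space steps.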

Since the value of the solution at the initial time is given, use the stencil, plus

the boundary condition information, to advance the solution to the next time step.

Repeating this operation gives us an approximation to the solution at every point

in the space-time grid.


4.4.2 Using MPI to compute the solution

To solve the 1D heat equation using MPI, a form of domain decomposition is used. Given P processors, divide the interval [A, B] into P equal subintervals. Each processor can set up the stencil equations that define the solution almost independently. The exception is that every processor needs to receive a copy of the solution values determined for the nodes immediately to its left and right.

Thus, each processor uses MPI to send its leftmost solution value to its left neighbour, and its rightmost solution value to its right neighbour. Of course, each processor must then also receive the corresponding information that its neighbours send to it. (The first and last processors have only one neighbour, and use the boundary condition information to determine the behaviour of the solution at the node which is not adjacent to another processor's node.)

The naive way of setting up this information exchange works, but can be inefficient: each processor sends a message and then waits for confirmation of receipt, which cannot happen until some processor has moved on to the receive stage, which in turn only happens because the first or last processor does not have to receive information on a given step. A safer way to pair the sends and receives is sketched below.
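The following is a minimal sketch of the neighbour (ghost-cell) exchange described above. It assumes each process stores its subinterval in u[1..local_n] with ghost cells u[0] and u[local_n+1]; MPI_Sendrecv pairs each send with the matching receive, so no process is left waiting on an unposted receive.

#include <mpi.h>

void exchange_ghost_cells(double *u, int local_n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* MPI_PROC_NULL turns the exchange into a no-op at the ends of the
       domain, where the boundary conditions are used instead. */
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send leftmost interior value left, receive the right ghost cell. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Send rightmost interior value right, receive the left ghost cell. */
    MPI_Sendrecv(&u[local_n],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}

Non-blocking MPI_Isend/MPI_Irecv calls would achieve the same effect; the point is simply to avoid the ordering dependence of the naive send-then-wait scheme.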

4.5 Fast Fourier Transform

To make the DFT operation more practical, several FFT algorithms have been proposed. The fundamental approach in all of them is to make use of the properties of the DFT operation itself; all of them reduce the computational cost of performing the DFT on the given input sequence. The DFT is built from powers of

W_N^(kn) = e^(-j2πkn/N)

This value W_N is referred to as the twiddle factor or phase factor. Being a trigonometric function evaluated at discrete points around the four quadrants of the two-dimensional plane, the twiddle factor has symmetry and periodicity properties:

Symmetry property:  W_N^(k+N/2) = -W_N^(k)

Periodicity property:  W_N^(k+N) = W_N^(k)


Figure 4.6: Graph showing performance of the Radix-2 FFT algorithm

Using these properties of the twiddle factor, unnecessary computations can be eliminated. Another approach that can be used is divide-and-conquer. In this approach, the given one-dimensional input sequence of length N can be represented in a two-dimensional form with M rows and L columns, with N = M x L. It can be shown that a DFT performed on such a representation requires fewer computations: N(M + L + 1) complex multiplications and N(M + L - 2) complex additions. Note that this approach is applicable only when the value of N is composite.

4.5.1 Radix-2 FFT algorithm

This algorithm is a special case of the approaches described earlier, in which N can be represented as a power of 2, i.e., N = 2^v. This means that the number of complex additions and multiplications is reduced to N(N + 6)/2 and N²/2 respectively just by using the divide-and-conquer approach. When the symmetry and periodicity properties of the twiddle factor are also used, it can be shown that the number of complex additions and multiplications can be reduced to N log₂N and (N/2) log₂N respectively. Hence, from an O(N²) algorithm, the computational complexity is reduced to O(N log N). The entire process is divided into log₂N stages, and in each stage N/2 two-point DFTs are performed. The computation involving each pair of data is called a butterfly. The Radix-2 algorithm can be implemented in decimation-in-time (M = N/2 and L = 2) or decimation-in-frequency (M = 2 and L = N/2) form.


Figure 4.7: 8-point Radix-2 FFT: Decimation in frequency form

Figure 4.7 gives the decimation-in-frequency form of the Radix-2 algorithm for an input sequence of length N = 8. A serial sketch of the butterfly structure is given below.
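The following is a small serial sketch of a radix-2 decimation-in-time FFT, included only to illustrate the butterfly and twiddle-factor structure discussed above; the benchmarked program is an MPI-parallel version, which is not reproduced here.

#include <stdio.h>
#include <complex.h>
#include <math.h>

/* Recursive radix-2 decimation-in-time FFT; n must be a power of 2. */
static void fft(double complex *x, int n)
{
    if (n < 2)
        return;

    double complex even[n / 2], odd[n / 2];
    for (int i = 0; i < n / 2; i++) {      /* decimate in time          */
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    fft(even, n / 2);
    fft(odd,  n / 2);

    for (int k = 0; k < n / 2; k++) {      /* combine with butterflies  */
        double complex w = cexp(-2.0 * I * acos(-1.0) * k / n);  /* twiddle W_n^k */
        x[k]         = even[k] + w * odd[k];
        x[k + n / 2] = even[k] - w * odd[k];
    }
}

int main(void)
{
    double complex x[8] = { 1, 1, 1, 1, 0, 0, 0, 0 };   /* sample input */
    fft(x, 8);
    for (int k = 0; k < 8; k++)
        printf("X[%d] = %7.4f %+7.4fj\n", k, creal(x[k]), cimag(x[k]));
    return 0;
}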

4.6 Theoretical Peak Performance

The theoretical peak is based not on an actual performance from a benchmark run,

but on a paper computation to determine the theoretical peak rate of execution of

floating point operations for the machine. This is the number manufacturers often

cite; it represents an upper bound on performance. That is, the manufacturer

guarantees that programs will not exceed this rate for a given computer.

To calculate the theoretical peak performance of the HPC system, first calculate the theoretical peak performance of one node (server) in GFlops and then multiply the node performance by the number of nodes in the HPC system. The following formula is used for the node theoretical peak performance:

Node performance in GFlops = (CPU speed in GHz) x (number of CPU cores) x (CPU instructions per cycle) x (number of CPUs per node)

For the cluster, with CPUs based on the Intel i7-2600 (3.40 GHz, 4 cores):

3.40 x 4 x 4 x 1 = 54.4 GFlops

CPU speed in GHz: 3.40

No. of cores per CPU: 4


No. of instructions per cycle: 4

No. of CPUs per node: 1

Theoretical peak performance of the four-PC cluster: 54.4 GFlops x 4 nodes = 217.6 GFlops

4.7 Benchmarking

It is generally a good idea to verify that the newly built cluster can actually do work. This can be accomplished by running a few industry-accepted benchmarks. The purpose of benchmarking is not simply to obtain the best possible numbers, but to obtain consistent, repeatable, accurate results that also reflect the best the system can realistically deliver.

4.8 HPL

HPL (High Performance Linpack) is a software package that solves a (random) dense linear system of equations in double precision (64-bit) arithmetic on distributed-memory computers. The performance measured using this program on several computers forms the basis for the Top 500 supercomputer list. Using ATLAS (Automatically Tuned Linear Algebra Software) as the BLAS library, it gives 28.67 GFlops for the 4-node cluster.

4.8.1 HPL Tuning

After having built the executable /root/hpl-2.0/bin/Linux_PII_CBLAS/xhpl, one may want to modify the input data file HPL.dat. This file should reside in the same directory as the executable. An example HPL.dat file is provided by default. This file contains information about the problem sizes, machine configuration, and algorithm features to be used by the executable. It is 31 lines long. All the selected parameters are printed in the output generated by the executable.

There are many ways to tackle tuning, for example:

1. Fixed Processor Grid, Fixed Block size and Varying Problem size N.

2. Fixed Processor Grid, Fixed Problem size and Varying Block size.

3. Fixed Problem size, Fixed Block size and Varying the Processor grid.

4. Fixed Problem size, Varying the Block size and Varying the Processor grid.


5. Fixed Block size, Varying the Problem size and Varying the Processor grid.

HPL.dat file for cluster

HPLinpack benchmark input file

Innovative Computing Laboratory, University of Tennessee

HPL.out output file name (if any)

8 device out (6=stdout,7=stderr,file)

1 # of problems sizes (N)

41328 Ns

1 # of NBs

168 NBs

0 PMAP process mapping (0=Row-,1=Column-major)

1 # of process grids (P x Q)

4 Ps

4 Qs

16.0 threshold

1 # of panel fact

2 PFACTs (0=left, 1=Crout, 2=Right)

1 # of recursive stopping criterium

4 NBMINs (>= 1)

1 # of panels in recursion

2 NDIVs

1 # of recursive panel fact.

1 RFACTs (0=left, 1=Crout, 2=Right)

1 # of broadcast

1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

1 # of lookahead depth

1 DEPTHs (>= 0)

2 SWAP (0=bin-exch,1=long,2=mix)

64 swapping threshold

0 L1 in (0=transposed,1=no-transposed) form

0 U in (0=transposed,1=no-transposed) form

1 Equilibration (0=no,1=yes)

8 memory alignment in double (> 0)


4.8.2 Run HPL on cluster

At this point all that remains is to add some software that can run on the cluster, and there is nothing better than HPL (Linpack), which is widely used to measure cluster efficiency (the ratio between actual and theoretical performance). Do the following steps on all nodes:

Copy the file Make.Linux_PII_CBLAS from $(HOME)/hpl-2.0/setup/ to $(HOME)/hpl-2.0/

Edit the Make.Linux_PII_CBLAS file:

# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir = $(HOME)/hpl-2.0
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
#
HPLlib = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
#MPdir = /usr/lib64/mpich2
#MPinc = -I$(MPdir)/include
#MPlib = $(MPdir)/lib/libmpich.a
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir = /usr/lib/atlas
LAinc =
LAlib = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC = /opt/mpich2-1.4.1p1/bin/mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
#
# On some platforms, it is necessary to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER = /opt/mpich2-1.4.1p1/bin/mpicc
LINKFLAGS = $(CCFLAGS)
#
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
#
# ----------------------------------------------------------------------

After configuring the above file Make.Linux_PII_CBLAS, run from $(HOME)/hpl-2.0:

$ make arch=Linux_PII_CBLAS

Now run Linpack on a single node:

$ cd bin/Linux_PII_CBLAS
$ mpiexec -n 4 ./xhpl

Repeat these steps on all nodes; Linpack can then be run on all nodes like this (from the directory $(HOME)/hpl-2.0/bin/Linux_PII_CBLAS/):

$ mpiexec -n x ./xhpl

where x is the number of cores in the cluster.

4.8.3 HPL results

The first thing to note is that the HPL.dat file that is available post-install is simply useless for extracting any kind of meaningful performance numbers, so the file needs to be edited. The first test used the default configuration; the HPL.dat file was then tuned and the test repeated. For a single PC, HPL gave a performance of 11.25 GFlops. The highest value given by the cluster of four machines is 28.67 GFlops, which means there is an absolute performance gain for the cluster over a single machine.

Figure 4.8: Graph showing High Performance Linpack (HPL) Results

It is interesting to note that the maximum performance (28.67 GFlops) was achieved for a problem size of 30000 and a block size of 168, although, to be fair, the difference between a block size of 168 and one of 128 is small. Also interesting is how much the data varies for different problem sizes; the PCs in the cluster do not have a separate network, and thus performance is unlikely to ever be constant.

The efficiency of the cluster is about 13%, which is appalling, but given the various limitations of the system it is perhaps not that surprising.
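For reference, the efficiency figure follows directly from the numbers already reported (Rmax from the HPL run, Rpeak from Section 4.6):

Efficiency = Rmax / Rpeak = 28.67 GFlops / 217.6 GFlops ≈ 0.13, i.e. about 13%.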

4.9 Run HPCC on cluster

The HPC Challenge (HPCC) benchmark set of tests is primarily High Performance Linpack along with some additional bells-and-whistles tests. The nice thing is that experience in running HPL can be directly leveraged in running HPCC, and vice versa.

Instead of a binary named xhpl, compiling the HPC Challenge benchmark generates a binary named hpcc. This binary runs the whole series of tests. First download hpcc-1.4.1.tar.gz and save it into the /root directory. The following


is a set of commands to get going with HPCC:

#cd /root
#tar xzvf hpcc-1.4.1.tar.gz
#cd hpcc-1.4.1
#cd hpl
#cp setup/Make.Linux_PII_CBLAS .
#vi Make.Linux_PII_CBLAS

Apply the same changes to this file as in the HPL section above, except that TOPdir = ../../.. Next, build the HPC Challenge benchmark, configure hpccinf.txt (which can be derived from the previous settings in HPL.dat), and then invoke the tool. After modifying Make.Linux_PII_CBLAS:

#cd /root/hpcc-1.4.1
#make arch=Linux_PII_CBLAS

Copy _hpccinf.txt, found in the hpcc-1.4.1 directory, to hpccinf.txt. Then make the following changes to lines 33-36 to control the problem sizes and blocking factors for PTRANS.

Change lines 33-34 (number of PTRANS problem sizes and the sizes) to:

4 Number of additional problem sizes for PTRANS
1000 2500 5000 10000 values of N

Change lines 35-36 (number of block sizes and the sizes) to:

2 Number of additional blocking sizes for PTRANS
64 128 values of NB

Now run it:

#cd /root/hpcc-1.4.1
#mpiexec -np <numprocs> ./hpcc

The results will be in hpccoutf.txt.

4.9.1 HPCC Results

Finally, a few HPCC benchmark runs were carried out. As with the Linpack benchmark, the HPCC benchmark was also compiled with ATLAS.

Generally speaking, the cluster continues to perform better than a single PC, but clearly some of the benchmarks are hardly affected at all.

It is worth bearing in mind that this four-node cluster does not have its own separate network switch, and thus results will vary more than in a cluster with dedicated networking. Table 4.1 shows some important results from various tests of the HPCC benchmark suite.


Test One Processor Cluster

HPL Tflops 0.00072716 0.0283605

StarDGEMM Gflops 4.83506 4.77583

SingleDGEMM Gflops 4.79438 4.92708

PTRANS GBs 0.0425573 0.0409784

MPIRandomAccess LCG GUPs 0.00707042 0.00663434

MPIRandomAccess GUPs 0.00706074 0.00660636

StarRandomAccess LCG GUPs 0.176132 0.0170042

SingleRandomAccess LCG GUPs 0.171557 0.0344993

StarRandomAccess GUPs 0.24612 0.0183594

SingleRandomAccess GUPs 0.241174 0.0448233

StarSTREAM Copy 27.0668 2.92135

StarSTREAM Scale 25.3788 2.91262

StarSTREAM Add 27.1221 3.23188

StarSTREAM Triad 25.4848 3.40194

SingleSTREAM Copy 26.7578 10.9827

SingleSTREAM Scale 24.7451 10.9912

SingleSTREAM Add 26.3792 12.6537

SingleSTREAM Triad 24.7451 12.7064

StarFFT Gflops 2.14797 1.33174

SingleFFT Gflops 2.10237 1.85049

MPIFFT N 65536 134217728

MPIFFT Gflops 0.0587352 0.107084

MaxPingPongLatency usec 340.059 344.502

RandomlyOrderedRingLatency usec 167.139 154.586

MinPingPongBandwidth GBytes 0.0116524 0.0116511

NaturallyOrderedRingBandwidth GBytes 0.0104104 0.00243253

RandomlyOrderedRingBandwidth GBytes 0.00981377 0.00228357

MinPingPongLatency usec 322.998 0.203097

AvgPingPongLatency usec 334.724 267.98

MaxPingPongBandwidth GBytes 0.0116628 0.0116638

AvgPingPongBandwidth GBytes 0.0116578 0.0116587

NaturallyOrderedRingLatency usec 126.505 130.391

Table 4.1: HPCC Results on Single PC and Cluster


Chapter 5

Results and Applications

5.1 Discussion on Results

Clusters effectively reduce the overall computational time, demonstrating excellent performance improvement in terms of Flops. Finally, performance on clusters may be limited by interconnect speed: the choice of which interconnect to use depends on whether inter-server communication will be a bottleneck in the mix of jobs to be run.

5.1.1 Observations about Small Tasks

1. Jobs with very small numbers as input are bound by communication time.

2. Since the sequential runtime is so small, the time to send to and receive from the head node makes the program take longer with more nodes; adding processors slows down the program's runtime.

3. Parallel execution of such computations is impractical.

4. Some speedup may be observed on a small cluster, but it does not scale well at all.

5. One is better off with a single processor than with even a moderately large cluster.

5.1.2 Observations about Larger Tasks

1. Jobs with larger numbers as input are bound by sequential computation time

for a small number of processors, but eventually adding processors causes

communication time to take over.


2. Sequential runtime with large numbers is much larger, so the problem scales much better than with small numbers as input.

3. Inter-node communication has a much larger effect on runtime than intra-node communication.

4. With infinitely large numbers, communication times would be negligible.

5. Unlike jobs requiring very little sequential computation and a lot of communication, these jobs achieved speedup with large numbers of processors.

Due to the various overheads discussed throughout, and because certain parts of a sequential algorithm cannot be parallelized, we may not achieve an optimal parallelization. In such cases there is no performance gain; indeed, in some cases performance is degraded by communication and synchronization overhead. Amdahl's law, given below, makes this limit explicit.
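Amdahl's law is not stated in the thesis, but it is the standard way to quantify this limit: if a fraction f of a program can be parallelized and the remaining (1 - f) is inherently serial, the speedup on P processors is bounded by

S(P) = 1 / ((1 - f) + f/P)

so even as P grows without bound the speedup cannot exceed 1/(1 - f); communication and synchronization overhead only push the achieved speedup further below this bound.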

5.2 Factors affecting Cluster performance

As per the result analysis of the various tests and benchmarks, the following are a few of the most important factors which affect the performance of the cluster. Parameters having a significant effect on Linpack are:

1. Problem size

2. Block size

3. Topology

Tightly coupled MPI applications are:

1. Very sensitive to network performance characteristics such as inter-nodal communication delay and the OS network stack.

2. Very sensitive to mismatched node performance: random OS activities can add millisecond delays to communications whose latencies are normally measured in microseconds.

5.3 Benefits

1. Cost-effective: Built from relatively inexpensive commodity components

that are widely available.


2. Keeps pace with technologies: Use mass-market components. Easy to em-

ploy the latest technologies to maintain the cluster.

3. Flexible configuration: Users can tailor a configuration that is feasible to

them and allocate the budget wisely to meet the performance requirements

of their applications.

4. Scalability: Can be easily scaled up by adding more compute nodes.

5. Usability: The system can be used by specified users to achieve specified

goals with effectiveness, efficiency, and satisfaction in a specified context of

use.

6. Manageability: Group of systems can be managed as a single system or

single database, without having to sign on to individual systems. Even a

cluster administrative domain can be used to more easily manage resources

that are shared within a cluster.

7. Reliability: The system, including all hardware, firmware, and software, will

satisfactorily perform the task for which it was designed or intended, for a

specified time and in a specified environment.

8. High availability: Each compute node is an individual machine. The failure

of a compute node will not affect other nodes or the availability of the entire

cluster.

9. Compatibility and Portability: A parallel application using MPI can be

easily ported from expensive parallel computers to a Beowulf cluster.

5.4 Challenges of parallel computing

Parallel programming is not constrained to just the problem of selecting whether to code using threads, message passing or some other tool. In general, anybody working in the field of parallelization must consider the overall picture, which contains a plethora of issues such as:

1. Understanding the hardware: An understanding of the parallel computer

architecture is necessary for efficient mapping and distribution of computa-

tional tasks. A simplified classification of parallel architectures is UMA/NUMA

and distributed systems. A typical application may have to run on a com-

bination of these architectures.


2. Mapping and distribution on to the hardware: Mapping and distribution of

both computational tasks on processors and of data onto memory elements

must be considered. The whole application must be divided into compo-

nents and subcomponents and then these components and subcomponents

distributed on the hardware. The distribution may be static or dynamic.

3. Parallel Overhead: Parallel overhead refers to the amount of time required

to coordinate parallel tasks as opposed to doing useful work. Typical par-

allel overhead includes the time to start/terminate a task, the time to pass

messages between tasks, synchronization time, and other extra computation

time. When parallelizing a serial application, overhead is inevitable. De-

velopers have to estimate the potential cost and try to avoid unnecessary

overhead caused by inefficient design or operations.

4. Synchronization: Synchronization is necessary in multi-threading programs

to prevent race conditions. Synchronization limits parallel efficiency even

more than parallel overhead in that it serializes parts of the program. Im-

proper synchronization methods may cause incorrect results from the pro-

gram. Developers are responsible for pinpointing the shared resources that

may cause race conditions in a multi-threaded program, and they are re-

sponsible also for adopting proper synchronization structures and methods

to make sure resources are accessed in the correct order without inflicting

too much of a performance penalty.

5. Load Balance: Load balance is important in a threaded application because

poor load balance causes under utilization of processors. After one task fin-

ishes its job on a processor, the processor is idle until new tasks are assigned

to it. In order to achieve the optimal performance result, developers need to

find out where the imbalance of the work load lies between different threads

running on the processors and fix this imbalance by spreading out the work

more evenly for each thread.

6. Granularity: For a task that can be divided and performed concurrently by

several subtasks, it is usually more efficient to introduce threads to perform

some subtasks. However, there is always a tipping point where performance

cannot be improved by dividing a task into smaller-sized tasks (or introduc-

ing more threads). The reasons for this are 1) multi-threading causes extra

overhead; 2) the degree of concurrency is limited by the number of proces-

sors; and 3) for most of the time, one subtask’s execution is dependent on


another’s completion. That is why developers have to decide to what extent

they make their application parallel. The bottom line is that the amount of

work per each independent task should be sufficient to leverage the threading

cost.

5.5 Common applications of high-performance

computing clusters

Almost everyone needs fast processing power. With the increasing availability of

cheaper and faster computers, more people are interested in reaping the techno-

logical benefits. There is no upper boundary to the needs of computer processing

power; even with the rapid increase in power, the demand is considerably more

than what’s available.

1. Scheduling: Manufacturing; Transportation (from dairy delivery to military deployment); University classes; Airline scheduling.

2. Network Simulations: Power Utilities, Telecommunications providers simu-

lations.

3. Computational ElectroMagnetics: Antenna design; Stealth vehicles; Noise

in high frequency circuits; Mobile phones.

Figure 5.1: Application Perspective of Grand Challenges


4. Environmental Modelling-Earth/Ocean/Atmospheric Simulation: Weather

forecasting, climate simulation, oil reservoir simulation, waste repository

simulation

5. Simulation on Demand: Education, tourism, city planning, defense mission

planning, generalized flight simulator.

6. Graphics Rendering: Hollywood movies, Virtual reality.

7. Complex Systems Modelling and Integration: Defense (SIMNET, Flight Simulators), Education (SIMCITY), Multimedia/VR in entertainment, Multiuser virtual worlds, Chemical and Nuclear plant operation.

8. Financial and Economic Modelling: Real time optimisation, Mortgage backed

securities, Option pricing.

9. Image Processing: Medical instruments, EOS Mission to Planet Earth, De-

fense Surveillance, Computer Vision.

10. Healthcare and Insurance Fraud Detection: Inefficiency, Securities fraud,

Credit card fraud.

11. Market Segmentation Analysis: Marketing and sales planning. Sort and

classify records to determine customer preference by region (city and house).


Chapter 6

Conclusion and Future Work

6.1 Conclusion

The implemented HPCC system allows any research center to install and use a low-cost parallel programming environment, which may be administered on an easy-to-use basis even by staff unfamiliar with clusters. Such clusters allow evaluating the efficiency of any parallel code used to solve the computational problems faced by the scientific community. This type of parallel programming environment is expected to be the subject of great development effort within the coming years, since an increasing number of universities and research centers around the world include Beowulf clusters in their hardware. The main disadvantage of this type of environment could be the latency of the interconnections between the machines.

This HPCC can be used for research on object-oriented parallel languages, recursive matrix algorithms, network protocol optimization, graphical rendering, etc. It can also be used to create the college's own cloud and deploy cloud applications on it, which can be accessed from anywhere in the outside world with just a web browser. Computer Science and Information Technology students will receive extensive experience using such a cluster, and it is expected that several students and faculty will use it for their project and research work.

6.2 Future Work

As computer networks become cheaper and faster, a new computing paradigm,

called the Grid, has evolved. The Grid is a large system of computing resources

that performs tasks and provides to users a single point of access, commonly based

on the World Wide Web interface, to these distributed resources. Users can submit


thousands of jobs at a time without being concerned about where they run. The

Grid may scale from single systems to supercomputer-class compute farms that

utilise thousands of processors.

By providing scalable, secure, high-performance mechanisms for discovering

and negotiating access to remote resources, the Grid promises to make it possible

for colleges and universities in collaboration to share resources on an unprece-

dented scale, and for geographically distributed groups to work together in ways

that were previously impossible.

Additionally, the HPCC can be used to create cloud applications and give students real experience of this booming technology. Cloud computing works to the student's advantage when it comes to getting hands-on experience in managing environments. Before virtualization, it would

have been impossible for an individual student to practice managing their own

multiple-server environment. Even just three servers would have cost thousands

of dollars in years past. But now, with virtualization, it takes just a few minutes to

spin up three new VMs. If a college were to leverage virtualization in its classroom,

students could manage their own multi-server environment in the cloud with ease.

The student could control everything from creation of the VMs to their retirement,

giving them great experience in one of the hottest fields in IT.



Appendix A

PuTTy

PuTTY is a free and open source terminal emulator application which can act

as a client for the SSH, Telnet, rlogin, and raw TCP computing protocols and as

a serial console client. The name ”PuTTY” has no definitive meaning, though

”tty” is the name for a terminal in the Unix tradition, usually held to be short for

Teletype.

PuTTY was originally written for Microsoft Windows, but it has been ported

to various other operating systems. Official ports are available for some Unix-

like platforms, with work-in-progress ports to Classic Mac OS and Mac OS X, and

unofficial ports have been contributed to platforms such as Symbian and Windows

Mobile.

A.1 How to use PuTTY to connect to a remote

computer

1. First download and install PuTTY. Open PuTTY by double-clicking the PuTTY icon.

2. In the host name box, enter the server name or IP address on which the account is hosted (for example: 115.119.224.72), under protocol choose SSH, and then press Open.

3. It will then display a security alert dialogue box; do not be alarmed, simply press Yes when prompted.

4. It will prompt for the login name (username) and then the password. Enter the username, hit Enter, and then type the password (the password will not be visible; this is how Linux and Unix servers work). Then hit Enter. Also remember that passwords are case sensitive.

Figure A.1: Putty GUI

Figure A.2: Putty Security Alert

A.2 PSCP

PSCP, the PuTTY Secure Copy client, is a tool for transferring files securely between computers using an SSH connection. If the server supports SSH-2, PSFTP is generally preferable for interactive use; note, however, that PSFTP does not in general work with SSH-1 servers.


Figure A.3: Putty Remote Login Screen

A.2.1 Starting PSCP

PSCP is a command line application. This means that simply double-clicking its icon will not run it; instead, it must be run from a console window. With Windows 95, 98, and ME, this is called an MS-DOS Prompt; with Windows XP, Vista and Windows 7 it is called a Command Prompt. It should be available from the Programs section of the Start Menu.

To start PSCP, it needs either to be on the PATH or in the current directory. To add the directory containing PSCP to the PATH environment variable, type into the console window:

set PATH=C:\Program Files (x86)\PuTTY;%PATH%

This will only work for the lifetime of that particular console window. To set PATH more permanently, on Windows NT, 2000, and XP use the Environment tab of the System Control Panel; on Windows 95, 98, and ME, edit AUTOEXEC.BAT to include a set command like the one above.

A.2.2 PSCP Usage

To copy the local file c:\documents\foo.txt from the Windows machine to the folder /tmp on the Linux server example.com as user beowulf, type:

C:\Users\FOSS>pscp c:\documents\foo.txt beowulf@example.com:/tmp

To copy the file /root/hosts from the Linux server to e:\tmp on the Windows machine, type:

C:\Users\FOSS>pscp beowulf@example.com:/root/hosts e:\tmp
