High-Performance Grid Computing and Research Networking

Introduction to High Performance Computing

Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu


Page 1: High-Performance Grid Computing and Research Networking

High-Performance Grid Computing and Research Networking

Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu

Introduction to High Performance Computing

Page 2: High-Performance Grid Computing and Research Networking

Acknowledgements

The content of many of the slides in these lecture notes has been adapted from online resources prepared by the people listed below. Many thanks!

Henri Casanova Principles of High Performance Computing http://navet.ics.hawaii.edu/~casanova [email protected]

Ligang He http://www.dcs.warwick.ac.uk/~liganghe Email: [email protected]

Kai Wang Department of Computer Science University of South Dakota http://www.usd.edu/~Kai.Wang

Kyril Faenov Director of High Performance Computing Windows Server Group

Andrew Tanenbaum

Page 3: High-Performance Grid Computing and Research Networking

Agenda

HPC Introduction
HPC Applications
HPC Goals
Concurrency
History

Page 4: High-Performance Grid Computing and Research Networking

High Performance Computing

Difficult to define: it is a moving target.

In the 1980s, a "supercomputer" performed about 100 Mega FLOPS (FLOPS: FLoating-point Operations Per Second).

Today, a 2 GHz desktop/laptop performs a few Giga FLOPS, and a "supercomputer" performs tens of Tera FLOPS (Top500).

High Performance Computing: loosely, an order of 1000 times more powerful than the latest desktops.

Page 5: High-Performance Grid Computing and Research Networking

Units of Measure in HPC

High Performance Computing (HPC) units are:
Flops: floating-point operations
Flop/s: floating-point operations per second
Bytes: size of data (a double-precision floating-point number is 8 bytes)

Typical sizes are millions, billions, trillions, ...
Mega   Mflop/s = 10^6 flop/sec    Mbyte = 10^6 bytes (also 2^20 = 1,048,576)
Giga   Gflop/s = 10^9 flop/sec    Gbyte = 10^9 bytes (also 2^30 = 1,073,741,824)
Tera   Tflop/s = 10^12 flop/sec   Tbyte = 10^12 bytes (also 2^40 = 1,099,511,627,776)
Peta   Pflop/s = 10^15 flop/sec   Pbyte = 10^15 bytes (also 2^50 = 1,125,899,906,842,624)
Exa    Eflop/s = 10^18 flop/sec   Ebyte = 10^18 bytes
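A quick worked conversion (an illustrative example, not from the original slide): a machine sustaining 50 Tflop/s performs 50 × 10^12 = 5 × 10^13 floating-point operations each second, or about 3 × 10^15 per minute; a 1 Mflop/s machine would need roughly 3 × 10^9 seconds, about a century, to do that same minute of work.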

Page 6: High-Performance Grid Computing and Research Networking

Metric Units

[Table: the principal metric prefixes.]

Page 7: High-Performance Grid Computing and Research Networking

High Performance Computing

HPC: "The term high performance computing (HPC) refers to the use of (parallel) supercomputers and computer clusters, that is, computing systems comprised of multiple (usually mass-produced) processors linked together in a single system with commercially available interconnects." (Wikipedia)

"This is in contrast to mainframe computers, which are generally monolithic in nature." (Wikipedia)

Page 8: High-Performance Grid Computing and Research Networking

High Performance Computing

HPC: "The more current and evolving definition of HPC refers to High Productivity Computing, and reflects the purpose and use model of the myriad of existing and evolving architectures, and the supporting ecosystem of software, middleware, storage, networking and tools behind the next generation of applications." (Wikipedia)

Parallel Computing: computing on parallel computers

Supercomputing: computing on Top500 machines

Page 9: High-Performance Grid Computing and Research Networking

High Performance Computing

The definition that we use in this course: "How do we make computers compute bigger problems faster?"

Three main issues:
Hardware: How do we build faster computers?
Software: How do we write faster programs?
Hardware and Software: How do they interact?

Many perspectives, spanning theory to practice:
architecture
systems
programming
modeling and analysis
simulation
algorithms and complexity

Page 10: High-Performance Grid Computing and Research Networking

High Performance Computing

HPC Related Technologies: HPC is an all-encompassing term for related technologies that continually push computing boundaries.

1. Computer architecture: CPU, memory, VLSI
2. Compilers: identify inefficient implementations; exploit the characteristics of the computer architecture; choose a suitable compiler for a given architecture
3. Algorithms (for parallel and distributed systems): how to program on parallel and distributed systems
4. Middleware: from Grid computing technology; application -> middleware -> operating system; resource discovery and sharing

Page 11: High-Performance Grid Computing and Research Networking

High Performance Computing

The key technique for making computers compute "bigger problems faster" is to use multiple computers at once. Later in this lecture, we will learn why!

This is called parallelism:
"It takes 1000 hours for this program to run on one computer! Well, if I use 100 computers, maybe it will take only 10 hours?!"
"This computer can only handle a dataset that's 2 GB! So maybe if I use 100 computers, I can deal with a 200 GB dataset?!"

We will spend enough time to learn and experience different flavors of parallel computing, the first of which is sketched below:
shared-memory parallelism
distributed-memory parallelism
hybrid parallelism
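
To make the shared-memory flavor concrete, here is a minimal C sketch (not from the original slides) of a parallel array sum using OpenMP; the array size N and the use of a reduction are illustrative assumptions. Compile with, e.g., gcc -fopenmp sum.c.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 10000000  /* 10^7 elements, an illustrative size */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        for (long i = 0; i < N; i++) a[i] = 1.0;

        double sum = 0.0;
        /* each thread sums its own chunk; partial sums are combined at the end */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %g using up to %d threads\n", sum, omp_get_max_threads());
        free(a);
        return 0;
    }

A distributed-memory version would instead split the array across separate processes (e.g., with MPI) and combine the partial sums with explicit messages; hybrid codes do both.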

Page 12: High-Performance Grid Computing and Research Networking

Agenda

HPC Introduction
HPC Applications
HPC Goals
Concurrency
History

Page 13: High-Performance Grid Computing and Research Networking

Words of Wisdom

"Four or five computers should be enough for the entire world until the year 2000." T.J. Watson, Chairman of IBM, 1945.

"640KB [of memory] ought to be enough for anybody." Bill Gates, Chairman of Microsoft, 1981.

You may laugh at their vision today, but ... lesson learned: don't be too visionary; try to make things work! ;)

We now know this was not quite true!
Games
Digital video/images
Databases
Operating systems

But the first people to really need more computing oomph were scientists. And they go way back.

Page 14: High-Performance Grid Computing and Research Networking

Evolution of Science

Traditional scientific and engineering method:
1) Do theory or paper design
2) Perform experiments or build a system

Limitations:
Too difficult: build large wind tunnels
Too expensive: build a throw-away airplane
Too slow: wait for climate or galactic evolution
Too dangerous: weapons, drug design, climate experiments

Solution:
3) Use high performance computer systems to simulate the phenomenon

Page 15: High-Performance Grid Computing and Research Networking

Scientific Computing

Use of computers to solve/compute scientific models. For instance, many natural phenomena can be well approximated by differential equations.

Classic example: heat transfer. Consider a "1-D" material, along an axis x, between two heat sources:

[Diagram: a 1-D rod along x, held at temperature T = H at one end and T = L at the other.]

Page 16: High-Performance Grid Computing and Research Networking

Scientific Computing

Use of computers to solve/compute scientific models. For instance, many natural phenomena can be well approximated by partial differential equations (PDEs).

Problem: compute f(x,t), the temperature at location x at time t, for 0 < x < X, with T = H at one end of the material and T = L at the other.

Page 17: High-Performance Grid Computing and Research Networking

Heat Transfer

The laws of physics say that:

∂f/∂t = α ∂²f/∂x²

where α (alpha) depends on the material, and where f(0,t) = H, f(X,t) = L, and f(x,0) are all fixed. These are called the boundary (and initial) conditions.

Question: how do we solve this PDE? In general, it does not have a practical analytical solution, so it must be solved numerically (i.e., via approximation).

Page 18: High-Performance Grid Computing and Research Networking

Heat Transfer

One well-known method to solve the heat equation is called "finite differences". Approach:

Discretize the domain: decide that the values of f(x,t) will only be known for some finite (but large) number of values of x and t.
The discretized domain is called a mesh: all x values are separated by ∆x, and all t values by ∆t.
Then, one replaces partial derivatives by algebraic differences.
In the limit, as ∆x and ∆t go to zero, we get close to the real solution.

Page 19: High-Performance Grid Computing and Research Networking

Heat Transfer

There are many different approximations of the partial derivatives, based on Taylor series developments, etc. For instance, denoting the discrete value of f(i∆x, m∆t) as f_{i,m}, we can write the "Forward Time, Centered Space" (FTCS) discretization of the heat transfer equation as:

f_{i,m+1} = f_{i,m} + (α ∆t / ∆x²) (f_{i+1,m} − 2 f_{i,m} + f_{i−1,m})

The various discretizations of the heat transfer equation have advantages and drawbacks in terms of complexity and numerical stability (if you're into it, there are countless papers and textbooks).

We have transformed a difficult PDE into a simple algebraic recurrence! It is easy to compute in an iterative fashion: given all the values at time step m, one can compute all the values at time step m+1, as in the sketch below.
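
As an illustration (not from the original slides), here is a minimal C sketch of the FTCS iteration above. The mesh size NX, the step count NSTEPS, the boundary values H and L, α = 1, and the zero initial condition are all illustrative assumptions; ∆t is chosen to satisfy the FTCS stability condition α∆t/∆x² ≤ 1/2.

    #include <stdio.h>
    #include <string.h>

    #define NX     100   /* number of mesh points in x (illustrative) */
    #define NSTEPS 5000  /* number of time steps (illustrative) */

    int main(void) {
        double f[NX], fnew[NX];
        const double H = 100.0, L = 0.0;          /* boundary conditions f(0,t)=H, f(X,t)=L */
        const double alpha = 1.0;                  /* material constant (assumed) */
        const double dx = 1.0 / (NX - 1);
        const double dt = 0.4 * dx * dx / alpha;   /* keeps alpha*dt/dx^2 <= 1/2 for stability */
        const double r  = alpha * dt / (dx * dx);

        for (int i = 0; i < NX; i++) f[i] = 0.0;   /* initial condition f(x,0) (assumed zero) */
        f[0] = H; f[NX - 1] = L;

        for (int m = 0; m < NSTEPS; m++) {         /* values at step m yield values at step m+1 */
            for (int i = 1; i < NX - 1; i++)
                fnew[i] = f[i] + r * (f[i + 1] - 2.0 * f[i] + f[i - 1]);
            fnew[0] = H; fnew[NX - 1] = L;
            memcpy(f, fnew, sizeof f);
        }
        printf("temperature at the midpoint: %g\n", f[NX / 2]);
        return 0;
    }

Note how the inner loop touches only neighboring mesh points; this locality is what later makes such computations easy to split across processors.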

Page 20: High-Performance Grid Computing and Research Networking

Heat Transfer Summary

All such methods use some matrix or volume of numbers (in the 2-D and 3-D cases) and iteratively do additions, multiplications and divisions, for many iterations. Therefore, we can replace difficult calculus by simple computations on multi-dimensional arrays of numbers.

Challenges:
These matrices may be really big, for better resolution and larger domains: Large Data.
The number of additions and multiplications can be overwhelming: Heavy Computation.

Hence the early and ever-present need of scientists for bigger memories and faster CPUs.

Page 21: High-Performance Grid Computing and Research Networking

HPC Applications

Science:
Global climate modeling
Astrophysical modeling
Biology: genomics, protein folding, drug design
Computational chemistry
Computational material sciences and nanosciences

Engineering:
Crash simulation
Semiconductor design
Earthquake and structural modeling
Computational fluid dynamics (airplane design)
Combustion (engine design)

Business:
Financial and economic modeling
Transaction processing, web services and search engines

Defense:
Nuclear weapons (test by simulation)
Cryptography

Page 22: High-Performance Grid Computing and Research Networking


Example: Computational Fluid Dynamics (CFD)

Replacing NASA’s Wind Tunnels with Computers

Page 23: High-Performance Grid Computing and Research Networking

Example: Global Climate

Problem is to compute f(latitude, longitude, elevation, time): temperature, pressure, humidity, wind velocity.

Approach:
Discretize the domain, e.g., a measurement point every 10 km.
Devise an algorithm to predict the weather at time t+1 given time t.

Uses:
Predict El Niño
Set air emissions standards

Source: http://www.epm.ornl.gov/chammp/chammp.html

Page 24: High-Performance Grid Computing and Research Networking

Global Climate Requirements

One piece is modeling the fluid flow in the atmosphere:
Solve the Navier-Stokes problem
Roughly 100 Flops per grid point, with a 1-minute timestep

Computational requirements:
To match real time: 5×10^11 flops in 60 seconds = 8 Gflop/s
Weather prediction (7 days in 24 hours): 56 Gflop/s
Climate prediction (50 years in 30 days): 4.8 Tflop/s
Use in policy negotiations (50 years in 12 hours): 288 Tflop/s

Let's make it even worse!
To double the grid resolution, the computation grows by more than 8x.
State-of-the-art models require integration of atmosphere, ocean, sea-ice, and land models, plus possibly carbon cycle, geochemistry and more.
Current models are coarser than this!
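
As a sanity check on these numbers (a worked example, not on the original slide): 5×10^11 flops per simulated minute, delivered in 60 real seconds, is about 8.3×10^9 flop/s. Simulating 7 days in 24 hours is 7× faster than real time, hence roughly 7 × 8 ≈ 56 Gflop/s; 50 years in 30 days is about 608× real time, hence ≈ 4.8 Tflop/s; and 50 years in 12 hours is 36,500× real time, hence roughly 288 Tflop/s.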

Page 25: High-Performance Grid Computing and Research Networking

[Figure: High Resolution Climate Modeling on NERSC-3, P. Duffy, et al., LLNL]

Page 26: High-Performance Grid Computing and Research Networking

Agenda

HPC Introduction
HPC Applications
HPC Goals
Concurrency
History

Page 27: High-Performance Grid Computing and Research Networking

Goals of HPC

Minimize the turn-around time to complete specific application problems (strong scaling).

Maximize the problem size that can be solved given a set amount of time (weak scaling).

Identify the compromise between performance and cost.

Note: most supercomputers are obsolete in terms of performance before the end of their physical life.

Page 28: High-Performance Grid Computing and Research Networking

Maximizing Performance

How is performance maximized?
Reduce the time per instruction (cycle time) [1]: clock rate.
Increase the number of instructions executed per cycle [2]: pipelining.
Allow multiple processors to work on different parts of the same program at the same time [3]: parallel execution.

When performance is gained from [1] and [2]:
There is a limit to how fast processors can operate: the speed of light and electricity, heat dissipation, power consumption.
An instruction processing procedure cannot be divided into infinitely many stages.

When performance improvements come from [3]:
The overhead of communications (see the worked example below).
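
A classic way to quantify the limit on gains from [3] (not stated on the slide, but standard) is Amdahl's law: if a fraction p of a program's execution can be parallelized, then the speedup on n processors is at most

    speedup(n) = 1 / ((1 − p) + p/n)

For example, with p = 0.95 and n = 100 processors, the speedup is at most 1 / (0.05 + 0.0095) ≈ 16.8, far short of 100; communication overhead only pushes this lower.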

Page 29: High-Performance Grid Computing and Research Networking

A 10 TFlop/s CPU?

Question: could we build a single CPU that delivers 10,000 billion floating-point operations per second (10 TFlop/s) and operates over 10,000 billion bytes (10 TByte)? This is representative of what many scientists need today.

The clock rate would have to be 10,000 GHz. Assume that data travels at the speed of light, and that the computer is an "ideal" sphere with the CPU at its center.

Page 30: High-Performance Grid Computing and Research Networking

A 10 TFlop/s CPU?

Assume that the machine issues one instruction per cycle; the clock rate must then be 10,000 GHz ≈ 10^13 Hz. Data must travel some distance from the memory to the CPU. Assume that each instruction needs at least 8 bytes of memory, and that data travels at the speed of light, c = 3×10^8 m/s.

Then the distance between the memory and the CPU must satisfy r < c / 10^13 = 3×10^-5 m. We must therefore fit 10^13 bytes of memory inside a sphere of volume (4/3)πr³ ≈ 1.1×10^-13 m³, so each byte of memory may occupy at most about 1.1×10^-26 m³, i.e., roughly 10^4 Å³: the volume of just a few hundred atoms, with no room left for any wiring.

Current memory densities are around 10 GB/cm³, many orders of magnitude away from the roughly 10^11 GB/cm³ this would require.

Conclusion: it's not going to happen until some sci-fi breakthrough happens.

Page 31: High-Performance Grid Computing and Research Networking

Agenda

HPC Introduction
HPC Applications
HPC Goals
Concurrency
History

Page 32: High-Performance Grid Computing and Research Networking

Concurrency

Since we cannot conceivably build a single CPU to solve relevant scientific problems, we resort to concurrency: the execution of multiple "tasks" at the "same" time.

Concurrency is everywhere in computers:
Loading a word from memory while adding two registers
Adding two pairs of registers at the same time
Receiving data from the network while writing to disk
Dual-processor systems
Clusters of workstations
SETI@home

Some concurrency is "true", meaning that things really happen at the same time. Some concurrency is just the illusion of simultaneous execution, with rapid switching among activities.

Page 33: High-Performance Grid Computing and Research Networking

Concurrent, parallel, distributed?

"Concurrency" is typically the more general term: a program is said to be concurrent if it contains more than one execution context (e.g., more than one thread/process).

Typically, the word "parallel" implies some notion of high-performance / scientific applications running on a single hardware platform.

The word "distributed" typically refers to applications that run on multiple computers that may not be in the same room.

These terms are conflated and misused all the time; in different research communities they mean different things. We'll see that the distinctions are disappearing anyway.

Page 34: High-Performance Grid Computing and Research Networking

Two Types of HPC

Parallel Computing: breaking the problem to be computed into parts that can be run simultaneously on different processors.

Distributed Computing: parts of the work to be computed are computed in different places. Note: this does not necessarily imply simultaneous processing. An example is the client/server model, used to solve loosely-coupled problems (not much communication).

Page 35: High-Performance Grid Computing and Research Networking

Parallel Computing

Architectures of parallel computing:

SMP (Symmetric Multi-Processing): multiple CPUs, single memory, shared I/O. All resources in an SMP machine are equally available to each CPU. Does not scale well to a large number of processors (typically fewer than 8).

NUMA (Non-Uniform Memory Access): multiple CPUs; each CPU has fast access to its local area of the memory, but slower access to other areas. Scales well to a large number of processors, at the cost of a more complicated memory access pattern.

MPP (Massively Parallel Processing)

Cluster

Page 36: High-Performance Grid Computing and Research Networking

Reasons for Concurrency

Concurrency arises for at least four reasons:
1. To increase performance or memory capacity
2. To allow users and computers to "collaborate"
3. To capture the logical structure of a problem
4. To cope with independent physical devices

Page 37: High-Performance Grid Computing and Research Networking

Reason #1: To increase performance

Page 38: High-Performance Grid Computing and Research Networking

Reason #1 (cont.): To increase memory capacity

Example: a 3-D weather simulation over Kaneohe Bay at 1-meter resolution. Say we consider a volume of 2 km x 2 km x 1 km over the bay. Each zone is characterized by, say, temperature, wind direction, wind velocity, air pressure, and air moisture, for a total of (1+3+1+1+1)*8 = 56 bytes. Therefore we need about 208 GB of memory to hold the data (see the arithmetic below).

Option #1: buy a machine with > 208 GB of RAM. A 96 GB server from Sun: about 1 million dollars! They have a 288 GB configuration (contact them for the price). There is a 3 TB shared-memory SGI machine at NCSA.

Option #2: couple individual machines together. Buy 52 4-GB Power-Edge servers from Dell for 2.5K each, slap some network on them, and you've got enough memory; total cost: ~200K.

But: it's not as simple as that!

Page 39: High-Performance Grid Computing and Research Networking

Reason #1 (cont.)

Laser Interferometer Gravitational-Wave Observatory (LIGO): detects tiny distortions of space and time caused when very large masses, such as stars, move suddenly. 1 TB/day (1024 GB/day), in year-long experiments.

The Compact Muon Solenoid: at CERN, designed to study proton-proton collisions with high-quality measurements (12,000 tons). 10 GB/sec!!! Many PB/year (1 PB = 1024 TB).

Page 40: High-Performance Grid Computing and Research Networking

Reason #2: To allow users and computers to collaborate

Example: assume that we want to allow users to make on-line purchases. We need web browsers, web servers, and database servers. All of these are processes. They all communicate with multiple processes simultaneously, they are all multithreaded, they run on multiple machines, and some of them are multi-processor servers.

It's just a big concurrent system, and it is critical that it be fast and correct!

Page 41: High-Performance Grid Computing and Research Networking

Reason #3: To capture the logical structure of a problem

Example: let's assume that we want to write a program that simulates the interactions between a robot and living entities.

We can implement the robot as its own thread: the code is just the code of the robot. We can implement each entity as its own thread: the code is the simulation of the entity's behavior. Now we let them "loose" at the beginning; they may meet, interact, etc. All of this happens without a central notion of control, although it may all be running on a single CPU. Concurrency just fits the problem.

Page 42: High-Performance Grid Computing and Research Networking

Reason #4: To cope with independent physical devices

Example: let's assume that we want to write a program that receives data from the network, processes it, and writes the output to disk.

We can read from the network and write to disk at the same time, "almost" for free. We can compute on the data while we receive from the network, "almost" for free. We can compute on the data while we write the previously computed data to disk, "almost" for free.

We are better off writing this program as three concurrent threads (even if on a single CPU), where each thread drives one "independent" device of the computer, as in the sketch below.
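
A minimal POSIX-threads sketch of this structure in C (illustrative only; the function bodies are toy stand-ins, and a real program would loop and hand buffers from one stage to the next). Compile with, e.g., gcc -pthread.

    #include <pthread.h>
    #include <stdio.h>

    /* toy stand-ins for the three activities on the slide */
    void *receive_from_network(void *arg) { puts("receiving from network"); return NULL; }
    void *process_data(void *arg)         { puts("processing data");        return NULL; }
    void *write_to_disk(void *arg)        { puts("writing to disk");        return NULL; }

    int main(void) {
        pthread_t t[3];
        pthread_create(&t[0], NULL, receive_from_network, NULL);
        pthread_create(&t[1], NULL, process_data, NULL);
        pthread_create(&t[2], NULL, write_to_disk, NULL);
        for (int i = 0; i < 3; i++)
            pthread_join(t[i], NULL);  /* wait for all three threads to finish */
        return 0;
    }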

Page 43: High-Performance Grid Computing and Research Networking

Agenda

HPC Introduction
HPC Applications
HPC Goals
Concurrency
History

Page 44: High-Performance Grid Computing and Research Networking

A brief history of concurrency

The first machines were used in "single-user mode". The user would declare: "I am going to use the machine from 2PM till 4PM." Then the user would go into the special machine room and sit there for 2 hours: the user punches cards (which were prepared in advance), tries to run the program, tries to debug the program, etc., etc.

Extreme lack of productivity: during the user's "thinking time", the multi-million-dollar machine does practically nothing!

Page 45: High-Performance Grid Computing and Research Networking

A brief history of concurrency

Batch processing! Instead of reserving the machine for a lapse of time to do all the activities (including debugging), the user just "submits" requests to a "queue". The queue serves requests in order (possibly with priorities). When a program fails and stops, another program is scheduled to use the machine "immediately".

Great! But what about the CPU idle time during I/O?

Page 46: High-Performance Grid Computing and Research Networking


A brief history of concurrency

Page 47: High-Performance Grid Computing and Research Networking

A brief history of concurrency

Multi-programming! Multiple programs reside in memory at once. This required interrupts and memory protection; interrupts are used to switch the CPU among programs and devices.

Concurrency issues arise in the O/S:
race conditions, deadlocks, critical sections
semaphores, monitors, etc.
the beginning of the theory of concurrent systems (around 1960)

Increases in memory size led to the development of virtual memory.

Page 48: High-Performance Grid Computing and Research Networking

A brief history of concurrency

[Figure: a multiprogramming system with three jobs in memory.]

Page 49: High-Performance Grid Computing and Research Networking

A brief history of concurrency

Time-sharing! For fast, interactive response, one needs fast context switching. This makes it possible to have the illusion that one is alone on a (perhaps slower) machine. Already common by 1970.

This led to concurrency in user applications! If the user's application is "logically" two concurrent tasks, the user can now implement it as two concurrent tasks!

Page 50: High-Performance Grid Computing and Research Networking

A brief history of concurrency

Technology advances!
Multiple CPUs on a motherboard: faster buses, shared memory, cache coherency.
Networked computers: distributed memory; clusters, ..., the Internet.

Concurrency across CPUs. Also, concurrency within the CPU at the hardware level, beyond the CPU and I/O devices: multiple functional units (e.g., ALUs), vector processors, pipelining.

Page 51: High-Performance Grid Computing and Research Networking

History, Another Perspective

1960s: scalar processors, which process one data item at a time.
1970s: vector processors, which can process an array of data items in one go; this comes with architecture overhead.
Late 1980s: Massively Parallel Processing (MPP), with up to thousands of processors, each with its own memory and OS; a problem is broken down across them.
Late 1990s: clusters. Not a new term in itself, but renewed interest in connecting stand-alone computers with a high-speed network.
Late 1990s: Grids, which tackle collaboration and draw an analogy to the power grid.

Page 52: High-Performance Grid Computing and Research Networking

Issues with concurrency

Concurrency appears at all levels of current systems:
hardware
O/S
applications

Many fields of computer science study concurrency issues. Three main issues:
Performance
Correctness
Programmability

Page 53: High-Performance Grid Computing and Research Networking

Many connected "areas"

Computer architecture
Networking
Operating systems
Scientific computing
Theory of distributed systems
Theory of algorithms and complexity
Scheduling
Internetworking
Programming languages
Distributed systems
High performance computing