1
High-Performance Grid Computing and Research Networking
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
Introduction to High Performance Computing
2
Acknowledgements
The content of many of the slides in these lecture notes has been adopted from online resources prepared previously by the people listed below. Many thanks!
- Henri Casanova, Principles of High Performance Computing, http://navet.ics.hawaii.edu/~casanova, henric@hawaii.edu
- Ligang He, http://www.dcs.warwick.ac.uk/~liganghe, Email: liganghe@dcs.warwick.ac.uk
- Kai Wang, Department of Computer Science, University of South Dakota, http://www.usd.edu/~Kai.Wang
- Kyril Faenov, Director of High Performance Computing, Windows Server Group
- Andrew Tanenbaum
3
Agenda
- HPC Introduction
- HPC Applications
- HPC Goals
- Concurrency
- History
4
High Performance Computing
Difficult to define: it’s a moving target.
- In the 1980s: a “supercomputer” performed 100 Mega FLOPS
  - FLOPS: FLoating point Operations Per Second
- Today:
  - a 2 GHz desktop/laptop performs a few Giga FLOPS
  - a “supercomputer” performs tens of Tera FLOPS (Top500)
- High Performance Computing: loosely, on the order of 1000 times more powerful than the latest desktops
5
Units of Measure in HPC
High Performance Computing (HPC) units are:
- Flops: floating point operations
- Flop/s: floating point operations per second
- Bytes: size of data (a double precision floating point number is 8 bytes)
Typical sizes are millions, billions, trillions…
- Mega: Mflop/s = 10^6 flop/sec; Mbyte = 10^6 byte (also 2^20 = 1048576)
- Giga: Gflop/s = 10^9 flop/sec; Gbyte = 10^9 byte (also 2^30 = 1073741824)
- Tera: Tflop/s = 10^12 flop/sec; Tbyte = 10^12 byte (also 2^40 = 1099511627776)
- Peta: Pflop/s = 10^15 flop/sec; Pbyte = 10^15 byte (also 2^50 = 1125899906842624)
- Exa: Eflop/s = 10^18 flop/sec; Ebyte = 10^18 byte
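The decimal and binary readings of each prefix drift apart as the sizes grow. A minimal sketch that tabulates both; the function names are illustrative and not part of the original slides:

```python
# Decimal (SI) sizes versus the binary sizes often used for memory.
PREFIXES = ["Mega", "Giga", "Tera", "Peta", "Exa"]


def decimal_size(k):
    """Decimal size for the k-th prefix in PREFIXES: 10^6, 10^9, ..."""
    return 10 ** (6 + 3 * k)


def binary_size(k):
    """Binary size for the k-th prefix in PREFIXES: 2^20, 2^30, ..."""
    return 2 ** (20 + 10 * k)


def relative_gap(k):
    """Fraction by which the binary size exceeds the decimal size."""
    return binary_size(k) / decimal_size(k) - 1.0


if __name__ == "__main__":
    for k, name in enumerate(PREFIXES):
        print(f"{name:5s} decimal={decimal_size(k)} "
              f"binary={binary_size(k)} gap={relative_gap(k):.1%}")
```

The gap is under 5% at the Mega level and keeps growing with each prefix, which is why it matters to say which convention a quoted memory size uses.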
6
Metric Units
The principal metric prefixes.
7
High Performance Computing
HPC: “The term high performance computing (HPC) refers to the use of (parallel) supercomputers and computer clusters, that is, computing systems comprised of multiple (usually mass-produced) processors linked together in a single system with commercially available interconnects.” (Wikipedia)
“This is in contrast to mainframe computers, which are generally monolithic in nature.” (Wikipedia)
8
High Performance Computing
HPC: “The more current and evolving definition of HPC refers to High Productivity Computing, and reflects the purpose and use model of the myriad of existing and evolving architectures, and the supporting ecosystem of software, middleware, storage, networking and tools behind the next generation of applications.” (Wikipedia)
- Parallel Computing: computing on parallel computers
- Super Computing: computing on Top500 machines
9
High Performance Computing
The definition that we use in this course: “How do we make computers compute bigger problems faster?”
Three main issues
- Hardware: How do we build faster computers?
- Software: How do we write faster programs?
- Hardware and Software: How do they interact?
Many perspectives, ranging from theory to practice
- architecture
- systems
- programming
- modeling and analysis
- simulation
- algorithms and complexity
10
High Performance Computing
HPC Related Technologies
HPC is an all-encompassing term for related technologies that continually push computing boundaries.
1. Computer architecture
   - CPU, memory, VLSI
2. Compilers
   - Identify inefficient implementations
   - Make use of the characteristics of the computer architecture
   - Choose a suitable compiler for a certain architecture
3. Algorithms (for parallel and distributed systems)
   - How to program on parallel and distributed systems
4. Middleware
   - From Grid computing technology
   - Application -> middleware -> operating system
   - Resource discovery and sharing
11
High Performance Computing
The key technique for making computers compute “bigger problems faster” is to use multiple computers at once
- Later in this lecture, we will learn why!
This is called parallelism
- It takes 1000 hours for this program to run on one computer!
  - Well, if I use 100 computers maybe it will take only 10 hours?!
- This computer can only handle a dataset that’s 2GB!
  - So maybe if I use 100 computers I can deal with a 200GB dataset?!
We will spend enough time to learn and experience different flavors of parallel computing
- shared-memory parallelism
- distributed-memory parallelism
- hybrid parallelism
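The two "maybe" estimates above are best-case figures: they assume the work and the data split perfectly across machines with no communication or coordination cost. A minimal sketch of that idealized arithmetic, with illustrative function names:

```python
def ideal_runtime_hours(serial_hours, n_computers):
    """Best-case runtime: work splits perfectly, zero overhead (an assumption)."""
    return serial_hours / n_computers


def ideal_dataset_gb(per_computer_gb, n_computers):
    """Best-case aggregate dataset: data splits perfectly across machines."""
    return per_computer_gb * n_computers


if __name__ == "__main__":
    print(ideal_runtime_hours(1000, 100))  # hours on 100 computers, best case
    print(ideal_dataset_gb(2, 100))        # GB across 100 computers, best case
```

Real programs fall short of these bounds, which is the subject of much of the rest of the course.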
13
Words of Wisdom
“Four or five computers should be enough for the entire world until the year 2000.” T.J. Watson, Chairman of IBM, 1945.
“640KB [of memory] ought to be enough for anybody.” Bill Gates, Chairman of Microsoft, 1981.
You may laugh at their vision today, but …
- Lesson learned: Don’t be too visionary and try to make things work! ;)
We now know this was not quite true!
- Games
- Digital video/images
- Databases
- Operating systems
But the first people to really need more computing oomph were scientists
- And they go way back
14
Evolution of Science
Traditional scientific and engineering approach:
1) Do theory or paper design
2) Perform experiments or build system
Limitations:
- Too difficult -- build large wind tunnels
- Too expensive -- build a throw-away airplane
- Too slow -- wait for climate or galactic evolution
- Too dangerous -- weapons, drug design, climate experiments
Solution:
3) Use high performance computer systems to simulate the phenomenon
15
Scientific Computing
Use of computers to solve/compute scientific models
- For instance, many natural phenomena can be well approximated by differential equations
Classic Example: Heat Transfer
- Consider a “1-D” material between 2 heat sources
[Figure: a 1-D bar along the x axis, with T = H at one end and T = L at the other]
16
Scientific Computing
Use of computers to solve/compute scientific models
- For instance, many natural phenomena can be well approximated by partial differential equations (PDEs)
Problem: compute f(x,t)
- f(x,t): temperature at location x at time t
- 0 < x < X
[Figure: the bar spans 0 < x < X, with T = H at one end and T = L at the other]
17
Heat Transfer
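The laws of physics say that:
  \frac{\partial f}{\partial t} = \alpha \frac{\partial^2 f}{\partial x^2}
- where alpha (\alpha) depends on the material
- where f(0,t) = H, f(X,t) = L and f(x,0) are all fixed
  - Called the boundary conditions (f(x,0) is the initial condition)
Question: How do we solve this PDE?
- In general, it does not have a convenient analytical solution
- Therefore it must be solved numerically (i.e., via approximation)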
18
Heat Transfer
One well-known method to solve the heat equation is called “finite differences”
Approach:
- Discretize the domain: decide that the values of f(x,t) will only be known for some finite (but large) number of values of x and t
  - The discretized domain is called a mesh
  - All x values are separated by ∆x
  - All t values are separated by ∆t
- Then, one replaces partial derivatives by algebraic differences
  - In the limit, when ∆x and ∆t go to zero, we get close to the real solution
19
Heat Transfer
There are many different approximations of the partial derivatives, based on Taylor series developments, etc.
For instance, denoting f(x,t) at mesh point i and time step m as the (discrete) f_{i,m}, we can write the “Forward Time, Centered Space” (FTCS) heat transfer equation as:
  f_{i,m+1} = f_{i,m} + \frac{\alpha \Delta t}{\Delta x^2} \left( f_{i+1,m} - 2 f_{i,m} + f_{i-1,m} \right)
The various discretizations of the heat transfer equation have advantages and drawbacks in terms of
- complexity
- numerical stability (if you’re into it, there are countless papers and textbooks)
We have transformed a difficult PDE into some type of algebraic induction!
- Easy to compute in an iterative fashion
- Given all the values at time m, one can compute all the values at time m+1
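The iterative scheme can be sketched in a few lines. This is a minimal illustration of the standard FTCS update for the 1-D heat equation, not code from the course; the function names, the uniform initial temperature, and the parameter `r = alpha * dt / dx**2` are assumptions made for the example. The check `r <= 0.5` is the usual stability limit for FTCS.

```python
def ftcs_step(f, r):
    """Compute all values at time m+1 from the values at time m.

    f: temperatures at the mesh points; f[0] and f[-1] are the fixed ends.
    r: alpha * dt / dx**2 (dimensionless).
    """
    new = f[:]  # the two boundary values are copied unchanged
    for i in range(1, len(f) - 1):
        new[i] = f[i] + r * (f[i + 1] - 2.0 * f[i] + f[i - 1])
    return new


def solve_heat(high, low, n_points, r, n_steps, initial=0.0):
    """Iterate FTCS on a bar held at `high` (x = 0) and `low` (x = X)."""
    if r > 0.5:
        raise ValueError("FTCS is unstable for alpha*dt/dx**2 > 0.5")
    f = [initial] * n_points
    f[0], f[-1] = high, low
    for _ in range(n_steps):
        f = ftcs_step(f, r)
    return f


if __name__ == "__main__":
    # Hypothetical values: 11 mesh points, ends held at 100 and 0.
    profile = solve_heat(100.0, 0.0, n_points=11, r=0.4, n_steps=2000)
    print([round(v, 2) for v in profile])
```

After enough steps the profile settles to a straight line between the two end temperatures, which is the steady state one expects for a uniform bar.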
20
Heat Transfer Summary
The various discretization schemes all use some matrix or volume of numbers (in the 2-D and 3-D cases) and iteratively do additions, multiplications and divisions, for many iterations
Therefore, we can replace difficult calculus by simple computations on multi-dimensional arrays of numbers
Challenges
- These matrices may be really big, for better resolution and larger domains: Large Data
- The number of additions and multiplications can be overwhelming: Heavy Computation
Hence the early and constant need of scientists to get bigger memories and faster CPUs
21
HPC Applications
Science
- Global climate modeling
- Astrophysical modeling
- Biology: genomics; protein folding; drug design
- Computational Chemistry
- Computational Material Sciences and Nanosciences
Engineering
- Crash simulation
- Semiconductor design
- Earthquake and structural modeling
- Computational fluid dynamics (airplane design)
- Combustion (engine design)
Business
- Financial and economic modeling
- Transaction processing, web services and search engines
Defense
- Nuclear weapons -- test by simulation
- Cryptography
22
Example: Computational Fluid Dynamics (CFD)
Replacing NASA’s Wind Tunnels with Computers
23
Example: Global Climate
Problem is to compute:
  f(latitude, longitude, elevation, time) -> temperature, pressure, humidity, wind velocity
Approach:
- Discretize the domain, e.g., a measurement point every 10 km
- Devise an algorithm to predict weather at time t+1 given t
Uses:
- Predict El Nino
- Set air emissions standards
Source: http://www.epm.ornl.gov/chammp/chammp.html
24
Global Climate Requirements
One piece is modeling the fluid flow in the atmosphere
- Solve Navier-Stokes problem
- Roughly 100 Flops per grid point with 1 minute timestep
Computational requirements:
- To match real-time, need 5x10^11 flops in 60 seconds = 8 Gflop/s
- Weather prediction (7 days in 24 hours): 56 Gflop/s
- Climate prediction (50 years in 30 days): 4.8 Tflop/s
- To use in policy negotiations (50 years in 12 hours): 288 Tflop/s
Let’s make it even worse!
- Doubling the grid resolution increases the computation by more than 8x
- State of the art models require integration of atmosphere, ocean, sea-ice, land models, plus possibly carbon cycle, geochemistry and more
Current models are coarser than this!
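The requirements follow from one rule: the sustained rate is the real-time rate multiplied by how much faster than real time the simulation must run. A minimal sketch of that arithmetic, assuming 5x10^11 flops per simulated minute and 365-day years; the names are illustrative. The slide rounds the real-time rate down to 8 Gflop/s before scaling, so its figures come out slightly lower than the unrounded ones computed here.

```python
FLOPS_PER_SIMULATED_MINUTE = 5e11  # from the slide: 5x10^11 flops per 1-minute timestep


def speedup(simulated_days, wallclock_days):
    """How many times faster than real time the simulation must run."""
    return simulated_days / wallclock_days


def required_flops_per_sec(speedup_factor):
    """Sustained rate needed to simulate `speedup_factor` minutes per wall-clock minute."""
    return FLOPS_PER_SIMULATED_MINUTE * speedup_factor / 60.0


SCENARIOS = {
    "real time": speedup(1, 1),
    "weather (7 days in 24 hours)": speedup(7, 1),
    "climate (50 years in 30 days)": speedup(50 * 365, 30),
    "policy (50 years in 12 hours)": speedup(50 * 365, 0.5),
}

if __name__ == "__main__":
    for name, factor in SCENARIOS.items():
        rate = required_flops_per_sec(factor)
        print(f"{name}: {rate / 1e9:.1f} Gflop/s")
```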
25
[Figure: High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL]
28
Goals of HPC
- Minimize turn-around time to complete specific application problems (strong scaling)
- Maximize the problem size that can be solved given a set amount of time (weak scaling)
- Identify a compromise between performance and cost.
  - Note: Most supercomputers are obsolete in terms of performance before the end of their physical life.
29
Maximizing Performance
How is performance maximized?
- Reduce the time per instruction (cycle time) [1]: clock rate.
- Increase the number of instructions executed per cycle [2]: pipelining.
- Allow multiple processors to work on different parts of the same program at the same time [3]: parallel execution.
When performance is gained from [1] and [2]
- There is a limit to how quickly processors can operate.
  - Speed of light and electricity.
  - Heat dissipation.
  - Power consumption.
- An instruction processing procedure cannot be divided into infinitely many stages
When performance improvements come from [3]
- Overhead of communications
30
A 10 TFlop/s CPU?
Question: Could we build a single CPU that delivers 10,000 billion floating point operations per second (10 TFlop/s), and operates over 10,000 billion bytes (10 TByte)?
- Representative of what many scientists need today.
- Clock rate has to be 10,000 GHz
- Assume that data travels at the speed of light
- Assume that the computer is an “ideal” sphere
[Figure: a sphere with the CPU at its center]
31
A 10 TFlop/s CPU?
Assume that the machine issues one instruction per cycle
- therefore the clock rate must be 10,000 GHz ~ 10^13 Hz
Data must travel some distance from the memory to the CPU
- Assume that each instruction needs at least one 8-byte word of memory
- Assume that data travels at the speed of light c = 3x10^8 m/s
- Then the distance between the memory and the CPU must be r < c / 10^13 ~ 3x10^-5 m
- Then we must have 10^13 bytes of memory in (4/3)πr^3 ~ 1.1x10^-13 m^3
- Therefore, each byte of memory must occupy about 1.1x10^-26 m^3
- This is about 11,000 Angstrom^3, a cube roughly 2 nm on a side
- Or about the volume of a single molecule
Current memory densities are 10GB/cm^3, or about a factor 10^10 from what would be needed!
Conclusion: It’s not going to happen until some sci-fi breakthrough happens
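The back-of-the-envelope estimate can be reproduced numerically. A minimal sketch using only the stated assumptions (one instruction per cycle at 10^13 Hz, data moving at c = 3x10^8 m/s, 10^13 bytes packed in a sphere, and a current density of 10 GB/cm^3); the constant and function names are illustrative.

```python
import math

C_M_PER_S = 3e8                        # speed of light, as assumed on the slide
CLOCK_HZ = 1e13                        # 10,000 GHz
MEMORY_BYTES = 1e13                    # 10 TByte
CURRENT_DENSITY_BYTES_PER_CM3 = 10e9   # 10 GB/cm^3, as quoted on the slide


def max_radius_m():
    """Farthest a signal can travel in one clock cycle."""
    return C_M_PER_S / CLOCK_HZ


def sphere_volume_m3(radius_m):
    """Volume of the 'ideal' spherical computer."""
    return 4.0 / 3.0 * math.pi * radius_m ** 3


def volume_per_byte_m3():
    """Space available for each byte of memory inside the sphere."""
    return sphere_volume_m3(max_radius_m()) / MEMORY_BYTES


def density_gap():
    """Required memory density divided by the current density."""
    volume_cm3 = sphere_volume_m3(max_radius_m()) * 1e6  # 1 m^3 = 10^6 cm^3
    return (MEMORY_BYTES / volume_cm3) / CURRENT_DENSITY_BYTES_PER_CM3


if __name__ == "__main__":
    print(f"r < {max_radius_m():.1e} m")
    print(f"sphere volume ~ {sphere_volume_m3(max_radius_m()):.2e} m^3")
    print(f"volume per byte ~ {volume_per_byte_m3() / 1e-30:.0f} Angstrom^3")
    print(f"density gap ~ {density_gap():.1e}")
```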
33
Concurrency
Since we cannot conceivably build a single CPU to solve relevant scientific problems, we resort to concurrency: execution of multiple “tasks” at the “same” time
Concurrency is everywhere in computers
- Load a word from memory while adding two registers
- Adding two pairs of registers at the same time
- Receiving data from the network while writing to disk
- Dual-proc systems
- Clusters of workstations
- SETI@home
Some concurrency is “true”, meaning that things really happen at the same time
Some concurrency is just the illusion of simultaneous execution, with rapid switching among activities
34
Concurrent, parallel, distributed?
“Concurrency” is typically the more general term
- A program is said to be concurrent if it contains more than one execution context, e.g., more than one thread/process
Typically the word “parallel” implies some notion of high performance / scientific application running on a single hardware platform
The word “distributed” typically refers to applications that run on multiple computers that may not be in the same room
These terms are conflated and misused all the time; in different research communities they mean different things.
We’ll see that distinctions are disappearing anyway
35
Two Types of HPC
Parallel Computing
- Breaking the problem to be computed into parts that can be run simultaneously on different processors
Distributed Computing
- Parts of the work to be computed are computed in different places
  - Note: does not necessarily imply simultaneous processing
- An example: the client/server (C/S) model
- Solves loosely-coupled problems (not much communication)
36
Parallel Computing
Architectures of Parallel Computing
- SMP (Symmetric Multi-Processing)
  - Multiple CPUs, single memory, shared I/O
  - All resources in an SMP machine are equally available to each CPU
  - Does not scale well to a large number of processors (typically fewer than 8)
- NUMA (Non-Uniform Memory Access)
  - Multiple CPUs
  - Each CPU has fast access to its local area of the memory, but slower access to other areas
  - Scales well to a large number of processors
  - Complicated memory access pattern
- MPP (Massively Parallel Processing)
- Cluster
37
Reasons for Concurrency
Concurrency arises for at least 4 reasons
1. To increase performance or memory capacity
2. To allow users and computers to “collaborate”
3. To capture the logical structure of a problem
4. To cope with independent physical devices
38
Reason #1 To increase performance
39
Reason #1 (cont.)
To increase memory capacity
Example
- A 3D weather simulation over Kaneohe Bay (1-meter resolution)
- Say we consider a volume 2km x 2km x 1km over the bay
- Each zone is characterized by, say, temperature, wind direction, wind velocity, air pressure, air moisture, for a total of (1+3+1+1+1)*8 = 56 bytes
- Therefore we need about 208GB of memory to hold the data
Option #1: Buy a machine with > 208 GB RAM
- 96GB server from Sun: about 1 million dollars!
- They have a 288GB configuration (contact them for price)
- There is a 3TB shared-memory SGI machine at NCSA
Option #2: Couple individual machines together
- Buy 52 4GB PowerEdge servers from Dell for 2.5K each
- Slap some network on them and you’ve got enough memory
- total cost: ~ 200K
But: it’s not as simple as that!
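The 208GB figure comes from counting 1-meter zones and multiplying by the bytes stored per zone. A minimal sketch of that calculation, assuming 8-byte values and 1 GB = 2^30 bytes (the convention under which the slide's number comes out); the names are illustrative.

```python
BYTES_PER_VALUE = 8
# temperature, wind direction (3 components), wind velocity, pressure, moisture
VALUES_PER_ZONE = 1 + 3 + 1 + 1 + 1


def zone_count(x_m, y_m, z_m, resolution_m=1):
    """Number of zones in a box of the given size at the given resolution."""
    return (x_m // resolution_m) * (y_m // resolution_m) * (z_m // resolution_m)


def memory_bytes(x_m, y_m, z_m, resolution_m=1):
    """Total bytes needed to hold every value of every zone."""
    return zone_count(x_m, y_m, z_m, resolution_m) * VALUES_PER_ZONE * BYTES_PER_VALUE


def to_binary_gb(n_bytes):
    """Convert bytes to GB using 1 GB = 2^30 bytes."""
    return n_bytes / 2 ** 30


if __name__ == "__main__":
    total = memory_bytes(2000, 2000, 1000)
    print(f"{zone_count(2000, 2000, 1000):.1e} zones, {to_binary_gb(total):.1f} GB")
```

Halving the resolution to 0.5 m would multiply the zone count, and therefore the memory, by 8.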
40
Reason #1 (cont.)
Laser Interferometer Gravitational-Wave Observatory (LIGO)
- Tiny distortions of space and time caused when very large masses, such as stars, move suddenly.
- 1 TB/day (1024 GB/day), year-long experiments
The Compact Muon Solenoid
- At CERN, designed to study proton-proton collisions with high quality measurements (12,000 tons)
- 10 GB/sec!!!
- Many PB/year (1 PB/year = 1024 TB/year)
41
Reason #2
To allow users and computers to collaborate
Example
- Assume that we want to allow users to do on-line purchases
- We need Web browsers, Web servers, Database servers
- All these are processes
  - They all communicate with multiple processes simultaneously
  - They are all multithreaded
  - They run on multiple machines, some of which are multi-proc servers
- It’s just a big concurrent system and it is critical that it be fast and correct!
42
Reason #3
To capture the logical structure of a problem
Example
- Let’s assume that we want to write a program that simulates the interactions between a robot and living entities
- We can implement the robot as its own thread
  - The code is just the code of the robot
- We can implement each entity as its own thread
  - The code is the simulation of the entity’s behavior
- Now we let them “loose” at the beginning
  - They may meet, interact, etc.
  - All of this happens without a central notion of control, although everything may be running on a single CPU
- Concurrency just fits the problem
43
Reason #4
To cope with independent physical devices
Example
- Let’s assume that we want to write a program that receives data from the network, processes it, and writes output to the disk
- We can read from the network and write to disk at the same time “almost” for free
- We can compute on the data while we receive from the network “almost” for free
- We can compute on the data while we write the previously computed data to disk “almost” for free
- We are better off writing this program as three concurrent threads (even if on a single CPU)
  - Each thread uses one “independent” device of the computer
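The three-thread structure can be sketched with Python's standard `threading` and `queue` modules. This is an illustration under stated assumptions, not the course's code: the network is stood in for by an in-memory list, the disk by a Python list, and the processing step is a hypothetical doubling of each value.

```python
import queue
import threading

SENTINEL = None  # marks the end of the data stream (packets must not be None)


def receiver(packets, to_process):
    """Stands in for the network: hands each packet to the processing stage."""
    for packet in packets:
        to_process.put(packet)
    to_process.put(SENTINEL)


def processor(to_process, to_write):
    """Computes on each packet (here: a hypothetical doubling step)."""
    while True:
        packet = to_process.get()
        if packet is SENTINEL:
            to_write.put(SENTINEL)
            return
        to_write.put(packet * 2)


def writer(to_write, disk):
    """Stands in for the disk: appends each result to a list."""
    while True:
        result = to_write.get()
        if result is SENTINEL:
            return
        disk.append(result)


def run_pipeline(packets):
    """Run the three stages as concurrent threads and return what was 'written'."""
    to_process, to_write, disk = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=receiver, args=(packets, to_process)),
        threading.Thread(target=processor, args=(to_process, to_write)),
        threading.Thread(target=writer, args=(to_write, disk)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return disk


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

Each queue has a single producer and a single consumer, so results reach the "disk" in the order the packets arrived. With real devices, a thread blocked on the network or the disk leaves the CPU free for the processing thread.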
45
A brief history of concurrency
First machines were used in “single-user mode”
- The user would declare: “I am going to use the machine from 2PM till 4PM”
- Then the user would go in the special machine room and sit there for 2 hours
  - The user punches cards, which were prepared in advance
  - The user tries to run the program
  - The user tries to debug the program
  - etc. etc.
Extreme lack of productivity
- During the user’s “thinking time”, the multi-million $ machine practically does nothing!
46
A brief history of concurrency
Batch Processing!
- Instead of reserving the machine for a lapse of time to do all the activities (including debugging), the user just “submits” requests to a “queue”
- The queue serves requests in order (possibly with priorities)
- When the program fails and stops, another program is scheduled to use the machine “immediately”
Great! But how about the CPU idle time during I/O?
47
A brief history of concurrency
48
A brief history of concurrency
Multi-programming!
- Multiple programs reside in memory at once
- Required interrupts and memory protection
- Interrupts are used to switch programs between devices and CPUs
Concurrency issues in the O/S
- race conditions, deadlocks, critical sections
- semaphores, monitors, etc.
- beginning of theory of concurrent systems (1960)
Increase in memory size
- Development of virtual memory
49
A brief history of concurrency
[Figure: Multiprogramming system with three jobs in memory]
50
A brief history of concurrency
Time-sharing!
- For fast, interactive response, one needs fast context switching
- Makes it possible to have the illusion that one is alone on a (perhaps slower) machine
- Already common by 1970
Led to concurrency in user applications!
- The user’s application is “logically” two concurrent tasks
- The user can now implement it as two concurrent tasks!
51
A brief history of concurrency
Technology advances!
- Multiple CPUs on a motherboard
  - faster buses, shared-memory, cache coherency
- Networked computers
  - distributed memory
- Clusters, ..., Internet
Concurrency across CPUs
Also: Concurrency within the CPU at the hardware level
- Beyond CPU and I/O devices
- Multiple units (e.g., ALUs)
- Vector processors
- Pipelining
52
History, Another Perspective
- 1960s: Scalar processor
  - Process one data item at a time
- 1970s: Vector processor
  - Can process an array of data items at one go
  - Architecture
  - Overhead
- Late 1980s: Massively Parallel Processing (MPP)
  - Up to thousands of processors, each with its own memory and OS
  - Break down a problem
- Late 1990s: Cluster
  - Not a new term itself, but renewed interest
  - Connecting stand-alone computers with a high-speed network
- Late 1990s: Grid
  - Tackle collaboration
  - Draws an analogy with the power grid
53
Issues with concurrency
Concurrency appears at all levels of current systems
- hardware
- O/S
- Application
Many fields of computer science study concurrency issues
Three main issues
- Performance
- Correctness
- Programmability
54
Many connected “areas”
- Computer architecture
- Networking
- Operating Systems
- Scientific Computing
- Theory of Distributed Systems
- Theory of Algorithms and Complexity
- Scheduling
- Internetworking
- Programming Languages
- Distributed Systems
- High Performance Computing