
Page 1: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction toHigh Performance Computing:

Parallel Computing, Distributed  Computing, Grid Computing and More

Dr. Jay Boisseau

Director, Texas Advanced Computing Center

[email protected]

December 3, 2001

The University of Texas at AustinTexas Advanced Computing Center

Page 2: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Outline

• Preface

• What is High Performance Computing?

• Parallel Computing

• Distributed Computing, Grid Computing, and More

• Future Trends in HPC

Page 3: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Purpose

• Purpose of this workshop:– to educate researchers about the value and

impact of high performance computing (HPC) techniques and technologies in conducting computational science and engineering

• Purpose of this presentation:– to educate researchers about the techniques and

tools of parallel computing, and to show them the possibilities presented by distributed computing and Grid computing

Page 4: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Goals

• Goals of this presentation are to help you:1. understand the ‘big picture’ of high performance

computing

2. develop a comprehensive understanding of parallel computing

3. begin to understand how Grid and distributed computing will further enhance computational science capabilities

Page 5: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Content and Context

• This material is an introduction and an overview– It is not a comprehensive HPC reference, so further reading

(much more!) is recommended.

• Presentation is followed by additional speakers with detailed presentations on specific HPC and science topics

• Together, these presentations will help prepare you to use HPC in your scientific discipline.

Page 6: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Background - me

• Director of the Texas Advanced Computing Center (TACC) at the University of Texas

• Formerly at San Diego Supercomputer Center (SDSC), Arctic Region Supercomputing Center

• 10+ years in HPC

• Have known Luis for 4 years - plan to develop a strong relationship between TACC and CeCalCULA

Page 7: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Background – TACC

• Mission:– to enhance the academic research capabilities of

the University of Texas and its affiliates through the application of advanced computing resources and expertise

• TACC activities include:– Resources– Support– Development– Applied research

Page 8: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

TACC Activities

• TACC resources and support includes:– HPC systems – Scientific visualization resources– Data storage/archival systems

• TACC research and development areas: – HPC– Scientific Visualization– Grid Computing

Page 9: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Current HPC Systems

[System diagram: HPC systems connected by HiPPI and FDDI networks through an Ascend router]

– CRAY SV1: 16 CPUs, 16 GB memory
– CRAY T3E: 256+ procs, 128 MB/proc, 500 GB disk
– IBM SP: 64+ procs, 256 MB/proc, 300 GB disk
– Archive: 640 GB
– (host names shown in the diagram: aurora, golden, azure)

Page 10: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

New HPC Systems

• Four IBM p690 HPC servers– 16 Power4 Processors

• 1.3 GHz: 5.2 Gflops per proc, 83.2 Gflops per server

– 16 GB Shared Memory• >200 GB/s memory bandwidth!

– 144 GB Disk

• 1 TB disk to partition across servers

• Will configure as single system (1/3 Tflop) with single GPFS system (1 TB) in 2Q02

Page 11: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

New HPC Systems

• IA64 Cluster– 20 2-way nodes

• Itanium (800 MHz) processors

• 2 GB memory/node

• 72 GB disk/node

– Myrinet 2000 switch – 180GB shared disk

• IA32 Cluster– 32 2-way nodes

• Pentium III (1 GHz) processors

• 1 GB Memory

• 18.2 GB disk/node

– Myrinet 2000 Switch

750 GB IBM GPFS parallel file system for both clusters

Page 12: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

World-Class Vislab

• SGI Onyx2– 24 CPUs, 6 Infinite Reality 2 Graphics Pipelines– 24 GB Memory, 750 GB Disk

• Front and Rear Projection Systems– 3x1 cylindrically-symmetric Power Wall– 5x2 large-screen, 16:9 panel Power Wall

• Matrix switch between systems, projectors, rooms

Page 13: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

More Information

• URL: www.tacc.utexas.edu

• E-mail Addresses:– General Information: [email protected]– Technical assistance: [email protected]

• Telephone Numbers:– Main Office: (512) 475-9411– Facsimile transmission: (512) 475-9445– Operations Room: (512) 475-9410

Page 14: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Outline

• Preface

• What is High Performance Computing?

• Parallel Computing

• Distributed Computing, Grid Computing, and More

• Future Trends in HPC

Page 15: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

‘Supercomputing’

• First HPC systems were vector-based systems (e.g. Cray)– named ‘supercomputers’ because they were an

order of magnitude more powerful than commercial systems

• Now, ‘supercomputer’ has little meaning– large systems are now just scaled up versions of

smaller systems

• However, ‘high performance computing’ has many meanings

Page 16: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

HPC Defined

• High performance computing:– can mean high flop count

• per processor• totaled over many processors working on the same

problem• totaled over many processors working on related

problems

– can mean faster turnaround time• more powerful system• scheduled to first available system(s)• using multiple systems simultaneously

Page 17: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

My Definitions

• HPC: any computational technique that solves a large problem faster than possible using single, commodity systems– Custom-designed, high-performance processors

(e.g. Cray, NEC)– Parallel computing– Distributed computing– Grid computing

Page 18: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

My Definitions

• Parallel computing: single systems with many processors working on the same problem

• Distributed computing: many systems loosely coupled by a scheduler to work on related problems

• Grid Computing: many systems tightly coupled by software and networks to work together on single problems or on related problems

Page 19: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Importance of HPC

• HPC has had tremendous impact on all areas of computational science and engineering in academia, government, and industry.

• Many problems have been solved with HPC techniques that were impossible to solve with individual workstations or personal computers.

Page 20: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Outline

• Preface

• What is High Performance Computing?

• Parallel Computing

• Distributed Computing, Grid Computing, and More

• Future Trends in HPC

Page 21: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

What is a Parallel Computer?

• Parallel computing: the use of multiple computers or processors working together on a common task

• Parallel computer: a computer that contains multiple processors:– each processor works on its section of the

problem– processors are allowed to exchange information

with other processors

Page 22: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Parallel vs. Serial Computers

• Two big advantages of parallel computers:1. total performance

2. total memory

• Parallel computers enable us to solve problems that:– benefit from, or require, fast solution– require large amounts of memory– example that requires both: weather forecasting

Page 23: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Parallel vs. Serial Computers

• Some benefits of parallel computing include:– more data points

• bigger domains• better spatial resolution• more particles

– more time steps • longer runs• better temporal resolution

– faster execution• faster time to solution• more solutions in same time• larger simulations in real time

Page 24: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Serial Processor Performance

[Chart: single-processor performance vs. time, comparing Moore's Law growth with a projected future(?) flattening]

Although Moore’s Law ‘predicts’ that single processor performance doubles every 18 months, eventually physical limits on manufacturing technology will be reached

Page 25: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Types of Parallel Computers

• The simplest and most useful way to classify modern parallel computers is by their memory model:– shared memory– distributed memory

Page 26: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

[Diagrams: shared memory - several processors (P) connected by a bus to a single memory; distributed memory - processor/memory (P/M) pairs connected by a network]

Shared memory - single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)

Distributed memory - each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)

Shared vs. Distributed Memory

Page 27: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

[Diagram: UMA - all processors on one bus to a single shared memory]

Uniform memory access (UMA): Each processor has uniform access to memory. Also known as symmetric multiprocessors, or SMPs (Sun E10000)

[Diagram: NUMA - two bus-based processor/memory groups connected by a network]

Non-uniform memory access (NUMA): Time for memory access depends on location of data. Local access is faster than non-local access. Easier to scale than SMPs (SGI Origin)

Shared Memory: UMA vs. NUMA

Page 28: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Distributed Memory: MPPs vs. Clusters

• Processor-memory nodes are connected by some type of interconnect network– Massively Parallel Processor (MPP): tightly

integrated, single system image.– Cluster: individual computers connected by s/w

[Diagram: many CPU/MEM nodes connected by an interconnect network]

Page 29: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Processors, Memory, & Networks

• Both shared and distributed memory systems have:1. processors: now generally commodity RISC

processors

2. memory: now generally commodity DRAM

3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)

• We will now begin to describe these pieces in detail, starting with definitions of terms.

Page 30: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Processor-Related Terms

Clock period (cp): the minimum time interval between successive actions in the processor. Fixed: depends on design of processor. Measured in nanoseconds (~1-5 for fastest processors). Inverse of frequency (MHz).

Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.

Register: a small, extremely fast location for storing data or instructions in the processor.

Page 31: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Processor-Related Terms

Functional Unit (FU): a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.

Pipeline : technique enabling multiple instructions to be overlapped in execution.

Superscalar: multiple instructions are possible per clock period.

Flops: floating point operations per second.

Page 32: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Processor-Related Terms

Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so processor can execute more instructions more rapidly.

Translation-Lookaside Buffer (TLB): keeps addresses of pages (block of memory) in main memory that have recently been accessed (a cache for memory addresses)

Page 33: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Memory-Related Terms

SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable.

DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but hold more bits and are much less expensive (10x cheaper).

Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.

Page 34: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Interconnect-Related Terms

• Latency: – Networks: How long does it take to start sending a

"message"? Measured in microseconds.– Processors: How long does it take to output

results of some operations (such as floating point add, divide, etc.) that are pipelined?

• Bandwidth: What data rate can be sustained once the message is started? Measured in Mbytes/sec or Gbytes/sec

Page 35: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Interconnect-Related Terms

Topology: the manner in which the nodes are connected. – Best choice would be a fully connected network

(every processor to every other). Unfeasible for cost and scaling reasons.

– Instead, processors are arranged in some variation of a grid, torus, or hypercube.

3-d hypercube 2-d mesh 2-d torus

Page 36: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Processor-Memory Problem

• Processors issue instructions roughly every nanosecond.

• DRAM can be accessed roughly every 100 nanoseconds (!).

• DRAM cannot keep processors busy! And the gap is growing:– processors getting faster by 60% per year– DRAM getting faster by 7% per year (SDRAM and

EDO RAM might help, but not enough)

Page 37: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Processor-Memory Performance Gap

[Chart, 1980-2000 (log scale): CPU performance ("Moore's Law") grows ~60%/yr while DRAM performance grows ~7%/yr; the processor-memory performance gap grows ~50% per year]

From D. Patterson, CS252, Spring 1998 ©UCB

Page 38: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Processor-Memory Performance Gap

• Problem becomes worse when remote (distributed or NUMA) memory is needed– network latency is roughly 1000-10000

nanoseconds (roughly 1-10 microseconds)– networks getting faster, but not fast enough

• Therefore, cache is used in all processors– almost as fast as processors (same circuitry)– sits between processors and local memory– expensive, can only use small amounts– must design system to load cache effectively

Page 39: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

[Diagram: CPU connected to main memory through a cache]

Processor-Cache-Memory

• Cache is much smaller than main memory and hence there is mapping of data from main memory to cache.

Page 40: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

[Diagram: memory hierarchy from CPU to cache to local memory to remote memory; moving away from the CPU, speed decreases while size increases and cost per bit decreases]

Memory Hierarchy

Page 41: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Cache-Related Terms

• ICACHE : Instruction cache

• DCACHE (L1) : Data cache closest to registers

• SCACHE (L2) : Secondary data cache– Data from SCACHE has to go through DCACHE

to registers– SCACHE is larger than DCACHE – Not all processors have SCACHE

Page 42: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Cache Benefits

• Data cache was designed with two key concepts in mind– Spatial Locality

• When an element is referenced its neighbors will be referenced also

• Cache lines are fetched together• Work on consecutive data elements in the same cache

line

– Temporal Locality• When an element is referenced, it might be referenced

again soon• Arrange code so that data in cache is reused often
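
To make spatial locality concrete, here is a small sketch (added for illustration, not from the original slides): Fortran stores arrays column by column, so keeping the leftmost index in the innermost loop walks consecutive memory locations and uses each fetched cache line fully. The array name and size are arbitrary.

      program cache_order
      parameter (n=1000)
      real a(n,n), s
      integer i, j
      s = 0.0
c     cache-friendly order: i (the leftmost, fastest-varying index
c     in Fortran) is the innermost loop, so successive iterations
c     touch adjacent memory locations in the same cache line
      do j = 1, n
         do i = 1, n
            a(i,j) = 1.0
         enddo
      enddo
      do j = 1, n
         do i = 1, n
            s = s + a(i,j)
         enddo
      enddo
      print *, s
      end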

Page 43: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

[Diagram: blocks of main memory mapping to a direct-mapped cache]

Direct-Mapped Cache

• Direct mapped cache: A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is direct mapping from any block address in memory to a single location in the cache.

Page 44: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

[Diagram: blocks of main memory mapping to a fully associative cache]

Fully Associative Cache

• Fully Associative Cache : A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be

associated with any entry in the cache.

Page 45: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

[Diagram: blocks of main memory mapping to a 2-way set-associative cache]

Set Associative Cache

• Set associative cache : The middle range of designs between direct mapped cache and fully associative cache is called set-associative cache. In a n-way set-associative cache a block from main memory can go into N (N > 1) locations in the cache.
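
For reference (standard cache-placement formulas, not on the original slides): a direct-mapped cache places a memory block at index = (block address) mod (number of cache blocks); an N-way set-associative cache places it in set = (block address) mod (number of sets), anywhere within that set.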

Page 46: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Cache-Related Terms

Least Recently Used (LRU): Cache replacement strategy for set associative caches. The cache block that is least recently used is replaced with a new block.

Random Replace: Cache replacement strategy for set associative caches. A cache block is randomly replaced.

Page 47: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Example: CRAY T3E Cache

• The CRAY T3E processors can execute– 2 floating point ops (1 add, 1 multiply) and– 2 integer/memory ops (includes 2 loads or 1 store)

• To help keep the processors busy– on-chip 8 KB direct-mapped data cache– on-chip 8 KB direct-mapped instruction cache– on-chip 96 KB 3-way set associative secondary

data cache with random replacement.

Page 48: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Putting the Pieces Together

• Recall:– Shared memory architectures:

• Uniform Memory Access (UMA): Symmetric Multi-Processors (SMP). Ex: Sun E10000

• Non-Uniform Memory Access (NUMA): Most common are Distributed Shared Memory (DSM), or cc-NUMA (cache coherent NUMA) systems. Ex: SGI Origin 2000

– Distributed memory architectures:• Massively Parallel Processor (MPP): tightly integrated

system, single system image. Ex: CRAY T3E, IBM SP• Clusters: commodity nodes connected by interconnect.

Example: Beowulf clusters.

Page 49: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Symmetric Multiprocessors (SMPs)

• SMPs connect processors to global shared memory using one of:– bus– crossbar

• Provides simple programming model, but has problems:– buses can become saturated– crossbar size must increase with # processors

• Problem grows with number of processors, limiting maximum size of SMPs

Page 50: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Shared Memory Programming

• Programming models are easier since message passing is not necessary. Techniques:– autoparallelization via compiler options– loop-level parallelism via compiler directives– OpenMP– pthreads

• More on programming models later.

Page 51: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Massively Parallel Processors

• Each processor has its own memory:– memory is not shared globally– adds another layer to memory hierarchy (remote

memory)

• Processor/memory nodes are connected by interconnect network– many possible topologies– processors must pass data via messages– communication overhead must be minimized

Page 52: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Communications Networks

• Custom– Many vendors have custom interconnects that

provide high performance for their MPP system– CRAY T3E interconnect is the fastest for MPPs:

lowest latency, highest bandwidth

• Commodity– Used in some MPPs and all clusters– Myrinet, Gigabit Ethernet, Fast Ethernet, etc.

Page 53: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Types of Interconnects

• Fully connected– not feasible

• Array and torus– Intel Paragon (2D array), CRAY T3E (3D torus)

• Crossbar– IBM SP (8 nodes)

• Hypercube– SGI Origin 2000 (hypercube), Meiko CS-2 (fat tree)

• Combinations of some of the above– IBM SP (crossbar & fully connected for 80 nodes)– IBM SP (fat tree for > 80 nodes)

Page 54: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Clusters

• Similar to MPPs– Commodity processors and memory

• Processor performance must be maximized

– Memory hierarchy includes remote memory– No shared memory--message passing

• Communication overhead must be minimized

• Different from MPPs– All commodity, including interconnect and OS– Multiple independent systems: more robust– Separate I/O systems

Page 55: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Cluster Pros and Cons

• Pros– Inexpensive– Fastest processors first– Potential for true parallel I/O– High availability

• Cons:– Less mature software (programming and system)– More difficult to manage (changing slowly)– Lower performance interconnects: not as scalable

to large numbers (but have almost caught up!)

Page 56: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Distributed Memory Programming

• Message passing is most efficient– MPI– MPI-2– Active/one-sided messages

• Vendor: SHMEM (T3E), LAPI (SP)• Coming in MPI-2

• Shared memory models can be implemented in software, but are not as efficient.

• More on programming models in the next section.

Page 57: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

“Distributed Shared Memory”

• More generally called cc-NUMA (cache coherent NUMA)

• Consists of m SMPs with n processors in a global address space:– Each processor has some local memory (SMP)– All processors can access all memory: extra

“directory” hardware on each SMP tracks values stored in all SMPs

– Hardware guarantees cache coherency– Access to memory on other SMPs slower (NUMA)

Page 58: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

“Distributed Shared Memory”

• Easier to build because of slower access to remote memory (no expensive bus/crossbar)

• Similar cache problems

• Code writers should be aware of data distribution

• Load balance: Minimize access of “far” memory

Page 59: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

DSM Rationale and Realities

• Rationale: combine the ease of SMP programming with the scalability of MPP programming, at much the same cost as an MPP

• Reality: NUMA introduces additional layers in the memory hierarchy relative to SMPs, so scalability is limited if programmed as an SMP

• Reality: Performance and high scalability require programming to the architecture.

Page 60: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Clustered SMPs

• Simpler than DSMs:– composed of nodes connected by network, like an

MPP or cluster– each node is an SMP– processors on one SMP do not share memory on

other SMPs (no directory hardware in SMP nodes)– communication between SMP nodes is by

message passing– Ex: IBM Power3-based SP systems

Page 61: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Clustered SMP Diagram

[Diagram: two bus-based SMP nodes (processors sharing memory over a bus) connected by a network]

Page 62: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Reasons for Clustered SMPs

• Natural extension of SMPs and clusters– SMPs offer great performance up to their

crossbar/bus limit– Connecting nodes is how memory and

performance are increased beyond SMP levels– Can scale to larger number of processors with less

scalable interconnect– Maximum performance:

• Optimize at SMP level - no communication overhead• Optimize at MPP level - fewer messages necessary for

same number of processors

Page 63: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Clustered SMP Drawbacks

• Clustering SMPs has drawbacks– No shared memory access over entire system,

unlike DSMs– Has other disadvantages of DSMs

• Extra layer in memory hierarchy• Performance requires more effort from programmer than

SMPs or MPPs

• However, clustered SMPs provide a means for obtaining very high performance and scalability

Page 64: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Clustered SMP: NPACI “Blue Horizon”

• IBM SP system:– Power3 processors: good peak performance (~1.5

Gflops)– better sustained performance (highly superscalar

and pipelined) than for many other processors– SMP nodes have 8 Power3 processors– System has 144 SMP nodes (1152 processors

total)

Page 65: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Programming Clustered SMPs

• NSF: Most users use only MPI, even for intra-node messages

• DoE: Most applications are being developed with MPI (between nodes) and OpenMP (intra-node)

• MPI+OpenMP programming is more complex, but might yield maximum performance

• Active messages and pthreads would theoretically give maximum performance

Page 66: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Data parallelism Task parallelism

Types of Parallelism

• Data parallelism: each processor performs the same task on different sets or sub-regions of data

• Task parallelism: each processor performs a different task

• Most parallel applications fall somewhere on the continuum between these two extremes.

Page 67: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Data vs. Task Parallelism

• Example of data parallelism:– In a bottling plant, we see several ‘processors’, or

bottle cappers, applying bottle caps concurrently on rows of bottles.

• Example of task parallelism:– In a restaurant kitchen, we see several chefs, or

‘processors’, working simultaneously on different parts of different meals.

– A good restaurant kitchen also demonstrates load balancing and synchronization--more on those topics later.

Page 68: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Example: Master-Worker Parallelism

• A common form of parallelism used in developing applications years ago (especially in PVM) was Master-Worker parallelism:– a single processor is responsible for distributing

data and collecting results (task parallelism)– all other processors perform same task on their

portion of data (data parallelism)

Page 69: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Parallel Programming Models

• The primary programming models in current use are– Data parallelism - operations are performed in

parallel on collections of data structures. A generalization of array operations.

– Message passing - processes possess local memory and communicate with other processes by sending and receiving messages.

– Shared memory - each processor has access to a single shared pool of memory

Page 70: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Parallel Programming Models

• Most parallelization efforts fall under the following categories.– Codes can be parallelized using message-passing

libraries such as MPI.– Codes can be parallelized using compiler

directives such as OpenMP.– Codes can be written in new parallel languages.

Page 71: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Programming Models and Architectures

• Natural mappings– data parallel → CM-2 (SIMD machine)

– message passing → IBM SP (MPP)

– shared memory → SGI Origin, Sun E10000

• Implemented mappings– HPF (a data parallel language) and MPI (a

message passing library) have been implemented on nearly all parallel machines

– OpenMP (a set of directives, etc. for shared memory programming) has been implemented on most shared memory systems.

Page 72: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

SPMD

• All current machines are MIMD systems (Multiple Instruction, Multiple Data) and are capable of either data parallelism or task parallelism.

• The primary paradigm for programming parallel machines is the SPMD paradigm: Single Program, Multiple Data– each processor runs a copy of same source code– enables data parallelism (through data

decomposition) and task parallelism (through intrinsic functions that return the processor ID)
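
A minimal SPMD sketch (illustrative, not from the original slides): every processor runs the same source, obtains its own ID with MPI_COMM_RANK, and branches on that ID, which is how one program can express both data and task parallelism.

      program spmd_example
      include 'mpif.h'
      integer ierr, myid, nprocs
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c     task parallelism: branch on the processor ID
      if (myid .eq. 0) then
         print *, 'rank 0 of', nprocs, ': coordinating'
      else
         print *, 'rank', myid, ': working on my block of the data'
      endif
      call MPI_FINALIZE(ierr)
      end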

Page 73: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

OpenMP - Shared Memory Standard

• OpenMP is a new standard for shared memory programming: SMPs and cc-NUMAs.– OpenMP provides a standard set of directives,

run-time library routines, and– environment variables for parallelizing code under

a shared memory model.– Very similar to Cray PVP autotasking directives,

but with much more functionality. (Cray now supports OpenMP.)

– See http://www.openmp.org for more information

Page 74: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Fortran 77:

      program add_arrays
      parameter (n=1000)
      real x(n),y(n),z(n)
      read(10) x,y,z
      do i=1,n
         x(i) = y(i) + z(i)
      enddo
      ...
      end

Fortran 77 + OpenMP:

      program add_arrays
      parameter (n=1000)
      real x(n),y(n),z(n)
      read(10) x,y,z
!$OMP PARALLEL DO
      do i=1,n
         x(i) = y(i) + z(i)
      enddo
      ...
      end

Highlighted directive specifies that loop is executed in parallel. Each processor executes a subset of the loop iterations.

OpenMP Example

Page 75: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

MPI - Message Passing Standard

• MPI has emerged as the standard for message passing in both C and Fortran programs. No longer need to know MPL, PVM, TCGMSG, etc.

• MPI is both large and small:– MPI is large, since it contains 125 functions which

give the programmer fine control over communications

– MPI is small, since message passing programs can be written using a core set of just six functions.
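
For reference (not listed on the slide), the six functions usually cited as this core set are MPI_INIT, MPI_FINALIZE, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, and MPI_RECV.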

Page 76: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

PE 0 calls MPI_SEND to pass the real variable x to PE 1. PE 1 calls MPI_RECV to receive the real variable y from PE 0.

      if (myid.eq.0) then
         call MPI_SEND(x,1,MPI_REAL,1,100,MPI_COMM_WORLD,ierr)
      endif

      if (myid.eq.1) then
         call MPI_RECV(y,1,MPI_REAL,0,100,MPI_COMM_WORLD,status,ierr)
      endif

MPI Examples - Send and Receive

MPI messages are two-way: they require a send and a matching receive:

Page 77: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

MPI Example - Global Operations

PE 6 collects the single (1) integer value n from all other processors and puts the sum (MPI_SUM) into allsum

call MPI_REDUCE(n,allsum,1,MPI_INTEGER,MPI_SUM,6, MPI_COMM_WORLD,ierr)

MPI also has global operations to broadcast and reduce (collect) information

PE 5 broadcasts the single (1) integer value n to all other processors

call MPI_BCAST(n,1,MPI_INTEGER,5, MPI_COMM_WORLD,ierr)

Page 78: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

MPI Implementations

• MPI is typically implemented on top of the highest performance native message passing library for every distributed memory machine.

• MPI is a natural model for distributed memory machines (MPPs, clusters)

• MPI offers higher performance on DSMs beyond the size of an individual SMP

• MPI is useful between SMPs that are clustered

• MPI can be implemented on shared memory machines

Page 79: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Extensions to MPI: MPI-2

• A standard for MPI-2 has been developed which extends the functionality of MPI. New features include:– One sided communications - eliminates the need

to post matching sends and receives. Similar in functionality to the shmem PUT and GET on the CRAY T3E (most systems have analogous library)

– Support for parallel I/O– Extended collective operations– No full implementation yet - it is difficult for

vendors

Page 80: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

MPI vs. OpenMP

• There is no single best approach to writing a parallel code. Each has pros and cons:– MPI - powerful, general, and universally available

message passing library which provides very fine control over communications, but forces the programmer to operate at a relatively low level of abstraction.

– OpenMP - conceptually simple approach for creating parallel codes on a shared memory machines, but not applicable to distributed memory platforms.

Page 81: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

MPI vs. OpenMP

• MPI is the most general (problem types) and portable (platforms, although not efficient for SMPs)

• The architecture and the problem type often make the decision for you.

Page 82: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Parallel Libraries

• Finally, there are parallel mathematics libraries that enable users to write (serial) codes, then call parallel solver routines:– ScaLAPACK is for solving dense linear systems of

equations, eigenvalue and least squares problems. Also see PLAPACK.

– PETSc is for solving linear and non-linear partial differential equations (includes various iterative solvers for sparse matrices).

– Many others: check NETLIB for complete survey:http://www.netlib.org

Page 83: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Hurdles in Parallel Computing

There are some hurdles in parallel computing:
– Scalar performance: Fast parallel codes require efficient use of the underlying scalar hardware
– Parallel algorithms: Not all scalar algorithms parallelize well; may need to rethink the problem
• Communications: Need to minimize the time spent doing communications
• Load balancing: All processors should do roughly the same amount of work
– Amdahl's Law: Fundamental limit on parallel computing

Page 84: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Scalar Performance

• Underlying every good parallel code is a good scalar code.

• If a code scales to 256 processors but only gets 1% of peak performance, it is still a bad parallel code.– Good news: Everything that you know about serial

computing will be useful in parallel computing!– Bad news: It is difficult to get good performance

out of the processors and memory used in parallel machines. Need to use cache effectively.

Page 85: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

[Chart (log-log): run time vs. number of processors for the serial code and the parallel code]

In this case, the parallel code achieves perfect scaling, but does not match the performance of the serial code until 32 processors are used

Serial Performance

Page 86: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

[Diagram: a simplified memory hierarchy - the CPU, a small and fast cache, and a big but slow main memory]

The data cache was designed with two key concepts in mind:

Spatial locality - cache is loaded an entire line (4-32 words) at a time to take advantage of the fact that if a location in memory is required, nearby locations will probably also be required

Temporal locality - once a word is loaded into cache it remains there until the cache line is needed to hold another word of data.

Use Cache Effectively

Page 87: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Non-Cache Issues

• There are other issues to consider to achieve good serial performance:– Strength reduction, e.g., replacing divisions

with multiplications-by-inverse– Evaluate and replace common sub-expressions– Pushing loops inside subroutines to minimize

subroutine call overhead– Force function inlining (compiler option)– Perform interprocedural analysis to eliminate

redundant operations (compiler option)

Page 88: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Parallel Algorithms

• The algorithm must be naturally parallel!– Certain serial algorithms do not parallelize well.

Developing a new parallel algorithm to replace a serial algorithm can be one of the most difficult tasks in parallel computing.

– Keep in mind that your parallel algorithm may involve additional work or a higher floating point operation count.

Page 89: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Parallel Algorithms

– Keep in mind that the algorithm should• need the minimum amount of communication (Monte

Carlo algorithms are excellent examples)• balance the load among the processors equally

– Fortunately, a lot of research has been done in parallel algorithms, particularly in the area of linear algebra. Don’t reinvent the wheel, take full advantage of the work done by others:

• use parallel libraries supplied by the vendor whenever possible!

• use ScaLAPACK, PETSc, etc. when applicable

Page 90: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

[Timeline diagrams: busy vs. idle time for PE 0 and PE 1 between synchronization points]

The figures show the timeline for parallel codes run on two processors. In both cases, the total amount of work done is the same, but in the second case the work is distributed more evenly between the two processors, resulting in a shorter time to solution.

Load Balancing

Page 91: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Communications

• Two key parameters of the communications network are– Latency: time required to initiate a message. This

is the critical parameter in fine grained codes, which require frequent interprocessor communications. Can be thought of as the time required to send a message of zero length.

– Bandwidth: steady-state rate at which data can be sent over the network. This is the critical parameter in coarse grained codes, which require infrequent communication of large amounts of data.
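
These two parameters are often combined into a simple cost model (an approximation added here for clarity, not on the original slide):

t(message) ≈ latency + (message size) / bandwidth

For example, with 10 microseconds of latency and 100 MB/s of bandwidth, an 8-byte message costs about 10 microseconds (latency dominated), while an 8 MB message costs about 80 milliseconds (bandwidth dominated).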

Page 92: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Latency and Bandwidth Example

• Bucket brigade: the old style of fighting fires in which the townspeople formed a line from the well to the fire and passed buckets of water down the line– latency - the delay until the first bucket arrives

at the fire– bandwidth - the rate at which buckets arrive at the

fire

Page 93: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Sequential: t = t(comp) + t(comm)
Overlapped: t = t(comp) + t(comm) - t(comp ∩ comm)

More on Communications

• Time spent performing communications is considered overhead. Try minimize the impact of communications:– minimize the effect of latency by combining large

numbers of small messages into small numbers of large messages.

– communications and computation do not have to be done sequentially, can often overlap communication and computations

Page 94: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

The following examples of “phoning home” illustrate the value of combining many small messages into a single larger one.

Many small messages:
• dial, “Hi mom”, hang up
• dial, “How are things?”, hang up
• dial, “in the U.S.?”, hang up
• dial... At this point many mothers would not pick up the next call.

One large message:
• dial, “Hi mom. How are things in the U.S.? Yak, yak...”, hang up

By transmitting a single large message, I only have to pay the price for the dialing latency once. I transmit more information in less time.

Combining Small Messages into Larger Ones
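
A sketch of the same idea in Fortran/MPI (illustrative; assumes MPI is initialized and that myid and status are set up as in the earlier send/receive example): pack three values into one buffer so the latency cost is paid once.

      real buf(3)
c     one message instead of three: the latency cost is paid once
      if (myid .eq. 0) then
         buf(1) = x
         buf(2) = y
         buf(3) = z
         call MPI_SEND(buf,3,MPI_REAL,1,200,MPI_COMM_WORLD,ierr)
      endif
      if (myid .eq. 1) then
         call MPI_RECV(buf,3,MPI_REAL,0,200,MPI_COMM_WORLD,status,ierr)
      endif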

Page 95: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

In the following example, a stencil operation is performed on a 10 x 10 array that has been distributed over two processors. Assume periodic boundary conditions.

[Figure: the 10 x 10 array split between PE0 and PE1; boundary elements require data from the neighboring processor, interior elements do not]

Stencil operation: y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)

• Initiate communications
• Perform computations on interior elements
• Wait till communications are finished
• Perform computations on boundary elements

Overlapping Communications and Computations
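
A sketch of these four steps using MPI's non-blocking calls (illustrative; assumes MPI is initialized, left and right hold neighbor ranks, bsend and brecv are boundary buffers of length n, and update_interior/update_boundary are hypothetical routines standing in for the stencil computation; only one direction of the halo exchange is shown):

      integer req(2), stats(MPI_STATUS_SIZE,2)
c     1. initiate communications (non-blocking send/receive)
      call MPI_IRECV(brecv,n,MPI_REAL,left,10,MPI_COMM_WORLD,req(1),ierr)
      call MPI_ISEND(bsend,n,MPI_REAL,right,10,MPI_COMM_WORLD,req(2),ierr)
c     2. perform computations on interior elements
      call update_interior(x,y)
c     3. wait till communications are finished
      call MPI_WAITALL(2,req,stats,ierr)
c     4. perform computations on boundary elements
      call update_boundary(x,y,brecv)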

Page 96: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:

tN = (fp/N + fs)t1 Effect of multiple processors on run time

S = 1/(fs + fp/N) Effect of multiple processors on speedup

where: fs = serial fraction of code, fp = parallel fraction of code = 1 - fs,

N = number of processors

Amdahl’s Law
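
A quick worked example (added for illustration): with fs = 0.01 (fp = 0.99) and N = 256 processors, S = 1/(0.01 + 0.99/256) ≈ 72, far below the ideal speedup of 256.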

Page 97: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

[Chart: speedup vs. number of processors (up to 250) for fp = 1.000, 0.999, 0.990, and 0.900]

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors

Illustration of Amdahl’s Law

Page 98: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.

[Chart: speedup vs. number of processors for fp = 0.99, comparing the Amdahl's Law prediction with reality]

Amdahl’s Law Vs. Reality

Page 99: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

More on Amdahl’s Law

• Amdahl’s Law can be generalized to any two processes with different speeds

• Ex.: Apply to f(processor) and f(memory):– The growing processor-memory performance gap

will undermine our efforts at achieving maximum possible speedup!

Page 100: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Generalized Amdahl’s Law

• Amdahl’s Law can be further generalized to handle an arbitrary number of processes of various speeds. (The fractions representing the processes must still sum to 1.)

• This is a weighted Harmonic mean. Application performance is limited by performance of the slowest component as much as it is determined by the fastest.

Ravg = 1 / [ Σ(i = 1 to N) fi / Ri ]

Page 101: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Gustafson’s Law

• Thus, Amdahl’s Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large.

• There is a way around this: increase the problem size– bigger problems mean bigger grids or more

particles: bigger arrays– number of serial operations generally remains

constant; number of parallel operations increases: parallel fraction increases
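
This observation is often quoted as Gustafson's scaled speedup (not written out on the original slide): S(scaled) = fs + N*fp = N - fs*(N - 1), which grows nearly linearly with N when the problem size grows with the machine. For fs = 0.01 and N = 256, S(scaled) ≈ 253.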

Page 102: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

The 1st Question to Ask Yourself Before You Parallelize Your Code

• Is it worth my time? – Do the CPU requirements justify parallelization?– Do I need a parallel machine in order to get

enough aggregate memory?– Will the code be used just once or will it be a major

production code?

• Your time is valuable, and it can be very time consuming to write, debug, and test a parallel code. The more time you spend writing a parallel code, the less time you have to spend doing your research.

Page 103: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

The 2nd Question to Ask Yourself Before You Parallelize Your Code

• How should I decompose my problem?
– Do the computations consist of a large number of small, independent problems (trajectories, parameter space studies, etc.)? If so, you may want to consider a scheme in which each processor runs the calculation for a different set of data (see the sketch below).
– Does each computation have large memory or CPU requirements? If so, you will probably have to break up a single problem across multiple processors.
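A minimal sketch of the first scheme (each processor handling its own independent cases) using MPI; run_case() and the cyclic assignment of cases are illustrative assumptions, not part of the original material:

/* Task-farm style decomposition: each MPI process independently handles
 * its own subset of the input cases (trajectories, parameter sets, ...). */
#include <stdio.h>
#include <mpi.h>

#define NCASES 1000

static void run_case(int id) { (void)id; /* user computation for one case */ }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* cyclic assignment: process 'rank' handles cases rank, rank+size, ... */
    for (int id = rank; id < NCASES; id += size)
        run_case(id);

    MPI_Finalize();
    return 0;
}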

Page 104: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Distributing the Data

• Decision on how to distribute the data should consider these issues:
– Load balancing: often implies an equal distribution of data, but more generally means an equal distribution of work
– Communications: want to minimize the impact of communications, taking into account both size and number of messages
– Physics: choice of distribution will depend on the processes that are being modeled in each direction

Page 105: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

A Data Distribution Example

[Figure: two alternative distributions of a 2D grid across processors]

A good distribution if the physics of the problem is the same in both directions. Minimizes the amount of data that must be communicated between processors.

If expensive global operations need to be carried out in the x-direction (e.g., FFTs), this is probably a better choice.
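A small sketch, assuming MPI, of how the two distributions above might be expressed as process grids; the use of MPI_Dims_create here is illustrative, not prescribed by the slides:

/* Let MPI suggest a balanced 2D process grid, and contrast it with a 1D
 * distribution that keeps each x-line whole (useful when global operations
 * such as FFTs run in the x-direction). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims2d[2] = {0, 0};          /* let MPI choose a balanced 2D grid */
    MPI_Dims_create(size, 2, dims2d);

    int dims1d[2] = {1, size};       /* 1D: do not split the x-direction  */

    if (rank == 0) {
        printf("2D distribution: %d x %d processes\n", dims2d[0], dims2d[1]);
        printf("1D distribution: %d x %d processes\n", dims1d[0], dims1d[1]);
    }

    MPI_Finalize();
    return 0;
}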

Page 106: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

A More Difficult Example

[Figure: a 2D grid with a shaded object covering part of the domain]

Imagine that we are doing a simulation in which more work is required for the grid points covering the shaded object.

Neither data distribution from the previous example will result in good load balancing.

May need to consider an irregular grid or a different data structure.

Page 107: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Choosing a Resource

• The following factors should be taken into account when choosing a resource:
– What is the granularity of my code?
– Are there any special hardware features that I need or can take advantage of?
– How many processors will the code be run on?
– What are my memory requirements?

• By carefully considering these points, you can make the right choice of computational platform.

Page 108: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Granularity is a measure of the amount of work done by each processor between synchronization events.

[Figure: execution timelines for PE 0 and PE 1 in a low-granularity application (frequent synchronization, little work between sync points) and a high-granularity application (long stretches of computation between sync points)]

Generally, latency is the critical parameter for low-granularity codes, while processor performance is the key factor for high-granularity applications.

Choosing a Resource: Granularity
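One way to gauge granularity in practice, sketched below under the assumption of an MPI code, is to time the work done between synchronization events against the synchronization itself; compute_step() is a hypothetical stand-in for the application's real work:

/* Estimate the work-to-synchronization time ratio over many timesteps. */
#include <stdio.h>
#include <mpi.h>

static void compute_step(void) { /* application work between sync points */ }

int main(int argc, char **argv)
{
    int rank;
    double t_work = 0.0, t_sync = 0.0, local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 100; step++) {
        double t0 = MPI_Wtime();
        compute_step();
        local += 1.0;                         /* stand-in partial result */
        double t1 = MPI_Wtime();
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);        /* the synchronization event */
        double t2 = MPI_Wtime();
        t_work += t1 - t0;
        t_sync += t2 - t1;
    }

    if (rank == 0)
        printf("work/sync time ratio: %.1f\n", t_work / t_sync);

    MPI_Finalize();
    return 0;
}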

Page 109: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Choosing a Resource: Special Hardware Features

• Various HPC platforms have different hardware features that your code may be able to take advantage of. Examples include:
– Hardware support for divide and square root operations (IBM SP)
– Parallel I/O file system (IBM SP)
– Data streams (CRAY T3E)
– Control over cache alignment (CRAY T3E)
– E-registers for bypassing the cache hierarchy (CRAY T3E)

Page 110: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Importance of Parallel Computing

• High performance computing has become almost synonymous with parallel computing.

• Parallel computing is necessary to solve big problems (high resolution, lots of timesteps, etc.) in science and engineering.

• Developing and maintaining efficient, scalable parallel applications is difficult. However, the payoff can be tremendous.

Page 111: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Importance of Parallel Computing

• Before jumping in, think about
– whether or not your code truly needs to be parallelized
– how to decompose your problem

• Then choose a programming model based on your problem and your available architecture.

• Take advantage of the resources that are available - compilers, libraries, debuggers, performance analyzers, etc. - to help you write efficient parallel code.

Page 112: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Useful References

• Hennessy, J. L. and Patterson, D. A. Computer Architecture: A Quantitative Approach.

• Patterson, D.A. and Hennessy, J.L., Computer Organization and Design: The Hardware/Software Interface.

• K. Dowd, High Performance Computing.

• D. Kuck, High Performance Computing. Oxford U. Press (New York) 1996.

• D. Culler and J. P. Singh, Parallel Computer Architecture.

Page 113: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Outline

• Preface

• What is High Performance Computing?

• Parallel Computing

• Distributed Computing, Grid Computing, and More

• Future Trends in HPC

Page 114: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Distributed Computing

• Concept has been used for two decades

• Basic idea: run a scheduler across systems to run processes on the least-used systems first
– Maximize utilization
– Minimize turnaround time

• Have to load executables and input files onto the selected resource
– Shared file system
– File transfers upon resource selection

Page 115: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Examples of Distributed Computing

• Workstation farms, Condor flocks, etc.
– Generally share a file system

• SETI@home, Entropia, etc.
– Only one source code; a central server copies the correct binary code and input data to each system

• Napster, Gnutella: file/data sharing

• NetSolve
– Runs a numerical kernel on any of multiple independent systems, much like a Grid solution

Page 116: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

SETI@home: Global Distributed Computing

• Running on 500,000 PCs, ~1000 CPU years per day
– 485,821 CPU years so far

• Sophisticated Data & Signal Processing Analysis

• Distributes Datasets from Arecibo Radio Telescope

Page 117: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Distributed vs. Parallel Computing

• Different
– Distributed computing executes independent (but possibly related) applications on different systems; jobs do not communicate with each other

– Parallel computing executes a single application across processors, distributing the work and/or data but allowing communication between processes

• Non-exclusive: can distribute parallel applications to parallel computing systems

Page 118: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Grid Computing

• Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals, in the absence of central control, omniscience, or trust relationships.

• Resources (HPC systems, visualization systems & displays, storage systems, sensors, instruments, people) are integrated via ‘middleware’ to facilitate use of all resources.

Page 119: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Why Grids?

• Resources have different functions, but multiple classes of resources are necessary for most interesting problems.

• Power of any single resource is small compared to aggregations of resources

• Network connectivity is increasing rapidly in bandwidth and availability

• Large problems require teamwork and computation

Page 120: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Network Bandwidth Growth

• Network vs. computer performance
– Computer speed doubles every 18 months
– Network speed doubles every 9 months
– Difference = order of magnitude per 5 years

• 1986 to 2000
– Computers: x 500
– Networks: x 340,000

• 2001 to 2010
– Computers: x 60
– Networks: x 4000

[Figure: Moore’s Law vs. storage improvements vs. optical (network) improvements. Graph from Scientific American (Jan. 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner Perkins Caufield & Byers.]

Page 121: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Grid Possibilities

• A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

• 1,000 physicists worldwide pool resources for petaflop analyses of petabytes of data

• Civil engineers collaborate to design, execute, & analyze shake table experiments

• Climate scientists visualize, annotate, & analyze terabyte simulation datasets

• An emergency response team couples real time data, weather model, population data

Page 122: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Some Grid Usage Models

• Distributed computing: job scheduling on Grid resources with secure, automated data transfer

• Workflow: synchronized scheduling and automated data transfer from one system to the next in a pipeline (e.g. HPC system to visualization lab to storage system)

• Coupled codes, with pieces running on different systems simultaneously

• Meta-applications: parallel apps spanning multiple systems

Page 123: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Grid Usage Models

• Some models are similar to models already being used, but are much simpler due to:
– single sign-on
– automatic process scheduling
– automated data transfers

• But Grids can encompass new resources like sensors and instruments, so new usage models will arise

Page 124: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Selected Major Grid Projects

• Access Grid (www.mcs.anl.gov/FL/accessgrid; DOE, NSF): Create & deploy group collaboration systems using commodity technologies

• BlueGrid (IBM): Grid testbed linking IBM laboratories

• DISCOM (www.cs.sandia.gov/discom; DOE Defense Programs): Create operational Grid providing access to resources at three U.S. DOE weapons laboratories

• DOE Science Grid (sciencegrid.org; DOE Office of Science): Create operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities

• Earth System Grid (ESG) (earthsystemgrid.org; DOE Office of Science): Delivery and analysis of large climate model datasets for the climate research community

• European Union (EU) DataGrid (eu-datagrid.org; European Union): Create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics

Page 125: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Selected Major Grid Projects

• EuroGrid, Grid Interoperability (GRIP) (eurogrid.org; European Union): Create technologies for remote access to supercomputer resources & simulation codes; in GRIP, integrate with Globus

• Fusion Collaboratory (fusiongrid.org; DOE Office of Science): Create a national computational collaboratory for fusion research

• Globus Project (globus.org; DARPA, DOE, NSF, NASA, Microsoft): Research on Grid technologies; development and support of the Globus Toolkit; application and deployment

• GridLab (gridlab.org; European Union): Grid technologies and applications

• GridPP (gridpp.ac.uk; U.K. eScience): Create & apply an operational grid within the U.K. for particle physics research

• Grid Research Integration Dev. & Support Center (grids-center.org; NSF): Integration, deployment, support of the NSF Middleware Infrastructure for research & education

Page 126: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Selected Major Grid Projects

• Grid Application Dev. Software (hipersoft.rice.edu/grads; NSF): Research into program development technologies for Grid applications

• Grid Physics Network (griphyn.org; NSF): Technology R&D for data analysis in physics experiments: ATLAS, CMS, LIGO, SDSS

• Information Power Grid (ipg.nasa.gov; NASA): Create and apply a production Grid for aerosciences and other NASA missions

• International Virtual Data Grid Laboratory (ivdgl.org; NSF): Create an international Data Grid to enable large-scale experimentation on Grid technologies & applications

• Network for Earthquake Eng. Simulation Grid (neesgrid.org; NSF): Create and apply a production Grid for earthquake engineering

• Particle Physics Data Grid (ppdg.net; DOE Science): Create and apply production Grids for data analysis in high energy and nuclear physics experiments

Page 127: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Selected Major Grid Projects

• TeraGrid (teragrid.org; NSF): U.S. science infrastructure linking four major resource sites at 40 Gb/s

• UK Grid Support Center (grid-support.ac.uk; U.K. eScience): Support center for Grid projects within the U.K.

• Unicore (BMBFT): Technologies for remote access to supercomputers

There are also many technology R&D projects: e.g., Globus, Condor, NetSolve, Ninf, NWS, etc.

Page 128: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Example Application Projects

• Earth Systems Grid: environment (US DOE)

• EU DataGrid: physics, environment, etc. (EU)

• EuroGrid: various (EU)

• Fusion Collaboratory (US DOE)

• GridLab: astrophysics, etc. (EU)

• Grid Physics Network (US NSF)

• MetaNEOS: numerical optimization (US NSF)

• NEESgrid: civil engineering (US NSF)

• Particle Physics Data Grid (US DOE)

Page 129: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Some Grid Requirements – Systems/Deployment Perspective

• Identity & authentication

• Authorization & policy

• Resource discovery

• Resource characterization

• Resource allocation

• (Co-)reservation, workflow

• Distributed algorithms

• Remote data access

• High-speed data transfer

• Performance guarantees

• Monitoring

• Adaptation

• Intrusion detection

• Resource management

• Accounting & payment

• Fault management

• System evolution

• Etc.

Page 130: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Some Grid Requirements –User Perspective

• Single allocation (or none needed)

• Single sign-on: authentication to any Grid resources authenticates for all others

• Single compute space: one scheduler for all Grid resources

• Single data space: can address files and data from any Grid resources

• Single development environment: Grid tools and libraries that work on all grid resources

Page 131: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

The Systems Challenges:Resource Sharing Mechanisms That…

• Address security and policy concerns of resource owners and users

• Are flexible enough to deal with many resource types and sharing modalities

• Scale to large numbers of resources, many participants, many program components

• Operate efficiently when dealing with large amounts of data & computation

Page 132: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

The Security Problem

• Resources being used may be extremely valuable & the problems being solved extremely sensitive

• Resources are often located in distinct administrative domains
– Each resource may have its own policies & procedures

• The set of resources used by a single computation may be large, dynamic, and/or unpredictable
– Not just client/server

• The security solution must be broadly available & applicable
– Standard, well-tested, well-understood protocols
– Integration with a wide variety of tools

Page 133: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

The Resource Management Problem

• Enabling secure, controlled remote access to computational resources and management of remote computation
– Authentication and authorization
– Resource discovery & characterization
– Reservation and allocation
– Computation monitoring and control

Page 134: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Grid Systems Technologies

• Systems and security problems are addressed by new protocols & services. E.g., Globus:
– Grid Security Infrastructure (GSI) for security
– Globus Metadata Directory Service (MDS) for discovery
– Globus Resource Allocation Manager (GRAM) protocol as a basic building block
• Resource brokering & co-allocation services
– GridFTP for data movement

Page 135: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

The Programming Problem

• How does a user develop robust, secure, long-lived applications for dynamic, heterogeneous Grids?

• Presumably need:
– Abstractions and models to add to the speed/robustness/etc. of development
– Tools to ease application development and diagnose common problems
– Code/tool sharing to allow reuse of code components developed by others

Page 136: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Grid Programming Technologies

• “Grid applications” are incredibly diverse (data, collaboration, computing, sensors, …)
– Seems unlikely there is one solution

• Most applications have been written “from scratch,” with or without Grid services

• Application-specific libraries have been shown to provide significant benefits

• No new language, programming model, etc., has yet emerged that transforms things
– But certainly still quite possible

Page 137: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Examples of GridProgramming Technologies

• MPICH-G2: Grid-enabled message passing

• CoG Kits, GridPort: Portal construction, based on N-tier architectures

• GDMP, Data Grid Tools, SRB: replica management, collection management

• Condor-G: simple workflow management

• Legion: object models for Grid computing

• Cactus: Grid-aware numerical solver framework
– Note tremendous variety, application focus

Page 138: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

MPICH-G2: A Grid-Enabled MPI

• A complete implementation of the Message Passing Interface (MPI) for heterogeneous, wide area environments
– Based on the Argonne MPICH implementation of MPI (Gropp and Lusk)

• Globus services for authentication, resource allocation, executable staging, output, etc.

• Programs run in wide area without change!

• See also: MetaMPI, PACX, STAMPI, MAGPIE

www.globus.org/mpi
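For reference, a generic MPI program of the kind MPICH-G2 can run unchanged across sites might look like the sketch below; the Grid-specific work happens at launch time (through Globus services for authentication, resource allocation, and staging), not in the source:

/* An ordinary MPI program: no Grid-specific calls are needed. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("process %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}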

Page 139: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Grid Events

• Global Grid Forum: working meeting
– Meets 3 times/year, alternates U.S.-Europe, with the July meeting as the major event

• HPDC: major academic conference
– HPDC-11 in Scotland with GGF-8, July 2002

• Other meetings include
– IPDPS, CCGrid, EuroGlobus, Globus Retreats

www.gridforum.org, www.hpdc.org

Page 140: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Useful References

• Book (Morgan Kaufmann)
– www.mkp.com/grids

• Perspective on Grids
– “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, IJSA, 2001
– www.globus.org/research/papers/anatomy.pdf

• All URLs in this section of the presentation, especially:
– www.gridforum.org, www.grids-center.org, www.globus.org

Page 141: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Outline

• Preface

• What is High Performance Computing?

• Parallel Computing

• Distributed Computing, Grid Computing, and More

• Future Trends in HPC

Page 142: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Value of Understanding Future Trends

• Monitoring and understanding future trends in HPC is important:
– users: applications should be written to be efficient on current and future architectures
– developers: tools should be written to be efficient on current and future architectures
– computing centers: system purchases are expensive and should have upgrade paths

Page 143: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

The Next Decade

• 1980s and 1990s:
– academic and government requirements strongly influenced parallel computing architectures
– academic influence was greatest in developing parallel computing software (for science & eng.)
– commercial influence grew steadily in late 1990s

• In the next decade:
– commercialization will become dominant in determining the architecture of systems
– academic/research innovations will continue to drive the development of HPC software

Page 144: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Commercialization

• Computing technologies (including HPC) are now propelled by profits, not sustained by subsidies
– Web servers, databases, transaction processing, and especially multimedia applications drive the need for computational performance.

– Most HPC systems are ‘scaled up’ commercial systems, with relatively little additional hardware and software compared to the commercial versions.

– It’s not engineering, it’s economics.

Page 145: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Processors and Nodes

• Easy predictions:
– microprocessor performance increases continue at ~60% per year (Moore’s Law) for 5+ years
– total migration to 64-bit microprocessors
– use of even more cache, more memory hierarchy
– increased emphasis on SMPs

• Tougher predictions:
– resurgence of vectors in microprocessors? Maybe
– dawn of multithreading in microprocessors? Yes

Page 146: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Building Fat Nodes: SMPs

• More processors are faster, of course
– SMPs are the simplest form of parallel systems
– efficient if not limited by memory bus contention: small numbers of processors

• Commercial market for high performance servers at low cost drives need for SMPs

• HPC market for highest performance, ease of programming drives development of SMPs

Page 147: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Building Fat Nodes: SMPs

• Trends are to:
– build bigger SMPs
– attempt to share memory across SMPs (cc-NUMA)

Page 148: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Resurgence of Vectors

• Vectors keep functional units busy
– vector registers are very fast
– vectors are more efficient for loops of any stride
– vectors are great for many science & eng. apps

• Possible resurgence of vectors
– SGI/Cray has built the SV1ex and is building the SV2
– NEC continues building (CMOS) parallel-vector, Cray-like systems
– Microprocessors (Pentium 4, G4) have added vector-like functionality for multimedia purposes

Page 149: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Dawn of Multithreading?

• Memory speed will always be a bottleneck

• Must overlap computation with memory accesses: tolerate latency
– requires an immense amount of parallelism
– requires processors with multiple streams and compilers that can define multiple threads

Page 150: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Multithreading Diagram

Page 151: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Multithreading

• Tera MTA was the first multithreaded HPC system
– scientific success, production failure
– MTA-2 will be delivered in a few months

• Multithreading will be implemented (in more limited fashion) in commercial processors.

Page 152: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Networks

• Commercial network bandwidth and latency approaching custom performance.

• Dramatic performance increases likely
– “the network is the computer” (Sun slogan)
– more companies, more competition
– no severe physical, economic limits

• Implications of faster networks
– more clusters
– collaborative, visual supercomputing
– Grid computing

Page 153: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Commodity Clusters

• Clusters provide some real advantages:
– computing power: leverage workstations and PCs
– high availability: replace one at a time
– inexpensive: leverage existing competitive market
– simple path to installing a parallel computing system

• Major disadvantages were robustness of hardware and software, but both have improved

• NCSA has huge clusters in production based on Pentium III and Itanium.

Page 154: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Clustering SMPs

• Inevitable (already here!):
– leverages SMP nodes effectively for the same reasons clusters leverage individual processors
– Commercial markets drive the need for SMPs

• Combine advantages of SMPs, clusters
– more powerful nodes through multiprocessing
– more powerful nodes -> more powerful cluster
– Interconnect scalability requirements reduced for the same number of processors

Page 155: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Continued Linux Growth in HPC

• Linux popularity growing due to price and availability of source code

• Major players now supporting Linux, esp. IBM

• Head start on Intel Itanium

Page 156: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Programming Tools

• However, programming tools will continue to lag behind hardware and OS capabilities:
– Researchers will continue to drive the need for the most powerful tools to create the most efficient applications on the largest systems

– Such technologies will look more like MPI than the Web… maybe worse due to multi-tiered clusters of SMPs (MPI + OpenMP; Active messages + threads?).

– Academia will continue to play a large role in HPC software development.
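A minimal sketch of the MPI + OpenMP style mentioned above for multi-tiered clusters of SMPs, assuming one MPI process per SMP node with OpenMP threads inside it; the loop body is arbitrary and used only for illustration:

/* Hybrid model: OpenMP threads within a node, MPI between nodes. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP parallel loop inside each MPI process */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (1.0 + i);

    /* MPI reduction across processes (nodes) */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}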

Page 157: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Grid Computing

• Parallelism will continue to grow in the form of
– SMPs
– clusters
– clusters of SMPs (and maybe DSMs)

• Grids provide the next level
– connect multiple computers into virtual systems
– Already here:
• IBM, other vendors supporting Globus
• SC2001 dominated by Grid technologies
• Many major government awards (>$100M in past year)

Page 158: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Emergence of Grids

• But Grids enable much more than apps running on multiple computers (which can be achieved with MPI alone)
– virtual operating system: provides a global workspace/address space via a single login
– automatically manages files, data, accounts, and security issues
– connects other resources (archival data facilities, instruments, devices) and people (collaborative environments)

Page 159: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Grids Are Inevitable

• Inevitable (at least in HPC):
– leverages the computational power of all available systems
– manages resources as a single system: easier for users
– provides the most flexible resource selection and management, load sharing
– researchers’ desire to solve bigger problems will always outpace performance increases of single systems; just as multiple processors are needed, ‘multiple multiprocessors’ will be deemed so

Page 160: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Grid-Enabled Software

• Commercial applications on single parallel systems and Grids will require that:
– underlying architectures must be invisible: no parallel computing expertise required
– usage must be simple
– development must not be too difficult

• Developments in ease-of-use will benefit scientists as users (not as developers)

• Web-based interfaces: transparent supercomputing (MPIRE, Meta-MEME, etc.).

Page 161: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Grid-Enabled Collaborative andVisual Supercomputing

• Commercial world demands:
– multimedia applications
– real-time data processing
– online transaction processing
– rapid prototyping and simulation in engineering, chemistry and biology
– interactive, remote collaboration
– 3D graphics, animation and virtual reality visualization

Page 162: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Grid-enabled Collaborative, Visual Supercomputing

• Academic world will leverage resulting Grids linking computing and visualization systems via high-speed networks:
– collaborative post-processing of data already here
– simulations will be visualized in 3D, virtual worlds in real-time
– such simulations can then be ‘steered’
– multiple scientists can participate in these visual simulations
– the ‘time to insight’ (SGI slogan) will be reduced

Page 163: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Web-based Grid Computing

• Web currently used mostly for content delivery

• Web servers on HPC systems can execute applications

• Web servers on Grids can launch applications, move/store/retrieve data, display visualizations, etc.

• NPACI HotPage already enables single sign-on to NPACI Grid Resources

Page 164: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Summary of Expectations

• HPC systems will grow in performance but probably change little in design (5-10 years):
– HPC systems will be larger versions of smaller commercial systems, mostly large SMPs and clusters of inexpensive nodes

– Some processors will exploit vectors, as well as more/larger caches.

– Best HPC systems will have been designed ‘top-down’ instead of ‘bottom-up’, but all will have been designed to make the ‘bottom’ profitable.

– Multithreading is the only likely, near-term major architectural change.

Page 165: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Summary of Expectations

• Using HPC systems will change much more:
– Grid computing will become widespread in HPC and in commercial computing
– Visual supercomputing and collaborative simulation will be commonplace.
– WWW interfaces to HPC resources will make transparent supercomputing commonplace.

• But programming the most powerful resources most effectively will remain difficult.

Page 166: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Caution

• Change is difficult to predict (and I am an astrophysicist, not an astrologer):
– The accuracy of linear extrapolation predictions degrades over long times (like weather forecasts)
– Entirely new ideas can change everything:

• WWW is an excellent example; Grid computing is probably the next

• Eventually, something truly different will replace CMOS technology (nanotechnology? molecular computing? DNA computing?)

Page 167: Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced

Introduction to High Performance Computing

Final Prediction

“The thing about change is that things will be different afterwards.”

Alan McMahon (Cornell University)