CS420/CSE 402/ECE 492 Introduction to Parallel Programming for Scientists and Engineers

Spring 2006

© 2006 David A. Padua

Additional Foils 0.i: Course organization

Instructor: David Padua, 4227 SC, padua@uiuc.edu, 3-4223. Office hours: by appointment.

T.A.: Predrag Tosic, XXXX Siebel Center, p-tosic@uiuc.edu, X-XXXX.

Textbook

Lectures

Some lecture foils will be required reading. These will be posted at:
http://www-courses.cs.uiuc.edu/~cs420/

Grading:
6-9 Machine Problems (MPs)/Homeworks: 50%
Midterm (Friday Mar 3): 25%
Final (Comprehensive): 25%

Graduate students registered for 1 unit (4 credits) must complete additional work (associated with each MP/Homework).

Additional Foils 0.ii: Topics

• Machine models.

• Parallel programming models.

• Language extensions to express parallelism: OpenMP (Fortran) and MPI (Fortran or C). If time allows: High-Performance Fortran, Linda, pC++, SplitC, UPC (Unified Parallel C), CAF (Co-array Fortran), HTA (Hierarchically Tiled Arrays).

• Issues in algorithm design:
  Parallelism
  Load balancing
  Communication
  Locality

• Algorithms:
  Linear algebra algorithms such as matrix multiplication and equation solvers.
  Symbolic algorithms such as sorting.
  N-body problems.
  Random number generators.
  Asynchronous algorithms.

• Program analysis and transformation:
  Dependence analysis
  Race conditions
  Deadlock detection

• Parallel program development and maintenance:
  Modularity
  Performance analysis and tuning
  Debugging

Additional Foils Chapter 1: Introduction

Parallelism

• The idea is simple: improve performance by performing two or more operations at the same time.

• Parallelism has been an important computer design strategy since the very first computers.

• It takes many complementary forms within conventional systems such as uniprocessor PCs and UNIX workstations:

  At the circuit level: adders and multipliers do not handle one digit at a time but operate on several digits at the same time. This design strategy was used even by Charles Babbage in his mechanical computer design of the 19th century.

  At the processor-design level: the execution of instructions and floating-point operations is usually pipelined. Several instructions can execute simultaneously.

  At the system level: computation and I/O can proceed simultaneously. This is why multiprogramming increases throughput.

• However, the design strategy of interest to us is to attain parallelism by using several processors or even several complete computers.

• Future PCs will be built with multicore chips. (Reading assignment: http://www.intel.com/business/bss/products/server/resource_center/multi-core.htm?ppc_cid=ggl|multicore_resrc_ctr|k46E|s)

• Multicore chips are made possible by Moore’s Law, named after a 1965 observation by Gordon E. Moore of Intel. It holds that “the number of elements in advanced integrated circuits doubles every year.”

• Another important reason for the development of parallel systems of the multicomputer variety is availability.

  “Having a computer shrivel up into an expensive doorstop can be a whole lot less traumatic if it’s not unique, but rather one of a herd. The herd should be able to accommodate spares, which can potentially be used to keep the work going; or if one chooses to configure sparelessly, the work that was done by the dear departed sibling can, potentially, be redistributed among the others.” In Search of Clusters. G. Pfister. Prentice Hall.

Applications

• Traditionally, highly parallel computers have been used for numerical simulations of complex systems such as the weather, mechanical devices, electronic circuits, manufacturing processes, chemical reactions, etc.

• “In part because of HPCC technologies, simulation has become recognized as the third paradigm of science, the first two being experimentation and theory. In some cases it is the only approach available for further advancing knowledge -- experiments may not be possible due to size (very big or very small), speed (very fast or very slow), distance (very far away), dangers to health and safety (toxic or explosive), or the economics of conducting the experiments. In simulations, mathematical models of physical phenomena are translated into computer software that specifies how calculations are performed using input data that may include both experimental data and estimated values of unknown parameters in the mathematical models. By repeatedly running the software using different data and different parameter values, an understanding of the phenomenon of interest emerges. The realism of these simulations and the speed with which they are produced affect the accuracy of this understanding and its usefulness in predicting change.”

  From an old document entitled “High Performance Computing and Communications: Foundation for America's Information Future”.

• Perhaps the most important government program in parallel computing today is the Advanced Simulation and Computing Program (ASC).

  (Reading assignment:
  http://www.llnl.gov/asc/overview/overview.html
  http://www.llnl.gov/asci/platforms/bluegenel/images/BGLbrocure.pdf)

  Its main objective is to accurately simulate nuclear weapons in order to verify safety, reliability, and performance of the US nuclear stockpile. Several highly-parallel computers (1000s of processors) from Intel, IBM, and SGI are now being used to develop these simulations.

• Commercial applications are also important today. Examples include: transaction processing systems, web servers, data mining, etc. These applications will probably become the main driving force behind parallel computing in the future.

• In this course, we will focus on numerical simulations due to their importance for scientists and engineers.

• As mentioned above, computer simulation is considered today as a third mode of scientific research. It complements experimentation and theoretical analysis.

• Furthermore, simulation is an important engineering tool that provides fast feedback on the quality and feasibility of new designs.

Additional Foils Chapter 2: Machine models

2.1 The Von Neumann computational model

Discussion taken from Almasi and Gottlieb: Highly Parallel Computing. Benjamin Cummings, 1988.

• Designed by John Von Neumann about fifty years ago.

• All widely used “conventional” machines follow this model. It is represented next:

[Figure: the Von Neumann machine. A MEMORY that holds instructions and data is connected to a PROCESSOR, which contains an ARITHMETIC UNIT (logic, registers) and a CONTROL unit with an instruction counter.]

• The machine’s essential features are:

1. A processor that performs instructions such as “add the contents of these two registers and put the result in that register”.

2. A memory that stores both the instructions and data of a program in cells having unique addresses.

3. A control scheme that fetches one instruction after another from the memory for execution by the processor, and shuttles data one word at a time between memory and processor.

For an instruction to be executed, several steps must be performed. For example:

1. Instruction Fetch and decode (IF). Bring the instruction from memory into the control unit and identify the type of instruction.

2. Read data (RD). Read data from memory.

3. Execution (EX). Execute the operation.

4. Write Back (WB). Write the results back.

• Notice that machines today usually are programmed in a high-level language containing statements such as

  A = B + C

  However, these statements are translated by a compiler into the machine instructions just mentioned. For example, the previous assignment statement would be translated into a sequence of the form:

  LD 1,B     (load B from memory into processor register 1)
  LD 2,C     (load C from memory into register 2)
  ADD 3,1,2  (add registers 1 and 2 and put the result into register 3)
  ST 3,A     (store register 3’s contents into variable A’s address in memory)

• It is said that the compiler creates a “virtual machine” with its own language and computational model.

• Virtual machines represented by conventional languages, such as Fortran 77 and C, also follow the Von Neumann model.
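• To make the fetch/decode/execute cycle and the LD/ADD/ST sequence above concrete, here is a minimal sketch of a toy Von Neumann machine in C. The Instr encoding and the opcode names are invented purely for illustration; they are not any real machine’s instruction set.

    /* A toy Von Neumann machine.  An array ("mem") models the memory,
     * a small array models the processor registers, and the loop below
     * plays the role of the control unit: fetch an instruction, decode
     * it, execute it, repeat.  Opcodes and encoding are hypothetical. */
    #include <stdio.h>

    enum { LD, ADD, ST, HALT };              /* hypothetical opcodes */
    typedef struct { int op, r, a, b; } Instr;

    int main(void) {
        int mem[16] = {0};                   /* memory: holds the data  */
        int reg[4]  = {0};                   /* processor registers     */
        mem[10] = 2; mem[11] = 3;            /* say B is at 10, C at 11 */

        Instr prog[] = {                     /* A = B + C, as above     */
            { LD,   1, 10, 0 },              /* LD 1,B                  */
            { LD,   2, 11, 0 },              /* LD 2,C                  */
            { ADD,  3,  1, 2 },              /* ADD 3,1,2               */
            { ST,   3, 12, 0 },              /* ST 3,A   (A is at 12)   */
            { HALT, 0,  0, 0 }
        };

        for (int ic = 0; prog[ic].op != HALT; ic++) {  /* IF: fetch/decode */
            Instr in = prog[ic];
            switch (in.op) {
            case LD:  reg[in.r] = mem[in.a];             break; /* RD */
            case ADD: reg[in.r] = reg[in.a] + reg[in.b]; break; /* EX */
            case ST:  mem[in.a] = reg[in.r];             break; /* WB */
            }
        }
        printf("A = %d\n", mem[12]);         /* prints: A = 5 */
        return 0;
    }

  On a real machine the program would of course live in the same memory as the data; keeping it in a separate prog array here just keeps the sketch short.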

2.2 Multicomputers

• The easiest way to get parallelism, given a collection of conventional computers, is to connect them:

• Each machine can proceed independently and communicate with the others via the interconnection network.

[Figure: a multicomputer. Several complete Von Neumann machines, each with its own memory, processor, and control unit, connected by an interconnect.]
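Since the machines share no memory, cooperation happens only through explicit messages over the interconnect. Here is a minimal sketch of this style in C using MPI (which we cover later in the course): process 0 sends one integer to process 1. The exact compile/run commands (e.g., mpicc, mpirun -np 2) vary by installation.

    /* Minimal message passing on a multicomputer, using MPI.  Each
     * process runs this same program on its own node with its own
     * memory; only the MPI calls move data between nodes. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d from process 0\n", value);
        }
        MPI_Finalize();
        return 0;
    }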

• There are two main classes of multicomputers: clusters and distributed-memory multiprocessors. They are quite similar, but the latter is considered a single computer and is sold as such.

  Furthermore, a cluster consists of a collection of interconnected whole computers (including I/O) used as a single, unified computing resource.

  Not all nodes of a distributed-memory multiprocessor (such as IBM’s SP-2) need have complete I/O resources.

• An example of a cluster is a web server:

[Figure: a web-server cluster. A dispatcher/router receives requests from the net and forwards each one to one of several complete server machines.]

• Another example was a workstation cluster at Fermilab, which consisted of about 400 Silicon Graphics and IBM workstations. The system was used to analyze accelerator events. Analyzing any one of those events has nothing to do with analyzing any of the others. Each machine runs a sequential program that analyzes one event at a time. By using several machines it is possible to analyze many events simultaneously.
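Such an “independent events” workload needs no communication at all once the events are divided up. A sketch of one simple division of labor, again in MPI; analyze_event and NEVENTS are hypothetical stand-ins for the real analysis code and event count:

    /* Fermilab-style independent events: with P processes, process r
     * analyzes events r, r+P, r+2P, ...  No messages are needed. */
    #include <mpi.h>

    #define NEVENTS 1000                 /* hypothetical event count */

    static void analyze_event(int e) {   /* stand-in for the real */
        (void)e;                         /* per-event analysis     */
    }

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each machine runs the same sequential analysis, just on a
         * disjoint subset of the events. */
        for (int e = rank; e < NEVENTS; e += size)
            analyze_event(e);

        MPI_Finalize();
        return 0;
    }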

2.3 Shared-memory multiprocessors

• The simplest form of a shared-memory multiprocessor is the symmetric multiprocessor (SMP). By symmetric we mean that each of the processors has exactly the same abilities. Therefore any processor can do anything: they all have equal access to every location in memory; they all can control every I/O device equally well; etc. In effect, from the point of view of each processor the rest of the machine looks the same, hence the term symmetric.

• Caches are an important component of SMPs. These will be discussed later.

[Figure: a symmetric multiprocessor. Several identical processors connected through an interconnect to a single shared memory and shared I/O (LAN, disks).]
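In contrast with the multicomputer sketch above, on an SMP the processors simply share the data. A minimal sketch using OpenMP (also covered later in the course); the arrays and loop are illustrative only. Compile with an OpenMP-capable compiler, e.g., gcc -fopenmp.

    /* Shared-memory parallelism on an SMP: all threads read and write
     * the same arrays directly; no messages are needed. */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

        /* The iterations are independent, so the runtime may divide
         * them among the processors of the SMP. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        printf("a[N-1] = %f, max threads = %d\n",
               a[N-1], omp_get_max_threads());
        return 0;
    }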

2.4 Other forms of parallelism

• As discussed above, there are other forms of parallelism that are widely used today. These usually coexist with the coarse-grain parallelism of multicomputers and multiprocessors:

• Pipelining of the control unit and/or arithmetic unit.

• Multiple functional units.

• Most microprocessors today take advantage of this type of parallelism.

[Figure: a processor whose arithmetic unit contains multiple functional units.]

• VLIW (Very Long Instruction Word) processors are an important class of multifunctional processors. The idea is that each instruction may involve several operations that are performed simultaneously. This parallelism is usually exploited by the compiler and not accessible to the high-level language programmer. However, the programmer can control this type of parallelism in assembly language.

[Figure: a multifunction (VLIW) processor. A register file and memory feed several functional units (LD/ST, FADD, FMUL, IALU, BRANCH); each long instruction word has one slot per functional unit.]

• Array processors. Multiple arithmetic units.

• Illiac IV is the earliest example of this type of machine. Each arithmetic unit (processing unit) of the Illiac IV was connected to four others to form a two-dimensional array (torus).

[Figure: an array processor. A single control unit with its instruction counter drives several arithmetic units, each with its own memory, logic, and registers.]
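The array-processor idea, one instruction driving many arithmetic units in lockstep, survives today inside single processors as single-instruction, multiple-data (SIMD) vector units. A small sketch using GCC’s vector extensions (a compiler-specific feature, not standard C; assumed available here): the single vector add below performs four additions at once.

    /* One "add" on a vector type compiles to a single SIMD instruction
     * operating on four data elements, much as one Illiac IV
     * instruction drove many arithmetic units at the same time. */
    #include <stdio.h>

    typedef int v4si __attribute__((vector_size(16)));  /* 4 ints */

    int main(void) {
        union { v4si v; int s[4]; } a = {{1, 2, 3, 4}},
                                    b = {{10, 20, 30, 40}}, c;
        c.v = a.v + b.v;            /* single instruction, multiple data */
        printf("%d %d %d %d\n",     /* prints: 11 22 33 44 */
               c.s[0], c.s[1], c.s[2], c.s[3]);
        return 0;
    }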

2.5 Flynn’s taxonomy

• Michael Flynn published a paper in 1972 in which he picked two characteristics of computers and tried all four possible combinations. Two stuck in everybody’s mind, and the others didn’t:

• SISD: Single Instruction, Single Data. Conventional Von Neumann computers.

• MIMD: Multiple Instruction, Multiple Data. Multicomputers and multiprocessors.

• SIMD: Single Instruction, Multiple Data. Array processors.

• MISD: Multiple Instruction, Single Data. Not used and perhaps not meaningful.
