
Introduction

Prof. Sivarama Dandamudi

School of Computer Science

Carleton University

Carleton University © S. Dandamudi 2

Why Parallel Systems?

Increased execution speed: the main motivation for many applications

Improved fault tolerance and reliability: multiple resources provide better fault tolerance and reliability

Expandability: problem scaleup, new applications

Carleton University © S. Dandamudi 3

Three Metrics

Speedup: problem size is fixed

Adding more processors should reduce time

Speedup on n processors, S(n), is

S(n) = (time on 1-processor system) / (time on n-processor system)

Linear speedup if S(n) = φ·n for some constant 0 < φ ≤ 1

Perfectly linear speedup if φ = 1

Carleton University © S. Dandamudi 4

Three Metrics (cont’d)

Scaleup: problem size increases with system size

Scaleup on n processors, C(n), is

C(n) = (small problem time on 1-processor system) / (larger problem time on n-processor system)

Linear scaleup if C(n) = φ for some constant 0 < φ ≤ 1; perfectly linear scaleup if φ = 1

Carleton University © S. Dandamudi 5

Three Metrics (cont’d)

Efficiency: defined as the average utilization of the n processors

Efficiency of n processors, E(n), is related to speedup:

E(n) = S(n) / n

If efficiency remains 1 as we add more processors, we get perfectly linear speedups
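A minimal sketch of how these three metrics are computed from measured run times; the timing values below are hypothetical, for illustration only.

def speedup(t1, tn):
    """S(n) = time on 1 processor / time on n processors."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E(n) = S(n) / n, the average utilization of the n processors."""
    return speedup(t1, tn) / n

def scaleup(t_small_1, t_large_n):
    """C(n) = small-problem time on 1 processor / larger-problem time on n processors."""
    return t_small_1 / t_large_n

# Example: 100 s serial run, 12.5 s on 10 processors.
print(speedup(100.0, 12.5))         # 8.0  -> sublinear speedup
print(efficiency(100.0, 12.5, 10))  # 0.8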

Carleton University © S. Dandamudi 6

Three Metrics (cont’d)

[Figure: speedup S(n) versus number of processors n, showing perfectly linear, linear, and sublinear speedup curves]

Carleton University © S. Dandamudi 7

Three Metrics (cont’d)

[Figure: scaleup C(n) versus problem size and number of processors, showing linear and sublinear scaleup curves]

Carleton University © S. Dandamudi 8

Example 1: QCD Problem

Quantum chromodynamics (QCD): predicts the mass of the proton…

Requires approximately 3 × 10^17 operations

On a Cray-1-like system with 100 Mflops, it takes about 100 years; still takes 10 years on a Pentium that is 10 times faster

IBM built a special system (GF-11): provides about 11 Gflops peak, roughly 7 Gflops sustained; the QCD problem takes only a year or so!

Carleton University © S. Dandamudi 9

Example 2: Factoring of RSA-129

Factoring of a 129-digit number (RSA-129) into two primes

RSA stands for the initials of its inventors: Ronald Rivest of MIT, Adi Shamir of the Weizmann Institute of Science, Israel, and Leonard Adleman of USC

In 1977 they announced a new cryptographic scheme, known as the RSA public-key system

"Cryptography is a never-ending struggle between code makers and code breakers." (Adi Shamir)

Carleton University © S. Dandamudi 10

Example 2: Factoring of RSA-129 (cont’d)

RSA-129 = 114381625757888867669235779976146612010218296721242362562561842935706935245733897830597123563958705058989075147599290026879543541

The two primes are
3490529510847650949147849619903898133417764638493387843990820577
x
32769132993266709549961988190834461413177642967992942539798288533

Solved in April 1994 (challenge posted in September 1993); needs over 10^17 operations; 0.1% of the Internet was used; 100% of the Internet would have solved the problem in about 3 hours

Carleton University © S. Dandamudi 11

Example 3: Toy Story

The production of Toy Story: 140,000 frames to render for the movie; requires about 10^17 operations

Same as the RSA-129 problem

Used dozens of SUN workstations, each about 10 MIPS

Carleton University © S. Dandamudi 12

Example 4: N-Body Problem

Simulates the motion of N particles under the influence of mutual force fields based on an inverse-square law

Materials science, astrophysics, etc. all require a variant of the N-body problem

Doubling the physical accuracy seems to require four times the computation; this has implications for scaleup

Carleton University © S. Dandamudi 13

Peak Speeds

Machine                  Mflops
Cray-1                      160
Cray C90                   1000
Cray T90                   2200
Pentium 4 (3070 MHz)       3070
Athlon XP 1900+            3200
Itanium                    6400
Sony PlayStation 2         6300

Carleton University © S. Dandamudi 14

Applications of Parallel Systems

A wide variety:

Scientific applications
Engineering applications
Database applications
Artificial intelligence
Real-time applications
Speech recognition
Image processing

Carleton University © S. Dandamudi 15

Applications of Parallel Systems (cont’d)

Scientific applications: weather forecasting, QCD, blood flow in the heart, molecular dynamics, evolution of galaxies

Most problems rely on basic operations of linear algebra: solving linear equations, finding eigenvalues

Carleton University © S. Dandamudi 16

Applications of Parallel Systems (cont’d)

Weather forecasting: needs to solve general circulation model equations

Computation is carried out on a 3-dimensional grid that partitions the atmosphere; a fourth dimension, time, is added as the number of time steps in the simulation

With a 270-mile grid, a 24-hour forecast needs about 100 billion data operations

Carleton University © S. Dandamudi 17

Applications of Parallel Systems (cont’d)

Weather forecasting (cont’d)

On a 100-Mflops processor, a 24-hour forecast takes about 1.5 hours

Want more accuracy? Halve the grid size (in all three dimensions) and halve the time step

This involves 2^4 = 16 times more processing; on a 100-Mflops processor, the 24-hour forecast now takes 24 hours!

To complete it in 1.5 hours, we need a 16-times-faster system (see the sketch below)
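A small back-of-the-envelope sketch of the scaling argument above; the grid and time-step factors come from the slide, and the variable names are just for illustration.

# Halving the grid spacing doubles the work in each of the 3 spatial
# dimensions, and halving the time step doubles the number of steps:
spatial_factor = 2 ** 3      # x, y, z resolution each doubled
time_factor = 2              # twice as many time steps
total_factor = spatial_factor * time_factor
print(total_factor)          # 16 -> 2^4 = 16 times more processing

# A 1.5-hour forecast becomes 1.5 * 16 = 24 hours on the same
# 100-Mflops processor, so a 16x faster system is needed.
print(1.5 * total_factor)    # 24.0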

Carleton University © S. Dandamudi 18

Applications of Parallel Systems (cont’d)

Engineering applications (VLSI design):

Circuit simulation: detailed simulation of electrical characteristics

Placement: automatic positioning of blocks on a chip

Wiring: automated placement of wires to form the desired connections; done after the placement step

Carleton University © S. Dandamudi 19

Applications of Parallel Systems (cont’d)

Artificial intelligence: production systems have three components

Working memory: stores the global database of facts or data about the modeled world

Production memory: stores the knowledge base, a set of production rules

Control system: chooses which rule should be applied

Carleton University © S. Dandamudi 20

Applications of Parallel Systems (cont’d)

Artificial Intelligence (cont’d)

Example

Working memory (facts):
f(curt,elaine)   m(elaine,john)
f(dan,pat)       m(marian,elaine)
f(pat,john)      m(peg,dan)
f(sam,larry)     m(peg,doug)
f(larry,dan)
f(larry,doug)

Production memory (rules):
1. gf(X,Z) <- f(X,Y), f(Y,Z)
2. gf(X,Z) <- f(X,Y), m(Y,Z)

Carleton University © S. Dandamudi 21

Applications of Parallel Systems (cont’d)

Query: A grandchild of Sam
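A minimal sketch of how this query could be evaluated against the facts and rules on the previous slide (reading gf(X,Z) as "X is a grandparent of Z"); in a production system, each rule's search could run on its own processor.

# Facts from the previous slide: f = father-of, m = mother-of.
f = {("curt", "elaine"), ("dan", "pat"), ("pat", "john"),
     ("sam", "larry"), ("larry", "dan"), ("larry", "doug")}
m = {("elaine", "john"), ("marian", "elaine"),
     ("peg", "dan"), ("peg", "doug")}

def rule1(x):
    # gf(X,Z) <- f(X,Y), f(Y,Z)
    return {z for (a, y) in f if a == x for (b, z) in f if b == y}

def rule2(x):
    # gf(X,Z) <- f(X,Y), m(Y,Z)
    return {z for (a, y) in f if a == x for (b, z) in m if b == y}

# Each rule could search working memory on a separate processor.
print(rule1("sam") | rule2("sam"))   # {'dan', 'doug'} (set order may vary)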

Carleton University © S. Dandamudi 22

Applications of Parallel Systems (cont’d)

Artificial Intelligence (cont’d)

Sources of parallelism: assign each production rule its own processor

Each can search the working memory for pertinent facts in parallel with all the other processors

AND-parallelism: synchronization is involved

OR-parallelism: abort the other searches if one is successful

Carleton University © S. Dandamudi 23

Applications of Parallel Systems (cont’d)

Database applications: the relational model uses tables to store data

Three basic operations:

Selection: selects tuples that satisfy a specified condition

Projection: selects certain specified columns

Join: combines data from two tables

Carleton University © S. Dandamudi 24

Applications of Parallel Systems (cont’d)

Database applications (cont’d)

Sources of parallelism

Within a single query (intra-query parallelism): horizontally partition relations into P fragments; each processor independently works on its own fragment (see the sketch after this list)

Among queries (inter-query parallelism): execute several queries concurrently and exploit common subqueries; improves query throughput
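A minimal sketch of intra-query parallelism under assumed data; the relation, selection predicate, and processor count below are hypothetical. The relation is horizontally partitioned into P fragments and each worker runs the same selection on its fragment.

from multiprocessing import Pool

# Hypothetical relation: (employee_id, salary) tuples.
employees = [(i, 30000 + (i * 7919) % 90000) for i in range(1_000_000)]

def select_fragment(fragment):
    # Selection: keep tuples whose salary exceeds 100,000.
    return [row for row in fragment if row[1] > 100_000]

def partition(relation, p):
    # Horizontal partitioning into p roughly equal fragments.
    n = len(relation)
    return [relation[i * n // p:(i + 1) * n // p] for i in range(p)]

if __name__ == "__main__":
    P = 4                                  # number of processors (assumed)
    with Pool(P) as pool:
        parts = pool.map(select_fragment, partition(employees, P))
    result = [row for part in parts for row in part]   # merge fragment results
    print(len(result))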

Carleton University © S. Dandamudi 25

Flynn’s Taxonomy

Based on the number of instruction and data streams

Single-Instruction, Single-Data stream (SISD): uniprocessor systems

Single-Instruction, Multiple-Data stream (SIMD): array processors

Multiple-Instruction, Single-Data stream (MISD): not really useful

Multiple-Instruction, Multiple-Data stream (MIMD)

Carleton University © S. Dandamudi 26

Flynn’s Taxonomy

MIMD systems: the most popular category

Shared-memory systems: also called multiprocessors; sometimes called tightly-coupled systems

Distributed-memory systems: also called multicomputers; sometimes called loosely-coupled systems

Carleton University © S. Dandamudi 27

Another Taxonomy

Parallel systems:

Synchronous: vector, array (SIMD), systolic

Asynchronous: MIMD, dataflow

Carleton University © S. Dandamudi 28

SIMD Architecture

Multiple actors, single script; SIMD comes in two flavours

Array processors: a large number of simple processors that operate on small amounts of data (bits, bytes, words, …); examples: Illiac IV, Burroughs BSP, Connection Machine CM-1

Vector processors: a small number (< 32) of powerful, pipelined processors that operate on large amounts of data (vectors); examples: Cray 1 (1976), Cray X/MP (mid 1980s, 4 processors), Cray Y/MP (1988, 8 processors), Cray 3 (1989, 16 processors)

29 © S. Dandamudi Carleton University

SIMD Architecture (cont’d)

Carleton University © S. Dandamudi 30

Shared-Memory MIMD Two major classes

UMA: uniform memory access; typically bus-based; limited to small systems

NUMA: non-uniform memory access; uses a MIN-based interconnection; expandable to medium system sizes

31 © S. Dandamudi Carleton University

Shared-Memory MIMD (cont’d)

32 © S. Dandamudi Carleton University

Shared-Memory MIMD (cont’d)

UMA

33 © S. Dandamudi Carleton University

Shared-Memory MIMD (cont’d)

NUMA

Carleton University © S. Dandamudi 34

Shared-Memory MIMD (cont’d)

Examples: SGI Power Onyx, Cray C90, IBM SP2 node

Symmetric Multi-Processing (SMP): a special case of shared-memory MIMD in which identical processors share the memory

Carleton University © S. Dandamudi 35

Distributed-Memory MIMD

Typically use message passing

Interconnection network is static (point-to-point network)

System scales up to thousands of nodes; the Intel TFLOPS system consists of 9000+ processors

Similar to cluster systems; a popular architecture for large parallel systems

36 © S. Dandamudi Carleton University

Distributed-Memory MIMD (cont’d)

37 © S. Dandamudi Carleton University

Distributed-Memory MIMD (cont’d)

38 © S. Dandamudi Carleton University

Hybrid Systems

Stanford DASH

Carleton University © S. Dandamudi 39

Distributed Shared Memory

Advantages of shared-memory MIMD

Relatively easy to program (global shared-memory view)

Fast communication and data sharing via the shared memory; no physical copying of data

Load distribution is not a problem

Carleton University © S. Dandamudi 40

Distributed Shared Memory (cont’d)

Disadvantages of shared-memory MIMD

Limited scalability: UMA can scale to 10s of processors, NUMA to 100s of processors

Expensive network

Carleton University © S. Dandamudi 41

Distributed Shared Memory (cont’d)

Advantages of distributed-memory MIMD

Good scalability: can scale to 1000s of processors

Inexpensive network (relatively speaking): uses a static interconnection

Cheaper to build: can use off-the-shelf components

Carleton University © S. Dandamudi 42

Distributed Shared Memory (cont’d)

Disadvantages of distributed-memory MIMD

Not easy to program: must deal with explicit message passing

Slow network; expensive data copying (done by message passing)

Load distribution is an issue

Carleton University © S. Dandamudi 43

Distributed Shared Memory (cont’d)

DSM is proposed to take advantage of both types of systems

Uses distributed-memory MIMD hardware

A software layer gives the appearance of shared memory to the programmer; a memory read, for example, is transparently converted to a message send and reply

Example: TreadMarks from Rice University

44 © S. Dandamudi Carleton University

Distributed Shared-Memory

Carleton University © S. Dandamudi 45

Cluster Systems

Built with commodity processors; cost-effective

Often use existing resources

Take advantage of the technological advances in commodity processors

Not tied to a single vendor; generic components mean competitive prices and multiple sources of supply

Carleton University © S. Dandamudi 46

Cluster Systems (cont’d)

Several types

Dedicated set of workstations (DoW): specifically built as a parallel system; represents one extreme

Dedicated to parallel workload (no serial workload)

Closely related to distributed-memory MIMD; communication network latency tends to be high (example: Fast Ethernet)

Carleton University © S. Dandamudi 47

Cluster Systems (cont’d)

Several types (cont’d)

Privately-owned workstations (PoW): represents the other extreme; all workstations are privately owned

The idea is to harness unused processor cycles for the parallel workload

Receives local jobs from owners; local jobs must receive higher priority

Workstations might be dynamically removed from the pool (owner shutting down/resetting the system, keyboard/mouse activity)

Carleton University © S. Dandamudi 48

Cluster Systems (cont’d)

Several types (cont’d)

Community-owned workstations (CoW): all workstations are community-owned (example: workstations in a graduate lab)

Sits in the middle between DoW and PoW

In PoW, a workstation could be removed when there is owner activity; not so in CoW systems, where the parallel workload continues to run

Resource management should take these differences into account

Carleton University © S. Dandamudi 49

Cluster Systems (cont’d)

Beowulf: uses PCs for parallel processing; closely resembles a DoW

Dedicated PCs (no scavenging of processor cycles); a private system network (not a shared one); open design using public-domain software and tools

Also known as PoPC (Pile of PCs)

Carleton University © S. Dandamudi 50

Cluster Systems (cont’d)

Beowulf (cont’d)

Advantages

Systems not tied to a single manufacturer: multiple vendors supply interchangeable components, which leads to better pricing

Technology tracking is straightforward

Incremental expandability: configure the system to match user needs; not limited to a fixed, vendor-configured system

Carleton University © S. Dandamudi 51

Cluster Systems (cont’d)

Beowulf (cont’d)

Example system

Linux NetworX designed the largest and most powerful Linux cluster, delivered to Lawrence Livermore National Lab (LLNL) in 2002

Uses 2,304 Intel 2.4 GHz Xeon processors

Peak rating: 11.2 Tflops; aggregate memory: 4.6 TB; aggregate disk space: 138.2 TB

Ranked 5th fastest supercomputer in the world

52 © S. Dandamudi Carleton University

ASCI System

Carleton University © S. Dandamudi 53

Dataflow Systems

Different from control flow: the availability of data determines which instruction should be executed

Example: A = (B + C) * (D – E)On von Neumann machine

Takes 6 instructions Sequential dependency

add B,C
store T1
sub D,E
store T2
mult T1,T2
store A

Carleton University © S. Dandamudi 54

Dataflow Systems (cont’d)

Addition and subtraction can be done in parallel (a small sketch follows below)

Dataflow supports fine-grain parallelism, which causes implementation problems

Hybrid architectures have been proposed to overcome these difficulties
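A minimal sketch of the idea (not a real dataflow machine): the two independent operations fire as soon as their operands are available, simulated here with hypothetical operand values and a thread pool.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical operand values, for illustration only.
B, C, D, E = 3, 4, 10, 2

with ThreadPoolExecutor() as pool:
    # B + C and D - E have no mutual dependency, so they can "fire" in parallel.
    t1 = pool.submit(lambda: B + C)
    t2 = pool.submit(lambda: D - E)
    A = t1.result() * t2.result()   # the multiply fires once both results arrive

print(A)   # (3 + 4) * (10 - 2) = 56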

Carleton University © S. Dandamudi 55

Dataflow Systems (cont’d)

Manchester dataflow machine

Data flows around the ring

The matching unit arranges data into sets of matched operands, which are released to fetch an instruction from the instruction store

Any new data produced is passed around the ring

Carleton University © S. Dandamudi 56

Interconnection Networks

A critical component in many parallel systems

Four design issues: mode of operation, control strategy, switching method, topology

Carleton University © S. Dandamudi 57

Interconnection Networks (cont’d)

Mode of operation: refers to the type of communication used

Asynchronous: typically used in MIMD

Synchronous: typically used in SIMD

Mixed

Carleton University © S. Dandamudi 58

Interconnection Networks (cont’d)

Control strategy: refers to how routing is achieved

Centralized control: can cause scalability problems; reliability is an issue; non-uniform node structure

Distributed control: uniform node structure; improved reliability; improved scalability

Carleton University © S. Dandamudi 59

Interconnection Networks (cont’d)

Switching method: two basic types

Circuit switching: a complete path is established; good for large data transmissions; causes problems at high loads

Packet switching: uses the store-and-forward method; good for short messages; high latency

Carleton University © S. Dandamudi 60

Interconnection Networks (cont’d)

Switching method (cont’d)

Wormhole routing: uses pipelined transmission

Avoids the buffer problem in packet switching

Complete (virtual) circuit is established as in circuit switching

Avoids some of the problems associated with circuit switching

Extensively used in current systems

Carleton University © S. Dandamudi 61

Interconnection Networks (cont’d)

Network topology

Static topology: links are passive and static; cannot be reconfigured to provide direct connections; used in distributed-memory MIMD systems

Dynamic topology: links can be reconfigured dynamically to provide direct connections; used in SIMD and shared-memory MIMD systems

Carleton University © S. Dandamudi 62

Interconnection Networks (cont’d)

Dynamic networks

Crossbar: very expensive; limited to small sizes

Shuffle-exchange: single-stage or multistage

The multistage form is also called a MIN (multistage interconnection network)

63 © S. Dandamudi Carleton University

Interconnection Networks (cont’d)

Crossbar network

Carleton University © S. Dandamudi 64

Interconnection Networks (cont’d)

Shuffle-exchange networks use a switching box, which gives the capability to dynamically reconfigure the network

Different types of switches: 2-function, 4-function

Connections between stages follow the shuffle pattern (perfect shuffle); think of how you mix a deck of cards (a sketch of the permutation follows below)
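A minimal sketch of the perfect-shuffle permutation on 2^k inputs (the parameter k is just for illustration): input i is connected to the output whose label is the k-bit address of i rotated left by one position.

def perfect_shuffle(i, k):
    """Rotate the k-bit address i left by one bit."""
    msb = (i >> (k - 1)) & 1                   # the bit that wraps around
    return ((i << 1) & ((1 << k) - 1)) | msb

# 8 inputs (k = 3): like splitting a deck in half and interleaving the halves.
print([perfect_shuffle(i, 3) for i in range(8)])
# [0, 2, 4, 6, 1, 3, 5, 7]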

65 © S. Dandamudi Carleton University

Interconnection Networks (cont’d)

[Figure: 2-function and 4-function switch settings, with inputs and outputs labelled 0 and 1]

66 © S. Dandamudi Carleton University

Interconnection Networks (cont’d)

Perfect shuffle

67 © S. Dandamudi Carleton University

Interconnection Networks (cont’d)

[Figure: single-stage shuffle-exchange network with buffers; all outputs and inputs are connected in this shuffle pattern]

68 © S. Dandamudi Carleton University

Interconnection Networks (cont’d)

MIN

69 © S. Dandamudi Carleton University

Interconnection Networks (cont’d)

70 © S. Dandamudi Carleton University

Interconnection Networks (cont’d)

IBM SP2 switch

Carleton University © S. Dandamudi 71

Interconnection Networks (cont’d)

Static interconnection networks

Complete connection: one extreme; high cost, low latency

Ring network: the other extreme; low cost, high latency

A variety of networks lie between these two extremes

Carleton University © S. Dandamudi 72

Interconnection Networks (cont’d)

Complete connection Ring Chordal ring

Carleton University © S. Dandamudi 73

Interconnection Networks (cont’d)

Tree networks

Carleton University © S. Dandamudi 74

Interconnection Networks (cont’d)

Carleton University © S. Dandamudi 75

Interconnection Networks (cont’d)

Hypercube networks

1-d 2-d 3-d
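A minimal sketch of how hypercube neighbours are found (the dimension d below is just for illustration): in a d-dimensional hypercube with 2^d nodes, two nodes are connected exactly when their binary labels differ in one bit.

def neighbours(node, d):
    """Neighbours of a node in a d-dimensional hypercube: flip each of the d bits."""
    return [node ^ (1 << bit) for bit in range(d)]

# 3-d hypercube (8 nodes): node 0 (000) connects to 1 (001), 2 (010), 4 (100).
print(neighbours(0, 3))   # [1, 2, 4]
print(neighbours(5, 3))   # [4, 7, 1]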

Carleton University © S. Dandamudi 76

Interconnection Networks (cont’d)

A hierarchical network

Carleton University © S. Dandamudi 77

Future Parallel Systems

Special-purpose systems

+ Very efficient

+ Relatively simple

- Narrow domain of applications

May be cost-effective, depending on the application

Carleton University © S. Dandamudi 78

Future Parallel Systems (cont’d)

General-purpose systems

+ Cost-effective

+ Wide range of applications

- Decreased speed

- Decreased hardware utilization

- Increased software requirements

Carleton University © S. Dandamudi 79

Future Parallel Systems (cont’d)

In favour of special-purpose systems, Harold Stone argues:

The major advantage of general-purpose systems is that they are economical due to their wide area of applicability

The economics of computer systems is changing rapidly because of VLSI, which makes special-purpose systems economically viable

Carleton University © S. Dandamudi 80

Future Parallel Systems (cont’d)

In favour of both types of systems, Gajski argues:

The problem space is constantly expanding

Special-purpose systems can only be designed to solve "mature" problems

There are always new applications for which no "standardized" solution exists; for these applications, general-purpose systems are useful

Carleton University © S. Dandamudi 81

Performance

Amdahl's law

Serial fraction of a program: α; parallel fraction: 1 - α

Execution time on n processors:

T(n) = α T(1) + (1 - α) T(1) / n

Speedup S(n) = T(1) / T(n) = n / (αn + (1 - α))   (Amdahl's law)
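A minimal sketch of Amdahl's law as reconstructed above; it reproduces the values in the table on the next slide.

def amdahl_speedup(n, alpha):
    """S(n) = n / (alpha * n + (1 - alpha)), where alpha is the serial fraction."""
    return n / (alpha * n + (1 - alpha))

for n in (10, 20, 30, 40, 50, 100):
    print(n, [round(amdahl_speedup(n, a), 2) for a in (0.01, 0.10, 0.25)])
# 10 [9.17, 5.26, 3.08]  ...  100 [50.25, 9.17, 3.88]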

Carleton University © S. Dandamudi 82

Performance (cont’d)

n      α = 1%   α = 10%   α = 25%
10       9.17      5.26      3.08
20      16.81      6.90      3.48
30      23.26      7.69      3.64
40      28.76      8.16      3.72
50      33.56      8.47      3.77
100     50.25      9.17      3.88

Carleton University © S. Dandamudi 83

Performance (cont’d)

Gustafson-Barsis law

Obtained a speedup of 1000 on a 1024-node nCUBE/10; for the problem, the serial fraction α ranged from 0.4% to 0.8%; won the Gordon Bell prize in 1988

Amdahl's law predicts a speedup of only 201 to 112, because it assumes that (1 - α) is independent of n

Instead, the problem scales up with the system: T(1) = α + (1 - α)n and T(n) = α + (1 - α) = 1

Speedup S(n) = n - α(n - 1)
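A minimal sketch of the scaled-speedup formula reconstructed above, alongside Amdahl's prediction for the nCUBE/10 case; it reproduces the table on the next slide.

def gustafson_speedup(n, alpha):
    """Scaled speedup: S(n) = n - alpha * (n - 1), alpha = serial fraction."""
    return n - alpha * (n - 1)

def amdahl_speedup(n, alpha):
    return n / (alpha * n + (1 - alpha))

print(round(gustafson_speedup(1024, 0.004)))  # ~1020 (fixed-time, scaled problem)
print(round(amdahl_speedup(1024, 0.004)))     # ~201  (fixed-size problem)
for n in (10, 50, 100):
    print(n, [round(gustafson_speedup(n, a), 2) for a in (0.01, 0.10, 0.25)])
# 10 [9.91, 9.1, 7.75]  ...  100 [99.01, 90.1, 75.25]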

Carleton University © S. Dandamudi 84

Performance (cont’d)

n      α = 1%   α = 10%   α = 25%
10       9.91      9.1       7.75
20      19.81     18.1      15.25
30      29.71     27.1      22.75
40      39.61     36.1      30.25
50      49.51     45.1      37.75
100     99.01     90.1      75.25