Chap. 4 Part 1
CIS*3090 Fall 2016
Fall 2016 CIS*3090 Parallel Programming 1
Part 2 of textbook: Parallel Abstractions
How can we think about conducting computations in parallel before getting down to coding?
“Abstraction”: higher-level concepts than program code, with details omitted
Covered by ch 4 “First Steps toward par. prog.” and ch 5 “Scalable Algorithmic Techniques”
Authors’ “parallel pseudocode” for specifying par. algos “without biasing toward a programming language”
Two basic ways to organize parallel computations
“How are we going to put all these processors to work on this problem?” “What can we find for them to do?”
Based on analyzing either the data or the process (aka task, in the sense of steps to be taken)
Essence of data parallel (DP)
Apply same operations at once to many different data items
Ultimate example SIMD instructions
Master/workers typical DP pattern
provided the workers are all doing same operations (on portions of the data set)
DP scales by increasing no. of workers (each processing less data)
DP wins by applying parallelism to instances of data that can be worked on simultaneously
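A minimal data-parallel sketch in Python (illustrative only, not Peril-L and not from the text): a master hands portions of the data set to a pool of workers, and every worker applies the same operation to its items. Scaling up means adding workers, each handling less data.

```python
from concurrent.futures import ThreadPoolExecutor

def scale(x):
    # The *same* operation, applied to many different data items at once
    return 2 * x

data = list(range(8))

# Master/workers: the pool's workers each process a portion of the data set
with ThreadPoolExecutor(max_workers=4) as pool:
    result = list(pool.map(scale, data))

print(result)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

`Executor.map` preserves input order, so the result reads exactly as if a sequential loop had run — the parallelism is invisible in the output, which is the point of DP.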
Essence of task parallel (TP)
Tasks (processes, threads) specialized to different stages of calculation through which all data instances pass
Pipeline typical TP pattern
TP scales by increasing no. of stages (each performing fewer operations)
TP wins on basis of increasing throughput by applying parallelism to the subtasks
Example: Red Cross blood drive in Peter Clark Hall
Problem: taking blood donations from large no. of people without being too time-consuming
“Data set” = the donors
“Processors” = RC personnel & volunteers
[Diagram: donor flow — 1a, 1b → Queue 1 → 1c, 1d → Queue 2 → 2a–e → 3]
Operations per donor
1) Screening:
a) Identify & check file
b) Test sample for iron & blood type
c) Take temperature & blood pressure
d) Answer health questions
2) Donation:
a) Lie down
b) Sanitize arm
c) Stick needle
d) Collect blood (obeying qty & time limits)
e) Remove needle
3) Recovery:
a) Rest and snack
• Where’s the TP? • Where’s the DP?
[Diagram: donor flow — 1a, 1b → Queue 1 → 1c, 1d → Queue 2 → 2a–e → 3]
Pseudocode for expressing parallel algorithms
Authors’ invention “Peril-L”
Represents additional parallel constructs on top of conventional pseudocode
Conceptually targets CTA, so can distinguish local vs. non-local memory refs.
Will later see how easy to translate into certain parallel programming languages
Peril-L keywords & features
forall: parallel fork/join
Looks like loop!
Consider as spawning N threads in lieu of one thread doing N iterations
“index” variable has separate value in each thread
for i=1..3 vs. forall (i in (1..3))
how many i variables?
big difference between iterations & threads!
some iteration could influence later one(s)
iterations can have fixed/var qty; how about threads??
What details about the threads’ execution are we (intentionally) leaving unspecified at this level of abstraction?
how to spawn/fork & join (pthreads on cores, cluster processes, or what?)
no. of processors (P)
distribution of T threads to P processors when P<T (aka “oversubscribed”)
T/P threads per processor, executing concurrently (but not truly in parallel), or
choosing P threads from the pool of T, executing in parallel, repeating till all T have executed
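A sketch of forall in Python (illustrative, not Peril-L): instead of one thread iterating with a single i, we spawn one thread per index value, each with its own i, then join them all.

```python
import threading

results = {}

def body(i):
    # Each thread gets its OWN value of i -- the forall "index" variable
    # is per-thread, unlike the single i of a sequential for loop
    results[i] = i * i

# forall (i in (0..2)): fork one thread per index value ...
threads = [threading.Thread(target=body, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
# ... then join: no code after the forall runs until all threads finish
for t in threads:
    t.join()

print(sorted(results.items()))  # [(0, 0), (1, 1), (2, 4)]
```

Note what this sketch deliberately pins down that Peril-L leaves open: here the threads are OS threads (one per index), whereas the abstraction says nothing about how T threads map onto P processors.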
[Diagram: sequential for loop has one i variable; forall spawns three threads, each with its own i]
Inter-thread synchronization
exclusive: denotes critical section
implicit mutex
barrier: where all threads “check in”, then all continue
For this to work, all threads have to be “active” even if P<T
Suspend a thread that’s reached the barrier and run another one; continue till all arrive, then wake all
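Both sync constructs have direct Python counterparts, so here is a sketch (illustrative, not Peril-L): a `Lock` plays the role of exclusive, and `threading.Barrier` is a check-in point that no thread passes until all P have arrived.

```python
import threading

P = 4
lock = threading.Lock()          # plays the role of Peril-L's "exclusive"
barrier = threading.Barrier(P)   # all threads check in, then all continue
counter = 0
totals = []

def worker(tid):
    global counter
    with lock:                   # exclusive: critical section, implicit mutex
        counter += 1
    barrier.wait()               # blocks until all P threads have arrived
    # Past the barrier, every thread is guaranteed all P increments happened
    totals.append(counter)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(totals)  # [4, 4, 4, 4]
```

This also illustrates the P<T point from the slide: `Barrier.wait` suspends each arriving thread, letting the scheduler run others, and wakes them all once the last one arrives.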
Local vs. global variables
Local if declared inside forall block
Per-thread copies, not visible to other threads or outside block
Global (underlined) if declared outside
Underlining indicates λ (non-local) latency on access!
All arrays start with 0 index
Global memory conventions
Concurrent reads of the same variable are OK; concurrent writes are serialized in some order (last one wins)
But which write ends up last is non-deterministic!
If you don’t like that, insert explicit sync (exclusive)
Models worst case that happens with real HW
Forces you to pay attention to that and deal with it explicitly at program level
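A sketch of both conventions in Python (illustrative, not Peril-L): an unsynchronized write to a shared variable is "last one wins" with a non-deterministic winner, while a read-modify-write protected by a lock (the exclusive fix) gives a deterministic result.

```python
import threading

winner = [None]
total = 0
lock = threading.Lock()

def write_then_add(tid):
    global total
    # Unsynchronized concurrent writes: serialized by the runtime,
    # last writer wins -- but WHICH thread is last varies run to run
    winner[0] = tid
    # A read-modify-write needs explicit sync: Peril-L's "exclusive"
    with lock:
        total += tid

threads = [threading.Thread(target=write_then_add, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(winner[0] in range(4))  # True, but its exact value is non-deterministic
print(total)                  # 6 == 0+1+2+3, deterministic thanks to the lock
```

A correct program must not depend on the value of `winner[0]` — exactly the "pay attention and deal with it explicitly" discipline the model forces.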
Accessing global memory
2 methods, you choose, you pay:
Just reference a global variable in the pseudocode
Pays lambda penalty on each access!
Careful to use “exclusive” to ensure consistency!
“Localize” some/all global data via explicit call to localize() pseudo-function
Pays lambda penalty one time
Localization convention (p93 code sample)
int allData[n];                            Global data structure for n qty. data
forall (threadID in (0..P-1))              Spawn P threads
{
    int size = n/P;                        Compute size of the local allocation
    int locData[size] = localize(allData[]);
    ...
}
In Peril-L pseudocode, represents programmer’s choice to pay lambda penalty once (=λ∙size) per thread for global access
After that, locData[i] is fast access
What does it mean? How does it work?
Inside localize() pseudo-func.
Is a “local copy” actually made?
Conceptually “no”
locData is like an alias for that thread’s portion of allData
What about the mismatch between locData’s size and allData’s?
localize() automagically maps the local array onto the thread’s portion of the global array
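A sketch of the localization convention in Python (illustrative, and a deliberate simplification: here "localize" takes a slice *copy*, whereas Peril-L's locData is conceptually an alias, not a copy). Each thread localizes its n/P-item portion, works on it locally, and writes its results back only to the portion it owns — the owner-computes style.

```python
import threading

P = 4
n = 16
allData = list(range(n))     # the "global" data structure, n items
results = [None] * P

def worker(tid):
    size = n // P            # size of this thread's local allocation
    lo = tid * size
    # "localize": pay the transfer cost once, up front, for the whole portion
    locData = allData[lo:lo + size]
    # All subsequent accesses are fast, purely local operations
    locData = [x * x for x in locData]
    # Owner computes: write back only to the portion this thread owns,
    # so no lock on the whole global structure is needed
    allData[lo:lo + size] = locData
    results[tid] = sum(locData)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(allData[:4])  # [0, 1, 4, 9]
```

Because the threads' portions are disjoint, their writes cannot interfere — which is exactly why the convention lets you skip exclusive access to `allData` as a whole.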
Can I call localize()?
This is pseudocode, not real library call!
Represents mechanism used to access global data on your platform in λ time
SMP: main mem → L1 cache auto transfer
non-SM: message from another node
Reading from localized data is fast
SMP: from L1 cache
non-SM: from local node’s memory
Writing to localized data
Because it’s an alias, corresponding global data also changes (in principle)
SMP: cache coherency HW auto-updates main memory (and other L1 caches)
non-SMP: requires sending message
But localized write is fast
SMP: changes only L1 cache (initially)
non-SMP: changes node’s local memory
Who pays lambda for writing?
Convention is that reader of global data will be charged for the sync cost
SMP: reflects lazy update of main mem. with relaxed consistency model and MESI protocol
non-SMP: only one message per reader needs to be sent
Careful writing localized data!
Updates affect corresponding global data
If no intention of inter-thread communication, no problem
Writes by multiple threads not interfering with each other’s data
Otherwise, opens up possible corruption, data races between reading/writing threads
Must use some sync mechanism (shown later)
“Owner Computes” style (p94)
Promoted by localization convention
Lets thread take ownership of portion of data set
Avoid requirement for exclusive lock access to entire global data structure by partitioning data among threads
Localize() is smart!
Forces programmer to explicitly recognize and plan how to manage biggest problem of parallel computing:
Memory bandwidth bottleneck
Can manage at algo. level with Peril-L convention
Another magical pseudo-function:
mySize(global data set, my index)
When data doesn’t divide evenly into P chunks
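One way mySize() could behave (a hypothetical implementation — the convention shown here, giving the first n mod P threads one extra item, is a common choice, but Peril-L leaves the details unspecified):

```python
def my_size(n, P, tid):
    # Hypothetical mySize(): how many of n items thread tid owns when n
    # doesn't divide evenly among P threads -- the first (n % P) threads
    # each take one extra item
    return n // P + (1 if tid < n % P else 0)

P = 4
n = 10                       # 10 items don't divide evenly into 4 chunks
sizes = [my_size(n, P, tid) for tid in range(P)]

print(sizes)       # [3, 3, 2, 2]
print(sum(sizes))  # 10 -- every item is owned by exactly one thread
```

Whatever the convention, the sizes must sum to n with no item owned twice — that invariant is what makes owner-computes partitioning safe.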
Global memory & CTA
As observed before, CTA doesn’t have “global mem” (GM) per se
Conceptually, dispersed to one or more processors, in their respective local mems.
To get at it, you have to make a non-local mem ref. via the relevant processor
Looks like a “shell game”
first, you have GM (pseudocode); then, you don’t (CTA); then, you (may) have it again (multicores/SMP)!
Big picture: 3 layers
Top layer = Peril-L pseudocode
Programmer’s view is having both global & local memory available
Middle layer = CTA model, like a VM
Doesn’t have global mem, but can simulate it
Level where we conduct algo. performance estimation—O() complexity and lambda cost
Bottom layer = physical computer
global mem may (multicore/SMP) or may not (cluster) be available, in latter case can be simulated by messaging
Benefits of layered approach
Allows complete disconnection of a parallel algo. from a particular HW platform, while still capturing key property of the platforms: non-local memory latency
Makes any pseudocoded algo. portable among wide variety of platforms
Summary
Building on a generalized model of parallel processors,
started to define a pseudocode targeted for describing parallel algos
in a HW-agnostic way
that still recognizes the HW issues which affect parallel performance!