Chap. 4 Part 1
CIS*3090 Fall 2016
Fall 2016 CIS*3090 Parallel Programming 1
Part 2 of textbook: Parallel Abstractions
How can we think about conducting computations in parallel before getting down to coding?
“Abstraction”: higher-level concepts than program code, with details omitted
Covered by ch 4 “First Steps toward par. prog.” and ch 5 “Scalable Algorithmic Techniques”
Authors’ “parallel pseudocode” for specifying par. algos “without biasing toward a programming language”
Two basic ways to organize parallel computations
“How are we going to put all these processors to work on this problem?” “What can we find for them to do?”
Based on analyzing either the data or the process (aka task, in the sense of steps to be taken)
Essence of data parallel (DP)
Apply same operations at once to many different data items
Ultimate example SIMD instructions
Master/workers typical DP pattern
provided the workers are all doing same operations (on portions of the data set)
DP scales by increasing no. of workers (each processing less data)
DP wins by applying parallelism to instances of data that can be worked on simultaneously
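A minimal data-parallel sketch in Python (illustrative only, not Peril-L and not from the text): a master hands portions of the data set to a pool of workers, and every worker applies the same operation to its items. Scaling up means adding workers, each handling less data.

```python
from concurrent.futures import ThreadPoolExecutor

def scale(x):
    # The *same* operation, applied to many different data items at once
    return 2 * x

data = list(range(8))

# Master/workers: the pool's workers each process a portion of the data set
with ThreadPoolExecutor(max_workers=4) as pool:
    result = list(pool.map(scale, data))

print(result)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

`Executor.map` preserves input order, so the result reads exactly as if a sequential loop had run — the parallelism is invisible in the output, which is the point of DP.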
Essence of task parallel (TP)
Tasks (processes, threads) specialized to different stages of calculation through which all data instances pass
Pipeline typical TP pattern
TP scales by increasing no. of stages (each performing fewer operations)
TP wins on basis of increasing throughput by applying parallelism to the subtasks
Example: Red Cross blood drive in Peter Clark Hall
Problem: taking blood donations from large no. of people without being too time-consuming
“Data set” = the donors
“Processors” = RC personnel & volunteers
[Diagram: donor flow — 1a, 1b → Queue 1 → 1c, 1d → Queue 2 → 2a–e → 3]
Operations per donor
1) Screening:
a) Identify & check file
b) Test sample for iron & blood type
c) Take temperature & blood pressure
d) Answer health questions
2) Donation:
a) Lie down
b) Sanitize arm
c) Stick needle
d) Collect blood (obeying qty & time limits)
e) Remove needle
3) Recovery:
a) Rest and snack
• Where’s the TP? • Where’s the DP?
[Diagram: donor flow — 1a, 1b → Queue 1 → 1c, 1d → Queue 2 → 2a–e → 3]
Pseudocode for expressing parallel algorithms
Authors’ invention “Peril-L”
Represents additional parallel constructs on top of conventional pseudocode
Conceptually targets CTA, so can distinguish local vs. non-local memory refs.
Will later see how easy to translate into certain parallel programming languages
Peril-L keywords & features
forall: parallel fork/join
Looks like loop!
Consider as spawning N threads in lieu of one thread doing N iterations
“index” variable has separate value in each thread
for i=1..3 vs. forall (i in (1..3))
how many i variables?
big difference between iterations & threads!
some iteration could influence later one(s)
iterations can have fixed/var qty; how about threads??
What details about the threads’ execution are we (intentionally) leaving unspecified at this level of abstraction?
how to spawn/fork & join (pthreads on cores, cluster processes, or what?)
no. of processors (P)
distribution of T threads to P processors when P<T (aka “oversubscribed”)
T/P threads per processor, executing concurrently (but not truly in parallel), or
choosing P threads from the pool of T, executing in parallel, repeating till all T have executed
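A sketch of forall in Python (illustrative, not Peril-L): instead of one thread iterating with a single i, we spawn one thread per index value, each with its own i, then join them all.

```python
import threading

results = {}

def body(i):
    # Each thread gets its OWN value of i -- the forall "index" variable
    # is per-thread, unlike the single i of a sequential for loop
    results[i] = i * i

# forall (i in (0..2)): fork one thread per index value ...
threads = [threading.Thread(target=body, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
# ... then join: no code after the forall runs until all threads finish
for t in threads:
    t.join()

print(sorted(results.items()))  # [(0, 0), (1, 1), (2, 4)]
```

Note what this sketch deliberately pins down that Peril-L leaves open: here the threads are OS threads (one per index), whereas the abstraction says nothing about how T threads map onto P processors.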
[Diagram: sequential for loop has one i variable; forall spawns three threads, each with its own i]
Inter-thread synchronization
exclusive: denotes critical section
implicit mutex
barrier: where all threads “check in”, then all continue
For this to work, all threads have to be “active” even if P<T
Suspend a thread that’s reached the barrier and run another one; continue till all arrive, then wake all
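Both sync constructs have direct Python counterparts, so here is a sketch (illustrative, not Peril-L): a `Lock` plays the role of exclusive, and `threading.Barrier` is a check-in point that no thread passes until all P have arrived.

```python
import threading

P = 4
lock = threading.Lock()          # plays the role of Peril-L's "exclusive"
barrier = threading.Barrier(P)   # all threads check in, then all continue
counter = 0
totals = []

def worker(tid):
    global counter
    with lock:                   # exclusive: critical section, implicit mutex
        counter += 1
    barrier.wait()               # blocks until all P threads have arrived
    # Past the barrier, every thread is guaranteed all P increments happened
    totals.append(counter)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(totals)  # [4, 4, 4, 4]
```

This also illustrates the P<T point from the slide: `Barrier.wait` suspends each arriving thread, letting the scheduler run others, and wakes them all once the last one arrives.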
Local vs. global variables
Local if declared inside forall block
Per-thread copies, not visible to other threads or outside block
Global (underlined) if declared outside
Underlining indicates λ (non-local) latency on access!
All arrays start with 0 index
Global memory conventions
Concurrent reads of the same variable are OK; concurrent writes are serialized in some order (last one wins)
But which write ends up last is non-deterministic!
If you don’t like that, insert explicit sync (exclusive)
Models worst case that happens with real HW
Forces you to pay attention to that and deal with it explicitly at program level
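A sketch of both conventions in Python (illustrative, not Peril-L): an unsynchronized write to a shared variable is "last one wins" with a non-deterministic winner, while a read-modify-write protected by a lock (the exclusive fix) gives a deterministic result.

```python
import threading

winner = [None]
total = 0
lock = threading.Lock()

def write_then_add(tid):
    global total
    # Unsynchronized concurrent writes: serialized by the runtime,
    # last writer wins -- but WHICH thread is last varies run to run
    winner[0] = tid
    # A read-modify-write needs explicit sync: Peril-L's "exclusive"
    with lock:
        total += tid

threads = [threading.Thread(target=write_then_add, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(winner[0] in range(4))  # True, but its exact value is non-deterministic
print(total)                  # 6 == 0+1+2+3, deterministic thanks to the lock
```

A correct program must not depend on the value of `winner[0]` — exactly the "pay attention and deal with it explicitly" discipline the model forces.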
Accessing global memory
2 methods, you choose, you pay:
Just reference a global variable in the pseudocode
Pays lambda penalty on each access!
Careful to use “exclusive” to ensure consistency!
“Localize” some/all global data via explicit call to localize() pseudo-function
Pays lambda penalty one time
Localization convention (p93 code sample)
int allData[n];                            Global data structure for n qty. data
forall (threadID in (0..P-1))              Spawn P threads
{
    int size = n/P;                        Compute size of the local allocation
    int locData[size] = localize(allData[]);
    ...
}
In Peril-L pseudocode, represents programmer’s choice to pay lambda penalty once (=λ∙size) per thread for global access
After that, locData[i] is fast access
What does it mean? How does it work?
Inside localize() pseudo-func.
Is a “local copy” actually made?
Conceptually “no”
locData is like an alias for that thread’s portion of allData
What about the mismatch between locData’s size and allData’s?
localize() automagically maps the local array onto the thread’s portion of the global array
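A sketch of the localization convention in Python (illustrative, and a deliberate simplification: here "localize" takes a slice *copy*, whereas Peril-L's locData is conceptually an alias, not a copy). Each thread localizes its n/P-item portion, works on it locally, and writes its results back only to the portion it owns — the owner-computes style.

```python
import threading

P = 4
n = 16
allData = list(range(n))     # the "global" data structure, n items
results = [None] * P

def worker(tid):
    size = n // P            # size of this thread's local allocation
    lo = tid * size
    # "localize": pay the transfer cost once, up front, for the whole portion
    locData = allData[lo:lo + size]
    # All subsequent accesses are fast, purely local operations
    locData = [x * x for x in locData]
    # Owner computes: write back only to the portion this thread owns,
    # so no lock on the whole global structure is needed
    allData[lo:lo + size] = locData
    results[tid] = sum(locData)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(allData[:4])  # [0, 1, 4, 9]
```

Because the threads' portions are disjoint, their writes cannot interfere — which is exactly why the convention lets you skip exclusive access to `allData` as a whole.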
Can I call localize()?
This is pseudocode, not real library call!
Represents mechanism used to access global data on your platform in λ time
SMP: main mem → L1 cache auto transfer
non-SM: message from another node
Reading from localized data is fast
SMP: from L1 cache
non-SM: from local node’s memory
Writing to localized data
Because it’s an alias, corresponding global data also changes (in principle)
SMP: cache coherency HW auto-updates main memory (and other L1 caches)
non-SMP: requires sending message
But localized write is fast
SMP: changes only L1 cache (initially)
non-SMP: changes node’s local memory
Who pays lambda for writing?
Convention is that reader of global data will be charged for the sync cost
SMP: reflects lazy update of main mem. with relaxed consistency model and MESI protocol
non-SMP: only one message per reader needs to be sent
Careful writing localized data!
Updates affect corresponding global data
If no intention of inter-thread communication, no problem
Writes by multiple threads not interfering with each other’s data
Otherwise, opens up possible corruption, data races between reading/writing threads
Must use some sync mechanism (shown later)
“Owner Computes” style (p94)
Promoted by localization convention
Lets thread take ownership of portion of data set
Avoid requirement for exclusive lock access to entire global data structure by partitioning data among threads
Localize() is smart!
Forces programmer to explicitly recognize and plan how to manage biggest problem of parallel computing:
Memory bandwidth bottleneck
Can manage at algo. level with Peril-L convention
Another magical pseudo-function:
mySize(global data set, my index)
When data doesn’t divide evenly into P chunks
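One way mySize() could behave (a hypothetical implementation — the convention shown here, giving the first n mod P threads one extra item, is a common choice, but Peril-L leaves the details unspecified):

```python
def my_size(n, P, tid):
    # Hypothetical mySize(): how many of n items thread tid owns when n
    # doesn't divide evenly among P threads -- the first (n % P) threads
    # each take one extra item
    return n // P + (1 if tid < n % P else 0)

P = 4
n = 10                       # 10 items don't divide evenly into 4 chunks
sizes = [my_size(n, P, tid) for tid in range(P)]

print(sizes)       # [3, 3, 2, 2]
print(sum(sizes))  # 10 -- every item is owned by exactly one thread
```

Whatever the convention, the sizes must sum to n with no item owned twice — that invariant is what makes owner-computes partitioning safe.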
Global memory & CTA
As observed before, CTA doesn’t have “global mem” (GM) per se
Conceptually, dispersed to one or more processors, in their respective local mems.
To get at it, you have to make a non-local mem ref. via the relevant processor
Looks like a “shell game”
first, you have GM (pseudocode); then, you don’t (CTA); then, you (may) have it again (multicores/SMP)!
Big picture: 3 layers
Top layer = Peril-L pseudocode
Programmer’s view is having both global & local memory available
Middle layer = CTA model, like a VM
Doesn’t have global mem, but can simulate it
Level where we conduct algo. performance estimation—O() complexity and lambda cost
Bottom layer = physical computer
global mem may (multicore/SMP) or may not (cluster) be available, in latter case can be simulated by messaging
Benefits of layered approach
Allows complete disconnection of a parallel algo. from a particular HW platform, while still capturing key property of the platforms: non-local memory latency
Makes any pseudocoded algo. portable among wide variety of platforms
Summary
Building on a generalized model of parallel processors,
started to define a pseudocode targeted for describing parallel algos
in a HW-agnostic way
that still recognizes the HW issues which affect parallel performance!