
CHAPTER 2 – PARALLEL PROGRAMMING BACKGROUND

"By the end of this chapter, you should have obtained a basic understanding of how modern processors execute parallel programs & understand some rules of thumb for scaling performance of parallel applications."

TRADITIONAL PARALLEL MODELS

Serial model: SISD
Parallel models: SIMD, MIMD, MISD*

S = Single, M = Multiple, I = Instruction, D = Data

VOCABULARY & NOTATION (2.1)

Task vs. data: tasks are instructions that operate on data, modifying it or creating new data

Parallel computation: multiple tasks that must be coordinated and managed

Dependencies
  Data: a task requires data from another task
  Control: events/steps must be ordered (e.g., I/O)


TASK MANAGEMENT – FORK-JOIN

Fork: split control flow, creating new control flow

Join: control flows are synchronized & merged
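A minimal fork-join sketch in C++ (an added illustration, not from the slides; std::thread stands in for whatever tasking system is used):

    #include <iostream>
    #include <thread>

    void child_work() {
        std::cout << "child flow of control running\n";
    }

    int main() {
        std::thread child(child_work);                     // fork: create a new flow of control
        std::cout << "parent flow of control running\n";   // parent continues in parallel
        child.join();                                      // join: synchronize & merge the flows
        return 0;
    }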

GRAPHICAL NOTATION – FIG. 2.1

Symbols for: task, data, fork, join, dependency

STRATEGIES (2.2)

Data parallelism
Best strategy for scalable parallelism – parallelism that grows as the data set/problem size grows
Split the data set over a set of processors, with a task processing each subset
More data → more tasks
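A small data-parallel sketch in C++ (illustrative only; the worker count and chunking are assumptions): the data set is split into subsets, one task sums each subset, so a larger data set means more or larger tasks.

    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> data(1000, 1);              // the data set
        unsigned workers = 4;                        // assumed number of workers
        std::vector<long long> partial(workers, 0);  // one result slot per task
        std::vector<std::thread> tasks;

        size_t chunk = data.size() / workers;
        for (unsigned w = 0; w < workers; ++w) {
            size_t begin = w * chunk;
            size_t end = (w + 1 == workers) ? data.size() : begin + chunk;
            // Each task processes only its own subset of the data.
            tasks.emplace_back([&, begin, end, w] {
                partial[w] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
            });
        }
        for (auto& t : tasks) t.join();              // wait for all tasks

        long long sum = std::accumulate(partial.begin(), partial.end(), 0LL);
        std::cout << "sum = " << sum << "\n";        // prints 1000
        return 0;
    }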

STRATEGIES

Control parallelism (functional decomposition)
Different program functions run in parallel
Not scalable – the best speedup is a constant factor
As data grows, the parallelism doesn't
May have less/no overhead

REGULAR VS. IRREGULAR PARALLELISM

Regular: tasks are similar, with predictable dependencies (e.g., matrix multiplication)
Irregular: tasks differ in ways that create unpredictable dependencies (e.g., a chess program)
Many problems contain combinations of both

HARDWARE MECHANISMS (2.3)

The two most important:
Thread parallelism: implemented in HW using a separate flow of control for each worker – supports regular, irregular, and functional decomposition
Vector parallelism: implemented in HW with one flow of control operating on multiple data elements – supports regular and some irregular parallelism

BRANCH STATEMENTS

Detrimental to:
• Parallelism
• Locality
• Pipelining
HOW?

MASKING – ALL CONTROL PATHS ARE EXECUTED, BUT UNWANTED RESULTS ARE MASKED OUT (NOT USED)

if (a & 1)
    a = 3*a + 1;
else
    a = a / 2;

The if/else contains a branch statement.
Masking: both parts are executed, and only one result is kept.

p = (a & 1);
t = 3*a + 1;
if (p) a = t;
t = a / 2;
if (!p) a = t;

No branches – a single flow of control.
Masking works as if the code were written this way.

MACHINE MODELS (2.4)

Core
Functional units
Registers
Cache memory – multiple levels

CACHE MEMORY

Blocks (cache lines) – the amount fetched at once
Bandwidth – the amount transferred concurrently
Latency – the time to complete a transfer
Cache coherence – consistency among copies

VIRTUAL MEMORY

Memory system = disk storage + chip memory
Allows programs larger than physical memory to run
Allows multiprocessing
Swaps pages
HW maps logical to physical addresses
Data locality is important to efficiency
Page faults, thrashing

PARALLEL MEMORY ACCESS

Caches (multiple)
NUMA – Non-Uniform Memory Access
PRAM – Parallel Random Access Memory model
  Theoretical model
  Assumes uniform memory access times

PERFORMANCE ISSUES (2.4.2)

Data locality
Choose code segments that fit in cache
Design to use data in close proximity
Align data with cache lines (blocks)
Dynamic grain size – a good strategy
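An illustrative C++ example of using data in close proximity (the sizes and layout are assumptions): a row-major matrix summed row by row touches consecutive addresses that share cache lines, while summing column by column jumps across cache lines.

    #include <vector>

    // m holds a rows x cols matrix in row-major order (C/C++ layout).
    double sum_good_locality(const std::vector<double>& m, int rows, int cols) {
        double s = 0.0;
        for (int i = 0; i < rows; ++i)        // walk each row...
            for (int j = 0; j < cols; ++j)    // ...touching consecutive elements
                s += m[i * cols + j];         // adjacent addresses share cache lines
        return s;
    }

    double sum_poor_locality(const std::vector<double>& m, int rows, int cols) {
        double s = 0.0;
        for (int j = 0; j < cols; ++j)        // walking a column instead
            for (int i = 0; i < rows; ++i)
                s += m[i * cols + j];         // each access jumps 'cols' elements ahead
        return s;
    }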

PERFORMANCE ISSUES

Arithmetic intensity
A large number of on-chip compute operations for every off-chip memory access
Otherwise, communication overhead is high
Related: grain size
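One common way to quantify this (added here as a sketch; the numbers are made up):

    Arithmetic intensity = compute operations / bytes transferred off-chip

    Example: c[i] = a[i] + b[i] with double elements performs 1 addition while
    moving about 24 bytes (load a[i], load b[i], store c[i]), an intensity of
    roughly 1/24 – memory traffic, not arithmetic, dominates such a loop.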

FLYNN'S CATEGORIES

Serial model
  SISD
Parallel models
  SIMD – array processor, vector processor
  MIMD – heterogeneous computer, clusters
  MISD* – not useful

CLASSIFICATION BASED ON MEMORY

Shared memory – each processor accesses a common memory
  Access issues
  No message passing
  Each processor usually has a small local memory
Distributed memory – each processor has its own local memory
  Explicit messages are sent between processors

EVOLUTION (2.4.4)

GPU – graphics accelerators, now general purpose
Offload – running computations on an accelerator, GPU, or co-processor (not the regular CPU)
Heterogeneous – different kinds of hardware working together
Host processor – handles distribution, I/O, etc.

PERFORMANCE (2.5)

Various interpretations of performance:
Reduce the total time for a computation → latency
Increase the rate at which a series of results is computed → throughput
Reduce power consumption
*Performance target

LATENCY & THROUGHPUT (2.5.1)

Latency: time to complete a task
Throughput: rate at which tasks are completed – units per time (e.g., jobs per hour)


OMIT SECTION 2.5.3 – POWER

SPEEDUP & EFFICIENCY (2.5.2)

Speedup: Sp = T1 / Tp
  T1: time to complete on 1 processor
  Tp: time to complete on p processors
REMEMBER: "time" means number of instructions

Efficiency: E = Sp / p = T1 / (p * Tp)
E = 1 is "perfect"

Linear speedup occurs when an algorithm runs p times faster on p processors
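A worked example with made-up numbers:

    T1 = 120, p = 4, Tp = 40
    Sp = T1 / Tp = 120 / 40 = 3
    E  = Sp / p  = 3 / 4    = 0.75
    Linear speedup would give Sp = 4 and E = 1.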

SUPERLINEAR SPEEDUP (P.57)

Efficiency > 1
Very rare
Often due to HW variations (e.g., cache effects)
Working in parallel may eliminate some work that the serial version must do

AMDAHL & GUSTAFSON-BARSIS (2.5.4, 2.5.5)

Amdahl: speedup is limited by the amount of serial work required
Gustafson-Barsis: as the problem size grows, parallel work grows faster than serial work, so speedup increases

See examples
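For reference, the usual statements of the two laws (added here; f is the serial fraction of the original work, s is the serial fraction of time observed on the parallel run):

    Amdahl:            Sp <= 1 / (f + (1 - f)/p), so even as p → ∞, Sp <= 1/f
    Gustafson-Barsis:  Sp = p - s*(p - 1) = s + p*(1 - s)

    Example (made-up numbers, p = 8): with f = 0.1, Amdahl gives Sp <= 1 / (0.1 + 0.9/8) ≈ 4.7;
    with s = 0.1, Gustafson-Barsis gives Sp = 8 - 0.1*7 = 7.3.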

WORK

Total operations (time) for a task
T1 = Work
p * Tp = Work → T1 = p * Tp ?? Rare – due to ???

WORK-SPAN MODEL (2.5.6)

Describes dependencies among tasks & allows estimated times
Represents tasks as a DAG (Figure 2.8)
Critical path – the longest path
Span – minimum time, the time of the critical path
Assumes greedy task scheduling – no wasted resources or time
Parallel slack – excess parallelism; more tasks than can be scheduled at once

WORK-SPAN MODEL

Speedup <= Work / Span
Upper bound: ?? No more than…
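In the usual notation (T1 = work, T∞ = span), the bounds behind these slides can be written as follows (added for reference; the greedy-scheduling bound is the standard one):

    Tp >= max(T1 / p, T∞)
    Speedup Sp = T1 / Tp <= min(p, T1 / T∞) = min(p, Work / Span)
    Greedy scheduling: Tp <= (T1 - T∞)/p + T∞

    Example (made-up numbers): Work = 100, Span = 10, p = 8
    Speedup <= 100 / 10 = 10; greedy bound Tp <= (100 - 10)/8 + 10 = 21.25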

ASYMPTOTIC COMPLEXITY (2.5.7)

For comparing algorithms!!
Time complexity: describes execution-time growth in terms of input size
Space complexity: describes growth of memory requirements in terms of input size
Ignores constants
Machine independent

BIG OH NOTATION (P.66)

Big Oh of F(n) – upper bound
O(F(n)) = { G(n) | there exist positive constants c & N0 such that |G(n)| ≤ c·F(n) for all n ≥ N0 }

*Memorize
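A quick example of applying the definition (made-up function):

    G(n) = 3n² + 5n is in O(n²): choose c = 4 and N0 = 5.
    For n ≥ 5, 5n ≤ n², so 3n² + 5n ≤ 4n² = c·F(n).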


BIG OMEGA & BIG THETA

Big Omega – Functions that define Lower Bound

Big Theta – Functions that define a Tight Bound – Both Upper & Lower Bounds

CONCURRENCY VS. PARALLELISM

Parallel: work actually occurring at the same time – limited by the number of processors
Concurrent: tasks in progress at the same time but not necessarily executing – "unlimited"

Omit 2.5.8 & most of 2.5.9

PITFALLS OF PARALLEL PROGRAMMING (2.6)

Pitfalls = issues that can cause problems
Synchronization – often required
  Too little → non-determinism
  Too much → reduces scaling, increases time & may cause deadlock

RACE CONDITIONS (2.6.1)

A situation in which the final result depends on the order in which tasks complete their work
Occurs when concurrent tasks share a memory location & there is a write operation
Unpredictable – races don't always cause errors
Interleaving: instructions from 2 or more tasks are executed in an alternating manner

RACE CONDITIONS ~ EXAMPLE 2.2

Task A:
  A = X
  A += 1
  X = A

Task B:
  B = X
  B += 2
  X = B

Assume X is initially 0.
What are the possible results?
So, tasks A & B are not REALLY independent!
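A runnable C++ sketch of Example 2.2 (an added illustration, not the book's code; std::atomic is used only so that each individual read and write of X is a single well-defined step):

    #include <atomic>
    #include <iostream>
    #include <thread>

    std::atomic<int> X{0};   // shared; initially 0

    void taskA() {
        int a = X.load();    // A = X
        a += 1;              // A += 1
        X.store(a);          // X = A
    }

    void taskB() {
        int b = X.load();    // B = X
        b += 2;              // B += 2
        X.store(b);          // X = B
    }

    int main() {
        std::thread tA(taskA), tB(taskB);
        tA.join(); tB.join();
        // Depending on how the loads and stores interleave, X may end up 1, 2, or 3.
        std::cout << "X = " << X.load() << "\n";
        return 0;
    }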

RACE CONDITIONS ~ EXAMPLE 2.3

Task A:
  X = 1
  A = Y

Task B:
  Y = 1
  B = X

Assume X & Y are initially 0.
What are the possible results?

SOLUTIONS TO RACE CONDITIONS (2.6.2)

Mutual exclusion, locks, semaphores, atomic operations
Mechanisms that prevent simultaneous access to a memory location – one task completes its access before the other is allowed to start
Does not always solve the problem – the result may still depend on which task executes first
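A sketch of one such mechanism applied to Example 2.2 (illustrative): a mutex makes each read-modify-write sequence mutually exclusive, so here X always ends up 3; in other examples the result could still depend on which task runs first.

    #include <iostream>
    #include <mutex>
    #include <thread>

    int X = 0;              // shared; initially 0
    std::mutex X_lock;      // protects X

    void taskA() {
        std::lock_guard<std::mutex> guard(X_lock);  // only one task at a time
        int a = X;
        a += 1;
        X = a;
    }

    void taskB() {
        std::lock_guard<std::mutex> guard(X_lock);
        int b = X;
        b += 2;
        X = b;
    }

    int main() {
        std::thread tA(taskA), tB(taskB);
        tA.join(); tB.join();
        std::cout << "X = " << X << "\n";   // always 3 here
        return 0;
    }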

DEADLOCK (2.6.3)

A situation in which 2 or more processes cannot proceed because each is waiting on the other – STOP
Recommendations for avoidance:
  Avoid mutual exclusion
  Hold at most 1 lock at a time
  Acquire locks in the same order
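An illustrative C++ sketch of the "acquire locks in the same order" rule (the account variables are made-up example data): both tasks take the two locks through std::scoped_lock, which acquires them without deadlocking, instead of one task taking lockA then lockB while the other takes lockB then lockA.

    #include <mutex>
    #include <thread>

    std::mutex lockA, lockB;
    int accountA = 100, accountB = 100;       // assumed example data

    void transferAtoB(int amount) {
        std::scoped_lock both(lockA, lockB);  // acquires both locks without deadlock
        accountA -= amount;
        accountB += amount;
    }

    void transferBtoA(int amount) {
        std::scoped_lock both(lockA, lockB);  // same order / same mechanism in every task
        accountB -= amount;
        accountA += amount;
    }

    int main() {
        std::thread t1(transferAtoB, 10), t2(transferBtoA, 20);
        t1.join(); t2.join();
        return 0;
    }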

DEADLOCK – NECESSARY & SUFFICIENT CONDITIONS

1. Mutual Exclusion Condition: the resources involved are non-shareable.
Explanation: At least one resource must be held in a non-shareable mode; that is, only one process at a time claims exclusive control of the resource. If another process requests that resource, the requesting process must be delayed until the resource has been released.

2. Hold and Wait Condition: requesting processes already hold resources while waiting for additional resources.
Explanation: There must exist a process that is holding a resource already allocated to it while waiting for additional resources that are currently being held by other processes.

3. No-Preemption Condition: resources already allocated to a process cannot be preempted.
Explanation: Resources cannot be taken away from a process; they are used to completion or released voluntarily by the process holding them.

4. Circular Wait Condition: the processes in the system form a circular list or chain in which each process is waiting for a resource held by the next process in the list.

STRANGLED SCALING (2.6.4)

Fine-grain locking – use many locks on small sections rather than 1 lock on a large section
Notes:
  1 large lock is faster to manage but blocks other processes
  Consider the time to set/release many locks
Example: lock a row of the matrix, not the entire matrix
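An illustrative sketch of the matrix example (the layout and helper function are assumptions): one mutex per row lets tasks updating different rows run in parallel, at the cost of storing and acquiring many locks.

    #include <mutex>
    #include <vector>

    struct Matrix {
        int rows, cols;
        std::vector<double> data;           // row-major storage
        std::vector<std::mutex> row_locks;  // fine grain: one lock per row
        Matrix(int r, int c) : rows(r), cols(c), data(r * c, 0.0), row_locks(r) {}

        // Only the touched row is locked; updates to other rows are not blocked.
        void add_to_row(int r, double v) {
            std::lock_guard<std::mutex> guard(row_locks[r]);
            for (int j = 0; j < cols; ++j)
                data[r * cols + j] += v;
        }
    };

    int main() {
        Matrix m(4, 4);
        m.add_to_row(2, 1.0);
        return 0;
    }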

LACK OF LOCALITY (2.6.5)

Two assumptions for good locality:
  Temporal locality – the same location will be accessed again soon
  Spatial locality – nearby locations will be accessed soon
Reminder: cache line – the block that is retrieved
Currently, a cache miss costs roughly 100 cycles

LOAD IMBALANCE (2.6.6)

Uneven distribution of work over processors
Related to the decomposition of the problem
Few vs. many tasks – what are the implications?

OVERHEAD (2.6.7)

Always present in parallel processing: launching tasks, synchronization
Small vs. larger processors ~ implications???

~the end of chapter 2~
