Data Science: Advanced-R Boot Camp
Parallelism
Chuck Cartledge, PhD
22 February 2020
Intro. Amdahl Set-up More meat Hands-on Q & A Conclusion References Files
Table of contents (1 of 1)
1 Intro.
2 Amdahl
    A little math
3 Set-up
4 More meat
    A more substantial example
5 Hands-on
    A toughy
6 Q & A
7 Conclusion
8 References
9 Files
© Old Dominion University
What are we going to cover?
We’re going to talk about doing more than one thing at a time.
Look at, and understand, Amdahl’s Law
See the effect of taking a task and moving it from sequential to parallel
Determine when it makes sense to stop going parallel
A little math
Amdahl’s Law [1]
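Time for serial execution $\overset{\text{def.}}{=} T(1)$
Portion that can be parallelized $\overset{\text{def.}}{=} P \in [0, 1]$
Number of parallel resources $\overset{\text{def.}}{=} n$
$T(n) = T(1) \left( (1 - P) + \frac{P}{n} \right)$
Speed-up $\overset{\text{def.}}{=} S(n)$
$S(n) = \frac{T(1)}{T(n)} = \frac{1}{(1 - P) + \frac{P}{n}}$
(Photo: Dr. Gene Amdahl, circa 1960)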
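A minimal sketch of Amdahl's Law in R, using the definitions on this slide: run time T(n) = T(1) * ((1 - P) + P / n) and speed-up S(n) = T(1) / T(n). The function names and the numbers plugged in are illustrative assumptions, not taken from the attached files.

```r
# Amdahl's Law: predicted run time and speed-up for n parallel resources.
# t1 is the serial time T(1); p is the parallelizable portion P in [0, 1].
amdahl_time <- function(t1, p, n) {
  t1 * ((1 - p) + p / n)
}

amdahl_speedup <- function(p, n) {
  1 / ((1 - p) + p / n)
}

# Hypothetical inputs, chosen only for illustration.
amdahl_time(t1 = 100, p = 0.9, n = 4)   # 100 * (0.1 + 0.225) = 32.5
amdahl_speedup(p = 0.9, n = 4)          # 1 / 0.325, about 3.08
amdahl_speedup(p = 0.9, n = Inf)        # approaches 1 / (1 - p), about 10
```

The last call shows the ceiling: with 10% of the work serial, no number of processors gives more than a tenfold speed-up.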
A little math
Amdahl’s Law (A summary)
Division and measurement of serial and parallel operations appears time and again. (Shades of Mandelbrot.)
“Make the common fast.”
“Make the fast common.”
Understand what parts have to be done serially
Understand what parts can be done in parallel
Need to factor in “overhead” costs when computing speed up.
A little math
Some questions are easily stated, . . .
Which of these are parallelizable (and why)?
1 a[i] = b[i] + c[i]
2 a[i] = f(b)
3 a[i] = a[i − 1] + b[i − 1]
4 a = b + c
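One way to see the difference between items 1 and 3 is to write both as R loops. This is only a sketch: the vectors and the starting value a[1] = 0 are made-up assumptions, and `c_vec` stands in for `c` to avoid masking R's `c()` function.

```r
b     <- c(1, 2, 3, 4, 5)
c_vec <- c(10, 20, 30, 40, 50)

# Item 1: a[i] = b[i] + c[i]. Each element depends only on index i,
# so the iterations could run in any order, or on different workers.
a_elementwise <- vapply(seq_along(b),
                        function(i) b[i] + c_vec[i],
                        numeric(1))

# Item 3: a[i] = a[i - 1] + b[i - 1]. Element i needs element i - 1,
# so, as written, the iterations must run in order.
a_recurrence <- numeric(length(b))
a_recurrence[1] <- 0                      # assumed starting value
for (i in 2:length(b)) {
  a_recurrence[i] <- a_recurrence[i - 1] + b[i - 1]
}

a_elementwise   # 11 22 33 44 55
a_recurrence    # 0 1 3 6 10
```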
The basics
Fundamental to parallel computing (in R on one machine)
1 Load the parallel library
2 Find out the number of cores available
3 Create a “cluster”
4 Make available objects that each cluster node will need
5 Spread out work on cluster, and gather results
6 Shut down the cluster
library(parallel)
# Calculate the number of cores
no_cores
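The six steps above can be sketched with the `parallel` package as follows. This is an illustrative sketch, not the code from the attached files. Leaving one core free and the toy task of raising 2 to a power are assumptions made here.

```r
library(parallel)                      # step 1: load the library

# Step 2: count the cores. Leaving one core free for the rest of the system,
# and falling back to 1 when the count is unknown, are assumptions of this sketch.
detected <- detectCores()
no_cores <- if (is.na(detected)) 1 else max(1, detected - 1)

cl <- makeCluster(no_cores)            # step 3: create a "cluster"

# Step 4 is skipped here: the worker function below uses no outside objects.

# Step 5: spread the work over the nodes and gather the results as a list.
results <- parLapply(cl, 2:4, function(exponent) 2^exponent)

stopCluster(cl)                        # step 6: shut down the cluster

unlist(results)                        # 4 8 16
```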
Sometimes you need to pass things to the cluster
Continuing from the previous slide.
cl
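Worker processes start with empty workspaces, so any object the worker function refers to has to be sent to them first. A minimal sketch using `clusterExport()` from the `parallel` package; the variable `base_value` and the choice of two workers are assumptions for illustration.

```r
library(parallel)

cl <- makeCluster(2)          # two workers, an arbitrary choice for this sketch

base_value <- 3               # so far this exists only in the master session

# Copy base_value to every worker; envir tells clusterExport where to find it.
clusterExport(cl, "base_value", envir = environment())

# The worker function can now refer to base_value.
powers <- parLapply(cl, 2:4, function(exponent) base_value^exponent)

stopCluster(cl)

unlist(powers)                # 9 27 81
```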
Have the system do the heavy lifting
Introduce new libraries and operators (attached file: parallel.R):
library(iterators)
library(foreach)
library(doParallel)
library(parallel)
no_cores
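With these libraries loaded, `foreach` and the `%dopar%` operator take over the job of distributing the loop body and exporting the variables it uses. A hedged sketch, not the contents of parallel.R; the two-worker cluster and the toy computation are assumptions.

```r
library(iterators)
library(foreach)
library(doParallel)
library(parallel)

cl <- makeCluster(2)          # two workers, an arbitrary choice for this sketch
registerDoParallel(cl)        # tell foreach to run %dopar% loops on this cluster

base_value <- 2

# Each iteration runs on a worker; .combine = c glues the results into a vector.
powers <- foreach(exponent = 2:4, .combine = c) %dopar% {
  base_value^exponent
}

stopCluster(cl)

powers                        # 4 8 16
```

Replacing `%dopar%` with `%do%` runs the same loop sequentially, which is handy for debugging.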
Compare execution time of local vs. parallel
It doesn’t look good (attached file: parallel.R):
...
caseOne
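A comparison of this kind can be set up with `system.time()`. The sketch below is not the code from parallel.R; the task, input size, and worker count are assumptions. With a task this small, cluster start-up and communication usually outweigh the work itself, so the parallel version tends to be slower, though the actual timings depend on the machine.

```r
library(parallel)

tiny_task <- function(x) sqrt(x)   # deliberately trivial work
inputs <- 1:1000

# Local (sequential) execution.
serial_time <- system.time(
  serial_result <- lapply(inputs, tiny_task)
)

# Parallel execution, including the cost of starting and stopping the cluster.
parallel_time <- system.time({
  cl <- makeCluster(2)
  parallel_result <- parLapply(cl, inputs, tiny_task)
  stopCluster(cl)
})

identical(serial_result, parallel_result)   # same answers either way
serial_time["elapsed"]
parallel_time["elapsed"]
```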
Same image as previous slide.
Parallel processing takes longer than single CPU!
A more substantial example
Taken from section 19.1 [3].
A more substantial example dealing with machine learning (ML) as applied to a subset of the American Community Survey (ACS) collected yearly by the US Census Bureau. Data for New York State is processed to identify:
Strongest indicators of having high income (number of workers in the family and not being on food stamps)
Strongest indicators of having low income (using coal heat and living in a mobile home)
Load attached 19-1.R file into the IDE.
A more substantial example
Things of note about the 19.1 code:
First 27 lines are demo
Lines 28 through 59 are model building
Lines 60 through 98 are model refining
Lines 99 through 138 run the model with different values for alpha on your personal cluster
Lines 139 through 184 visualize optimal control parameters
Lines 185 through 199 visualize how well the model fits
Lines 199 through 203 provide the answers.
A more substantial example
“Playing” with Lander’s model
Load the file parallel-02.R into the IDE.
The first 109 lines are functionally the same as 19-1.R
The variable loader on line 112 controls how much work each core will perform (higher is more)
Lines 114 through 173 run the model spread across the cores, and plot absolute and relative execution times based on the number of cores used
Lines 181 through 192 compute the amount of the program execution that was actually parallelizable
Increasing the “load” decreases the influence of misc. actions outside R’s control.
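One way to compute the parallelizable portion from measured times is to invert Amdahl's Law: from S = 1 / ((1 - P) + P / n) it follows that P = (1 - 1/S) / (1 - 1/n). Whether parallel-02.R uses exactly this formula is not shown here, so treat the sketch as an assumption; the function name and timings are hypothetical.

```r
# Estimate the parallelizable portion P from two measured run times.
# t1: time on one core, tn: time on n cores (n must be greater than 1).
estimate_parallel_portion <- function(t1, tn, n) {
  stopifnot(n > 1, t1 > 0, tn > 0)
  speedup <- t1 / tn
  (1 - 1 / speedup) / (1 - 1 / n)
}

# Hypothetical timings: 100 s on one core, 40 s on four cores.
# Speed-up is 2.5, so P = (1 - 0.4) / (1 - 0.25), about 0.8.
estimate_parallel_portion(t1 = 100, tn = 40, n = 4)
```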
A more substantial example
Measured performance 16 tasks
AMD FX(tm)-8120 Eight-Core Processor, 8GB RAM, Ubuntu 18.04
A more substantial example
Measured performance 40 tasks
AMD FX(tm)-8120 Eight-Core Processor, 8GB RAM, Ubuntu 18.04
A more substantial example
Measured performance 80 tasks
AMD FX(tm)-8120 Eight-Core Processor, 8GB RAM, Ubuntu 18.04
A more substantial example
Empirical conclusion
Based on looking at the collected data:
There is overhead associated with starting a parallel resource.
As the number of tasks increases, the parallel start-up cost is amortized across all tasks and becomes less of an issue.
The computed percentage of parallelizable work is relatively constant, and is more constant as the number of tasks increases.
It isn’t always clear when it makes sense to parallelize a task.
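The amortization effect can be illustrated with a toy model: a fixed start-up cost plus the tasks divided evenly over the cores. The start-up time and per-task time below are invented for illustration and are not the measurements from the preceding slides; only the task counts (16, 40, 80) and the eight cores echo them.

```r
startup   <- 2      # seconds to start the cluster (hypothetical)
task_time <- 0.5    # seconds per task (hypothetical)
cores     <- 8
tasks     <- c(16, 40, 80)

serial_time   <- tasks * task_time
# Each core handles ceiling(tasks / cores) tasks, after the one-off start-up.
parallel_time <- startup + ceiling(tasks / cores) * task_time

amortized <- data.frame(
  tasks,
  serial_time,
  parallel_time,
  speedup       = serial_time / parallel_time,
  startup_share = startup / parallel_time   # fraction of time spent starting up
)
amortized
```

In this model the start-up share falls from about two thirds of the parallel run time at 16 tasks to under a third at 80 tasks, and the speed-up rises accordingly.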
A toughy
Exploring parallel processing
1 Write a parallel processing script meeting these requirements:
    1 Is spread across all cores in your CPU
    2 Uses a function that you create, where the function takes a numeric argument, and returns the next-to-last character of the string representation of the number, or “-” if not possible.
    3 The range of numeric input values is from 0 to 100.
    4 The parallel process returns a data frame.
2 Amdahl’s Law has some interesting visible and hidden aspects.
    1 Write a script that plots the effect of Amdahl’s Law where the sequential portion ranges from 5% to 95%, and the number of processors ranges from 1 to 10,000.
    2 Parallel resources are not free. They take setup time, communication time, and location time. Assume that these times are 0.1 seconds per processing unit. If a sequential implementation takes 10 seconds to execute, and we assume the same ranges from the previous step, when does it not make sense to use parallel processing?
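A possible starting point for the overhead question, under one reading of the exercise: the overhead is assumed to grow linearly, 0.1 seconds for each processing unit used, and is added to the run time predicted by Amdahl's Law. Other readings are defensible. Only a single sequential portion (5%) is shown; the function name is hypothetical.

```r
t1       <- 10      # serial run time in seconds (from the exercise)
overhead <- 0.1     # seconds per processing unit (from the exercise)

# Assumed model: Amdahl's run time plus overhead proportional to n.
time_with_overhead <- function(n, p) {
  t1 * ((1 - p) + p / n) + overhead * n
}

n     <- 1:10000
p     <- 0.95       # parallel portion when the sequential portion is 5%
times <- time_with_overhead(n, p)

best_n     <- n[which.min(times)]   # processor count with the shortest run time
worthwhile <- n[times < t1]         # counts that beat the 10 s serial run

best_n
range(worthwhile)
```

Repeating this for each sequential portion from 5% to 95%, and plotting `times` against `n`, extends the sketch to the full range asked for.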
Q & A time.
Q: How does a hacker fix a function which doesn’t work for all of the elements in its domain?
A: He changes the domain.
What have we covered?
Amdahl’s Law is simple, and has hidden aspects.
Even if a task is parallelizable, it may not make sense to do so.
There is an upper limit on the number of parallel resources it makes sense to use.
Next: Data reshaping and subsetting
References (1 of 1)
[1] Gene M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, Proceedings of the Spring Joint Computer Conference, ACM, 1967, pp. 483–485.
[2] Max Gordon, How-to go parallel in R basics + tips, https://www.r-bloggers.com/how-to-go-parallel-in-r-basics-tips/, 2015.
[3] Jared P. Lander, R for Everyone, Pearson Education, 2014.
Files of interest
1 Code snippets
2 parallel.R
3 19-1.R
4 parallel-02.R
## First codes
## https://www.r-bloggers.com/how-to-go-parallel-in-r-basics-tips/
rm(list=ls())
library(parallel)
# Calculate the number of cores
no_cores