Parallel R - Virginia Tech
Bob Settlage
Feb 14, 2018



Today's Agenda

- Introduction
- Brief aside: R and parallel R on ARC's systems
- Snow
- Rmpi
- pbdR (more brief)
- Conclusions

R

- Programming language and environment for statistical computing
- Free
- Intrinsic support for a wide array of statistical functionality
- Huge number of user-created packages to add or improve functionality

Introduction

If you need your code to go faster, you have a few options:

- Be a more efficient programmer
  - use vector/matrix operations
  - remove redundant operations
  - avoid memory copies / preallocate
- Port your code to C/C++/Fortran
  - full Monte (.C or .Call)
  - Rcpp
- Parallelize, i.e., use more cores
  - parallel packages
  - MPI
  - GPU…

An aside: Optimizing R

- Pre-allocate variables
- Vectorize (or perhaps use apply functions)
  - YES: z <- x * y
  - NO:  for (i in 1:length(x)) { z[i] <- x[i] * y[i] }
- Reference: The R Inferno, http://www.burns-stat.com/documents/books/the-r-inferno/
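To see why the vectorization advice matters, here is a minimal sketch (not from the slides) running both forms on the same inputs; the results must agree element-wise, while the loop pays per-element interpreter overhead.

```r
# Compare the loop and vectorized forms on shared inputs.
x <- runif(1e5)
y <- runif(1e5)

z_loop <- numeric(length(x))  # preallocated, per the slide's first tip
t_loop <- system.time(
    for (i in seq_along(x)) z_loop[i] <- x[i] * y[i]
)

t_vec <- system.time(z_vec <- x * y)  # vectorized form

all.equal(z_loop, z_vec)  # TRUE: same answer either way
```

On typical hardware the vectorized form is commonly orders of magnitude faster at this size.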


The Need for Parallelism


The Need for Parallelism: transistor count

Parallelism in R

- Serial by default
  - Exception: matrix operations using BLAS (ARC systems)
- Embarrassing parallelism (e.g. Monte Carlo):
  - snow
  - snowfall
- More advanced:
  - Rmpi
  - pbdR


R on ARC's Systems

http://www.arc.vt.edu/

- Includes ggplot2, rlecuyer, plyr, and several other packages
- Built with OpenBLAS for GCC and MKL
  - Use MKL_NUM_THREADS or OPENBLAS_NUM_THREADS to control threading
- Plotting via Cairo (offline) and X11 (interactive)
- Parallel packages provided as part of a separate R-parallel module built against OpenMPI


R libs installed on ARC


Getting started on ARC Systems

- Request an account: http://www.arc.vt.edu/account
- Request a system unit allocation: http://www.arc.vt.edu/allocations
- R documentation: http://www.arc.vt.edu/r
- These examples and more: https://secure.hosting.vt.edu/www.arc.vt.edu/userguide/r/#examples

Snow

NOTE: this is being replaced by the parallel package

- Simple Network of Workstations (SNOW)
- For embarrassingly parallel tasks
- Master/Slave model

Snow: Start/Stop cluster

library(snow)
library(parallel)
library(Rmpi)  # for reference later, not part of SNOW

## start a cluster with ncores
ncores <- 5
cl <- makeCluster(ncores, type = "MPI")
# Initialize RNG
clusterSetupRNG(cl, type = "RNGstream")
# VERY IMPORTANT: STOP the cluster when finished
stopCluster(cl)
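Since snow is being replaced by the parallel package, the same start/compute/stop pattern can be tried locally with a socket cluster, which needs no MPI. A sketch, assuming two local cores are available:

```r
library(parallel)

cl <- makeCluster(2, type = "PSOCK")        # local socket workers, no MPI
res <- parLapply(cl, 1:4, function(i) i^2)  # same apply-style interface
stopCluster(cl)                             # always stop the cluster

unlist(res)  # 1 4 9 16
```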

Snow: Computing

- Calls the same function across the cluster:
  - clusterCall(cl, fun, ...)
- Parallel versions of apply:
  - clusterApply(cl, x, fun, ...)
  - parApply(cl, X, MARGIN, FUN, ...)
  - parLapply(cl, x, fun, ...)
  - parRapply(cl, x, fun, ...)
  - parCapply(cl, x, fun, ...)

SNOW: simple example

library(snow)
library(parallel)
library(Rmpi)  # for reference later, not part of SNOW

## start a cluster with ncores
ncores <- 5
cl <- makeCluster(ncores, type = "MPI")
# Initialize RNG
clusterSetupRNG(cl, type = "RNGstream")
clusterApply(cl, 1:2, get("+"), 3)
xx <- 1
clusterExport(cl, "xx")
clusterCall(cl, function(y) xx + y, 2)
# VERY IMPORTANT: STOP the cluster when finished
stopCluster(cl)

Example: Monte Carlo π

- The ratio of the area of the unit circle to the area of the unit square is π/4
- So:
  - Randomly pick S points in the unit square
  - Count the number in the unit circle (C)
  - Then π ≈ 4 C/S

MC π: code

# generate n.pts (x,y) points in the unit square,
# determine if they are in the unit circle,
# return the proportion of points in the unit circle * 4
mcpi <- function(n.pts) {
    m <- matrix(runif(2 * n.pts), n.pts, 2)
    in.ucir <- function(x) {
        as.integer((x[1]^2 + x[2]^2) <= 1)
    }
    cir <- apply(m, 1, in.ucir)
    return(4 * mean(cir))
}
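As a quick serial sanity check (not on the slides), mcpi() can be run directly; with this many points the estimate lands close to π.

```r
# mcpi() from the slide, reproduced so this snippet stands alone.
mcpi <- function(n.pts) {
    m <- matrix(runif(2 * n.pts), n.pts, 2)
    in.ucir <- function(x) as.integer((x[1]^2 + x[2]^2) <= 1)
    cir <- apply(m, 1, in.ucir)
    4 * mean(cir)
}

set.seed(1)       # any seed; fixed here for reproducibility
est <- mcpi(1e5)
abs(est - pi)     # small (roughly 0.01 or less) at this sample size
```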

MC π: parallelize

# start up and initialize the cluster
cl <- makeCluster(ncores, type = "MPI")
clusterSetupRNG(cl, type = "RNGstream")
# determine if points are in the unit circle
system.time({
    cir <- parSapply(cl, seq(from = 1000, to = 20000, by = 1000), mcpi)  # calculate pi
})
pi.approx <- mean(cir)
print(pi.approx)
# stop the cluster
stopCluster(cl)


MC compute time via SNOW



MC compute error via SNOW


MC π: An Optimization example

n.pts <- 5e+05
m <- matrix(runif(2 * n.pts), n.pts, 2)
in.ucir <- function(x) {
    as.integer((x[1]^2 + x[2]^2) <= 1)
}
system.time(apply(m, 1, in.ucir))
system.time(as.integer(m[, 1]^2 + m[, 2]^2 <= 1))
system.time(cir <- parSapply(cl, rep(10000, 500), mcpi))

MCMC: Metropolis-Hastings

- Goal: draw random samples whose density approximates a given distribution
- Used to model stochastic inputs
- Do not need to know the normalizing factor
  - Functions in high dimensions

MCMC: Metropolis-Hastings (cont)

- Given a:
  - target distribution
  - jumping distribution
  - initial sample
- Choose a candidate sample from the jumping distribution centered at the initial sample
- Accept the candidate as the new sample:
  - always, if the candidate is a better fit (per the target distribution)
  - with probability < 1, if the candidate is a worse fit
- Repeat

M-H: Code (MC part)

Reference: Lam, Patrick. "MCMC Methods: Gibbs Sampling and the Metropolis-Hastings Algorithm."

# function to calculate the next candidate sample
theta.update <- function(theta.cur) {
    theta.can <- jump(theta.cur)
    # acceptance probability
    accept.prob <- samp(theta.can)/samp(theta.cur)
    # compare with a sample from the uniform dist (0 to 1)
    if (runif(1) <= accept.prob) theta.can else theta.cur
}

M-H: code

# function to generate (n.sims - burnin) samples
mh <- function(n.sims, start, burnin, samp, jump) {
    theta.cur <- start
    draws <- numeric(n.sims)  # preallocated, per the earlier optimization aside
    # call theta.update() n.sims times
    for (i in 1:n.sims) {
        draws[i] <- theta.cur <- theta.update(theta.cur)
    }
    # return the samples after the burn-in
    return(draws[(burnin + 1):n.sims])
}
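To make the sampler concrete, here is a self-contained sketch (not on the slides) targeting a standard normal known only up to its normalizing constant, with a normal random-walk jump. samp and jump are illustrative names, and mh() is simplified here to pick up samp and jump from the global environment, since theta.update() resolves them there anyway.

```r
# Sampler functions from the slides, reproduced so this runs standalone.
theta.update <- function(theta.cur) {
    theta.can <- jump(theta.cur)
    accept.prob <- samp(theta.can)/samp(theta.cur)
    if (runif(1) <= accept.prob) theta.can else theta.cur
}
mh <- function(n.sims, start, burnin) {
    theta.cur <- start
    draws <- numeric(n.sims)
    for (i in 1:n.sims) {
        draws[i] <- theta.cur <- theta.update(theta.cur)
    }
    draws[(burnin + 1):n.sims]
}

samp <- function(x) exp(-x^2/2)                  # unnormalized N(0,1) density
jump <- function(x) rnorm(1, mean = x, sd = 1)   # random-walk proposal

set.seed(123)
draws <- mh(20000, start = 0, burnin = 5000)
c(mean(draws), sd(draws))  # should be near 0 and 1
```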


M-H: parallelize

# start up and initialize the cluster
cl <- makeCluster(ncores, type = "MPI")
clusterSetupRNG(cl, type = "RNGstream")
# samples per core
mh.n.sims.cl <- ceiling(mh.n.sims/ncores)
# call mh on each core
mh.draws.cl <- clusterCall(cl, mh, mh.n.sims.cl, start = 1, burnin = mh.burnin,
    samp = samp.fcn, jump = jump.fcn)
# reduce list to 1-D
mh.draws <- unlist(mh.draws.cl)
# stop the cluster
stopCluster(cl)


SNOW References

- Snow manual: http://cran.r-project.org/web/packages/snow/snow.pdf
- Snow functions: http://www.sfu.ca/~sblay/R/snow.html
- ARC: http://www.arc.vt.edu/r


MPI


MPI: Program Models

- "Brute Force": decompose the problem
- "Task Push": master creates a list of tasks and sends them to slaves in round-robin fashion
- "Task Pull": slaves report to the master when finished and receive new tasks

Examples: http://cran.r-project.org/web/packages/pbdMPI/vignettes/pbdMPI-guide.pdf

Rmpi

- User-developed package
- Interface to MPI for R
  - Master/slave paradigm
- Allows parallelism beyond the embarrassingly parallel model of, e.g., SNOW
- Provided as part of the ARC R module

Rmpi: Starting and Stopping

- Load the library: library(Rmpi)
- Spawn nsl slaves: mpi.spawn.Rslaves(nslaves = nsl)
- Shut down slaves (IMPORTANT): mpi.close.Rslaves()
- Clean up and quit R: mpi.quit()

Rmpi basics

- Run an Rmpi script like any other R script: Rscript mcpi_rmpi.r
- Get the number of processes (the number of slaves + 1): mpi.comm.size()
- Get the rank of a process: mpi.comm.rank()
  - Master: 0
  - Slaves: 1+

Rmpi: Executing Remotely

# Execute on the master:
paste("I am", mpi.comm.rank(), "of", mpi.comm.size())
## [1] "I am 0 of 3"

# Execute Rcommand on the slaves:
mpi.bcast.cmd(Rcommand)

# Execute on the slaves and return to the master
# (returns an nslaves-length list):
result <- mpi.remote.exec(Rcommand)


Rmpi: Hello World


Rmpi communications

- Broadcast a function or variable from master to slaves: mpi.bcast.Robj2slave(object)
- Send an object to a destination: mpi.send.Robj(object, destination, tag)
- Receive a sent message: recv <- mpi.recv.Robj(mpi.any.source(), mpi.any.tag())
- Get the source and tag of a received message: recv.info <- mpi.get.sourcetag()

Rmpi Example: Pass messages

# Function to pass a message to the next slave
message.pass <- function() {
    # Get each slave's rank
    myrank <- mpi.comm.rank()
    # Get the partner slave's rank (some hackery to avoid the master)
    otherrank <- (myrank + 1)%%mpi.comm.size()
    otherrank <- otherrank + (otherrank == 0)
    # Send a message to the partner
    mpi.send.Robj(paste("I am rank", myrank), dest = otherrank, tag = myrank)
    # Receive the message & tag (includes source)
    recv.msg <- mpi.recv.Robj(mpi.any.source(), mpi.any.tag())
    recv.tag <- mpi.get.sourcetag()
    paste("Received message '", recv.msg, "' from process ", recv.tag[1], "\n", sep = "")
}

Rmpi: other communication functions

- Low-level:
  - Send: mpi.send()
  - Receive: mpi.recv()
- Advanced:
  - Scatter: mpi.scatter()
  - Gather: mpi.gather()
  - Reduce: mpi.reduce()

Rmpi example: MC π, part 1

# Function to calculate whether a point is in the unit circle
in.ucir <- function(x) {
    as.integer((x[, 1]^2 + x[, 2]^2) <= 1)
}
# Function to generate n.pts random points in the unit square and count the
# number in the unit circle
count.in.cir <- function(n.pts) {
    # Create a list of n.pts random (x,y) pairs
    m <- matrix(runif(n.pts * 2), n.pts, 2)
    # Determine whether each point is in the unit circle
    in.cir <- in.ucir(m)
    # Count the points in the unit circle
    return(sum(in.cir))
}
# Send variables and functions to slaves
mpi.bcast.Robj2slave(n.pts)
mpi.bcast.Robj2slave(in.ucir)
mpi.bcast.Robj2slave(count.in.cir)

Rmpi example: MC π, part 2

# Call count.in.cir() on slaves
mpi.bcast.cmd(n.in.cir <- count.in.cir(n.pts))
# Call count.in.cir() on master
n.in.cir <- count.in.cir(n.pts)
# Use mpi.reduce() to total across all processes. Have to do two steps
# (slaves, master) to avoid a hang
mpi.bcast.cmd(mpi.reduce(n.in.cir, type = 1, op = "sum"))
n.in.cir <- mpi.reduce(n.in.cir, type = 1, op = "sum")
# pi is roughly 4 * proportion of points in the circle
pi.approx <- 4 * n.in.cir/(mpi.comm.size() * n.pts)
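The counting logic itself is independent of MPI, so it can be sanity-checked serially (a sketch, not on the slides):

```r
# in.ucir() and count.in.cir() from the slides, reproduced standalone.
in.ucir <- function(x) {
    as.integer((x[, 1]^2 + x[, 2]^2) <= 1)
}
count.in.cir <- function(n.pts) {
    m <- matrix(runif(n.pts * 2), n.pts, 2)
    sum(in.ucir(m))
}

set.seed(2)
n.pts <- 2e+05
pi.approx <- 4 * count.in.cir(n.pts)/n.pts
pi.approx  # close to 3.14 at this sample size
```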

Rmpi: MC π

Notes:

- Generate and analyze data in each process
  - minimize the size of messages
  - minimize the frequency of message passing
- Use mpi.reduce() to sum up results

pbdR

- "Programming with Big Data in R"
- Designed for HPC

pbdR: Components

- MPI: pbdMPI, an SPMD-style MPI interface
- Distributed linear algebra and statistics:
  - pbdSLAP
  - pbdBASE
  - pbdDMAT
- pbdNCDF4: interface to NetCDF4 file formats
- pbdML: machine learning
- Profiling: pbdPROF, pbdPAPI, hpcvis

pbdMPI example:

Looks like a normal MPI call externally:

mpirun -np 8 Rscript mcpi_pbdr.r

library(pbdMPI, quiet = TRUE)
init()
n.pts <- 1e+06
in.ucir <- function(x) {
    as.integer((x[, 1]^2 + x[, 2]^2) <= 1)
}
count.in.cir <- function(n.pts) {
    m <- matrix(runif(n.pts * 2), n.pts, 2)
    in.cir <- in.ucir(m)
    return(sum(in.cir))
}
# Call count.in.cir on each process
n.in.cir <- count.in.cir(n.pts)
# Use reduce() to total across processes
n.in.cir <- reduce(n.in.cir, op = "sum")
pi.approx <- 4 * n.in.cir/(comm.size() * n.pts)
finalize()