1
HPAT.jl - Easy and Fast Big Data Analytics
Ehsan Totoni, Todd Anderson, Wajih Ul Hassan*, Tatiana Shpeisman
Parallel Computing Lab, Intel Labs
*Intern from UIUC
JuliaCon 2016
2
HPAT overview
• High Performance Analytics Toolkit (HPAT): a compiler-based framework for big data analytics and machine learning
• Goal: efficient large-scale analytics without sacrificing programmer productivity
  • Array-style programming
  • High performance
• Built on ParallelAccelerator
  • Domain-specific compiler heuristics for parallelization
  • Uses an efficient HPC stack (e.g. MPI)
• Bridges the enormous productivity-performance gap
HPAT.jl: https://github.com/IntelLabs/HPAT.jl
3
Let’s do Data Science
Logistic regression example:

function logistic_regression(iterations)
    points = …some small data…
    responses = …
    D = size(points,1)
    N = size(points,2)
    labels = reshape(responses,1,N)
    w = reshape(2*rand(D)-1,1,D)
    for i in 1:iterations
        w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
    end
    return w
end
4
What about large data?
• Challenges:
  • Long execution time
  • Data doesn’t fit in memory
• Solution:
  • Parallelism: cluster or cloud
• How:
  • MPI/C++ (”gold standard”), MPI/Julia, Spark (library)
  • HPAT (“smart compiler”)
5
MPI/C++ (“gold standard”)
• Message Passing Interface (MPI)
• Pros:
  • Best performance!
• Cons:
  • Need to understand parallelism, MPI, parallel I/O, C++
  • High effort, tedious, error-prone, not readable…
Example code (not readable):

    herr_t ret;
    // set up file access property list with parallel I/O access
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
    assert(plist_id != -1);
    // set parallel access with communicator
    ret = H5Pset_fapl_mpio(plist_id, comm, info);
    assert(ret != -1);
    // open file
    file_id = H5Fopen("/lsf/lsf09/sptprice.hdf5", H5F_ACC_RDONLY, plist_id);
    assert(file_id != -1);
    ret = H5Pclose(plist_id);
    assert(ret != -1);
    // open dataset
    dataset_id = H5Dopen2(file_id, "/sptprice", H5P_DEFAULT);
    assert(dataset_id != -1);
    hid_t space_id = H5Dget_space(dataset_id);
    assert(space_id != -1);
    int num_dims = 0;
    num_dims = H5Sget_simple_extent_ndims(space_id);
    assert(num_dims == 1);
    /* get data dimension info */
    hsize_t sptprice_size;
    H5Sget_simple_extent_dims(space_id, &sptprice_size, NULL);
    hsize_t my_sptprice_size = sptprice_size / mpi_nprocs;
    hsize_t my_start = my_sptprice_size * mpi_rank;
    my_sptprice = (double*)malloc(my_sptprice_size * sizeof(double));
    // create a file dataspace independently
    hid_t my_dataspace = H5Dget_space(dataset_id);
    assert(my_dataspace != -1);
    // stride and block are NULL for contiguous hyperslab
    ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, &my_start, NULL, &my_sptprice_size, NULL);
    assert(ret != -1);
    ...
6
MPI/Julia
• MPI.jl
• Pros:
  • Less effort than C++
• Cons:
  • Still need to understand parallelism, MPI
  • Parallel I/O
  • Needs high-performance Julia
  • Infrastructure challenges
Example code:

function main()
    MPI.Init()
    …
    MPI.Allreduce!(send_arr, recv_arr, MPI.SUM, MPI.COMM_WORLD)
    …
    MPI.Finalize()
end
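To make the skeleton above concrete, here is a minimal, self-contained MPI.jl sketch of the same pattern (local work on each rank followed by an allreduce); the data, sizes, and file name are illustrative additions, not from the talk:

using MPI

function main()
    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)
    nprocs = MPI.Comm_size(comm)

    # Each rank computes a partial result on its own (made-up) chunk of data.
    local_chunk = rand(1_000_000)
    send_arr = [sum(local_chunk)]
    recv_arr = zeros(1)

    # Combine the per-rank partial sums; every rank gets the global result.
    MPI.Allreduce!(send_arr, recv_arr, MPI.SUM, comm)

    if rank == 0
        println("global sum over ", nprocs, " ranks: ", recv_arr[1])
    end
    MPI.Finalize()
end

main()

Launched the same way as any MPI program, e.g. `mpirun -np 4 julia sum.jl` (script name hypothetical).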
7
Apache Spark
• Map/reduce library
• Master-executer model
• Pros:
  • Easier than MPI/C++
  • Lots of “system” features
• Cons:
  • Development effort
  • Very slow
    • 100x slower than MPI/C++
    • Host language overheads, loses locality, high library overheads, etc.
Example code (Python):
if __name__ == "__main__":
    sc = SparkContext(appName="PythonLR")
    points = sc.textFile(file).mapPartitions(readPointBatch).cache()
    w = 2 * np.random.ranf(size=D) - 1

    def gradient(matrix, w):
        Y = matrix[:, 0]
        X = matrix[:, 1:]
        return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)

    def add(x, y):
        x += y
        return x

    for i in range(iterations):
        w -= points.map(lambda m: gradient(m, w)).reduce(add)
8
HPAT (“smart compiler”)
• Julia code → MPI/C++ “gold standard”
• A “smart parallelizing compiler” doesn’t exist, but…
• Observations:
  • Array code is implicitly parallel
  • Parallelism is simple for data analytics
    • map/reduce pattern
    • 1D decomposition, allreduce communication
HPAT.jl: https://github.com/IntelLabs/HPAT.jl
[Figure: 1D decomposition of the points and labels arrays across nodes; w present on every node]
9
using HPAT

@acc hpat function logistic_regression(iterations, file)
    points = DataSource(Matrix{Float64},HDF5,"/points", file)
    responses = DataSource(Vector{Float64},HDF5,"/responses",file)
    D = size(points,1)
    N = size(points,2)
    labels = reshape(responses,1,N)
    w = reshape(2*rand(D)-1,1,D)
    for i in 1:iterations
        w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
    end
    return w
end
weights = logistic_regression(100, "mydata.hdf5")
$ mpirun -np 64 julia logistic_regression.jl
Logistic Regression (HPAT)
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
95x speedup over Spark
Parallel I/O
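As a usage note, the example expects an HDF5 file containing "/points" and "/responses" datasets. A hypothetical way to create a small test input with the HDF5.jl package (the package choice, sizes, and exact on-disk layout expected by HPAT’s reader are my assumptions, not from the talk):

using HDF5

D, N = 10, 1000               # features x samples, matching the D x N layout above
points = rand(D, N)
responses = 1.0 * rand(0:1, N)  # random 0/1 labels as Float64

h5write("mydata.hdf5", "/points", points)        # Matrix{Float64} dataset
h5write("mydata.hdf5", "/responses", responses)  # Vector{Float64} dataset

Whether this layout matches exactly what the generated HDF5 reader expects is worth checking against the data-generation scripts in the HPAT.jl repository.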
10
double* logistic_regression(int64_t iterations, char* file)
{
    int mpi_rank, mpi_nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    int mystart = mpi_rank * (N/mpi_nprocs);
    int myend = (mpi_rank+1) * (N/mpi_nprocs);
    double *points = (double*)malloc((myend-mystart)*D*sizeof(double));
    …
    ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, …);
    ret = H5Dread(dataset_id, H5T_NATIVE_FLOAT, …);

    for (i = mystart; i < myend; i++) {
        …
        w_local = …
    }

    MPI_Allreduce(w_local, w, D, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return w;
}
Logistic Regression (generated)
Initialization, partitioning, allocation
Parallel I/O
Computation
Parameter synchronization
“Bare-metal” MPI/C++ code with near-zero overhead!
11
HPAT usage
• Dependencies:
  • ParallelAccelerator, Parallel HDF5, MPI
• “@acc hpat” function annotation
• Use matrix/vector operations, comprehensions
  • ParallelAccelerator operations
  • No “bad” for loops (see the sketch below)
• Column-major matrices
• All of program inside HPAT
• I/O using DataSource
HPAT.jl: https://github.com/IntelLabs/HPAT.jl
[Figure: 1D decomposition of the points and labels arrays across nodes; w present on every node]
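To illustrate the coding-style point, a hypothetical comparison (mine, not from the talk) of the whole-array style the compiler can analyze versus the kind of explicit element-wise loop it treats as “bad”; it assumes a points matrix as in the earlier example:

# Array style: a whole-array operation that ParallelAccelerator/HPAT can
# recognize as a map over the partitioned matrix plus a reduction.
sum_sq(points) = sum(points.^2)

# "Bad" style for this pipeline: an element-by-element loop with scalar
# indexing and a sequential accumulator, which defeats the pattern matching.
function sum_sq_loop(points)
    total = 0.0
    for j in 1:size(points, 2), i in 1:size(points, 1)
        total += points[i, j]^2
    end
    return total
end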
12
HPAT limitations
• HPAT can fail to parallelize!
  • Limitations in compiler analysis
  • Needs good coding style
  • Fallback: explicit map/reduce, @parfor code
• Only map/reduce parallel pattern supported
  • Data analytics, machine learning, optimization, etc.
  • Others like stencils (PDEs) not supported yet
• No sparse matrices yet
• HDF5 and text file formats
HPAT.jl: https://github.com/IntelLabs/HPAT.jl
13
Array Distribution Inference

Init(state)  # assume all arrays are partitioned
while isChanged(state)
    inferArrayDistribution(state, node)
end

function inferArrayDistribution(state, node)
    if isAssignment(node)
        # lhs and rhs are sequential if either is sequential
        seq = isSeq(state, lhs) || isSeq(state, rhs)
        isSeq(state, lhs) = isSeq(state, rhs) = seq
    elseif isGEMM(node)
        # e.g. w = labels*points' - shared parameter synchronization heuristic
        isSeq(lhs) = !isSeq(in1) && !isSeq(in2) && !isTransposed(in1) && isTransposed(in2)
    elseif isHPATcall(node)
        handleHPATcall(node)
    else
        # unknown call, assume all arrays are sequential
        isSeq(state, nodeArrs) = true
    end
end
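As an informal walk-through (mine, not from the slides) of how these rules classify the arrays in the logistic regression example:

# points :: distributed (1D-partitioned)  -- comes from DataSource
# labels :: distributed                   -- reshape of a DataSource array
# w      :: sequential (replicated)       -- small parameter vector

# w*points        -> GEMM with a sequential input, so the heuristic leaves
#                    the result distributed (one column block per rank)
# (...).*labels   -> elementwise ops keep the distributed classification
# (...)*points'   -> GEMM of two distributed inputs with the second one
#                    transposed: matches the heuristic, so the result is
#                    sequential, i.e. the small update every rank needs
#                    (realized as an allreduce for parameter synchronization)
# w -= ...        -> assignment: w stays sequential on every rank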
14
HPAT Compiler Pipeline
[Pipeline diagram] Julia source → Julia Compiler → Macro-Pass → Domain-Pass → Domain-IR → Parallel-IR → Distributed-Pass → HPAT Code Generation (MPI) with CGen → MPI/C++ source → Backend Compiler (ICC/GCC)
15
HPAT Compiler Pipeline
• MacroPass
  • “desugar” extensions like DataSource
• DomainPass
  • Generate variables, allocations, and function calls
  • Enable ParallelAccelerator pipeline
• DistributedPass
  • Infer partitioned vs. sequential arrays
  • Divide allocations, parallelize I/O and computation
  • “hook” distributed-memory libraries
• Backend code generation
  • MPI code extension for CGen
16
HPAT vs. Spark
Benchmark               Spark (s)   HPAT (s)   Speedup
1D Sum                      47        1.55       30x
1D Sum Filter               43        0.69       62x
Monte Carlo Pi              84        0.05     1680x
Logistic Regression       1061       11.13       95x
K-Means                    182        7.95       23x

Execution time (s). Cori at NERSC/LBL, 64 nodes (2048 cores).
17
Strong Scaling - Logistic Regression

Scaling of Spark vs. HPAT, execution time (s):
Nodes (Cores)   Spark   HPAT
64 (2k)          1061   11.13
128 (4k)          800    5.89
256 (8k)          634    4.11

Speedup over Spark: 95x at 64 nodes, 154x at 256 nodes.
18
Machine Learning Libraries
Intel® DAAL as backend (not readable):

    ...
    kmeans::init::Distributed<step1Local, double, kmeans::init::randomDense>
        localInit(nClusters, nBlocks * nVectorsInBlock, rankId * nVectorsInBlock);
    localInit.input.set(kmeans::init::data, dataSource.getNumericTable());
    localInit.compute();
    services::SharedPtr<byte> serializedData;
    InputDataArchive dataArch;
    localInit.getPartialResult()->serialize(dataArch);
    size_t perNodeArchLength = dataArch.getSizeOfArchive();
    if (rankId == mpi_root) {
        serializedData = services::SharedPtr<byte>(new byte[perNodeArchLength * nBlocks]);
    }
    byte *nodeResults = new byte[perNodeArchLength];
    dataArch.copyArchiveToArray(nodeResults, perNodeArchLength);
    MPI_Gather(nodeResults, perNodeArchLength, MPI_CHAR, serializedData.get(),
               perNodeArchLength, MPI_CHAR, mpi_root, MPI_COMM_WORLD);
    ….

HPAT equivalent:

using HPAT

@acc hpat function calcKmeans(k, file)
    points = DataSource(Matrix{Float64},HDF5,"/points", file)
    clusters = HPAT.Kmeans(points, k)
    return clusters
end
19
Building HPAT in Julia
• Compiler development experience
• The good:
  • Built-in type inference
  • Introspection/full control
• The bad:
  • No detailed AST definition
  • Julia compiler surprises
  • Long HPAT/ParallelAccelerator compilation time!
    • Type inference every time
20
Ongoing HPAT development
• Structured data processing without SQL!
  • Complex analytics all in array syntax
  • Inspired by TPCx-BB examples
• Array syntax for table operations
  • Join, filter, aggregate
  • Similar to DataFrames.jl
• Interesting compiler challenges
  • Optimize general AST instead of SQL trees
• Other use cases
  • 2D decomposition

Example:

customer_i_class = aggregate(sale_items, :ss_customer_sk,
    :ss_item_count = length(:ss_item_sk),
    :id1 = sum(:i_class_id==1),
    :id2 = sum(:i_class_id==2),
    :id3 = sum(:i_class_id==3),
    :id4 = sum(:i_class_id==4))
21
Summary
• High Performance Analytics Toolkit (HPAT) provides scripting abstractions and “bare-metal” performance
  • Matrix/vector operations, extension for parallel I/O
  • Domain-specific compiler techniques
  • Generates efficient MPI/C++
  • Uses existing HPC libraries
• Much easier and faster than alternatives
• Get involved!
  • Any contributions welcome
  • Need more and more use cases
HPAT.jl: https://github.com/IntelLabs/HPAT.jl
22
Backup
23
Big Data Analytics is Slow
• Goal: gain new insight from large datasets by domain experts
• Productivity is 1st priority
  • Scripting languages most common, fast development
  • MPI/C++ not acceptable
• Apache Hadoop and Spark dominant
  • Intuitive user interface (MapReduce, Python)
  • Master-executer library approach
• Library approach is slow
  • Loses locality, high overheads
  • Orders of magnitude slower than handwritten MPI/C++
F. McSherry et al., “Scalability! But at what COST?”, HotOS 2015.
K. Brown et al., “Have abstraction and eat performance, too: optimized heterogeneous computing with parallel patterns”, CGO 2016.
25
DistributedPass: parallelization
• 1D partitioning of data per node (see the sketch below)
  • Domain-specific heuristic for machine learning, analytics
  • Not valid for other domains!
• Column partitioning for matrices
  • Julia is column major
  • Needs good coding style by user
• 1D partitioning of parfor iterations per node
• Handle distributed-memory libraries
  • Generate necessary input/output transformations
  • Intel® Data Analytics Acceleration Library (Intel® DAAL)
    • Machine learning algorithms similar to Spark’s MLlib
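To make the 1D block partitioning concrete, here is a minimal Julia sketch of the per-rank range computation (my illustration; it gives the remainder to the last rank, a detail the simplified generated snippets elsewhere in the deck omit):

# Split N iterations (or N columns of a column-major matrix) into
# contiguous blocks, one per rank; the last rank absorbs the remainder.
function block_range(N::Int, rank::Int, nprocs::Int)
    blocksize = div(N, nprocs)
    lo = rank * blocksize + 1
    hi = rank == nprocs - 1 ? N : (rank + 1) * blocksize
    return lo:hi
end

# e.g. 10 columns over 4 ranks:
[block_range(10, r, 4) for r in 0:3]   # -> [1:2, 3:4, 5:6, 7:10]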
26
Example: Distributed Data Source Transformation Flow

points = DataSource(Matrix{Float64},HDF5,"/points","data.hdf5")

# Macro-Pass: enable Julia type inference
points::Matrix{Float64} = HPAT_h5_source("/points","data.hdf5")

# Domain-Pass: enable ParallelAccelerator
h5_size1, h5_size2 = HPAT_h5_sizes("/points","data.hdf5")
points = alloc(h5_size1, h5_size2)
HPAT_h5_read(points, "/points","data.hdf5")

# Distributed-Pass: parallelize
points = alloc(h5_size1, h5_size2/nprocs)
HPAT_h5_read(points,"/points","data.hdf5",h5_size1,h5_size2/nprocs)

# HPAT-CGen: backend code
H5Sselect_hyperslab(h5_size1, h5_size2/nprocs,…)
H5Dread(points,"/points","data.hdf5",…)
27
Backend Code Complexity
• Parallel I/O example (HDF5, MPI/C++):

    herr_t ret;
    // set up file access property list with parallel I/O access
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
    assert(plist_id != -1);
    // set parallel access with communicator
    ret = H5Pset_fapl_mpio(plist_id, comm, info);
    assert(ret != -1);
    // open file
    file_id = H5Fopen("/lsf/lsf09/sptprice.hdf5", H5F_ACC_RDONLY, plist_id);
    assert(file_id != -1);
    ret = H5Pclose(plist_id);
    assert(ret != -1);
    // open dataset
    dataset_id = H5Dopen2(file_id, "/sptprice", H5P_DEFAULT);
    assert(dataset_id != -1);
    hid_t space_id = H5Dget_space(dataset_id);
    assert(space_id != -1);
    int num_dims = 0;
    num_dims = H5Sget_simple_extent_ndims(space_id);
    assert(num_dims == 1);
    /* get data dimension info */
    hsize_t sptprice_size;
    H5Sget_simple_extent_dims(space_id, &sptprice_size, NULL);
    hsize_t my_sptprice_size = sptprice_size / mpi_nprocs;
    hsize_t my_start = my_sptprice_size * mpi_rank;
    my_sptprice = (double*)malloc(my_sptprice_size * sizeof(double));
    // create a file dataspace independently
    hid_t my_dataspace = H5Dget_space(dataset_id);
    assert(my_dataspace != -1);
    // stride and block are NULL for contiguous hyperslab
    ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, &my_start, NULL, &my_sptprice_size, NULL);
    assert(ret != -1);
    /* create a memory dataspace independently */
    hid_t mem_dataspace = H5Screate_simple(1, &my_sptprice_size, NULL);
    assert(mem_dataspace != -1);
    /* set up the collective transfer properties list */
    hid_t xfer_plist = H5Pcreate(H5P_DATASET_XFER);
    assert(xfer_plist != -1);
    ret = H5Pset_dxpl_mpio(xfer_plist, H5FD_MPIO_COLLECTIVE);
    assert(ret != -1);
    /* read data collectively */
    ret = H5Dread(dataset_id, H5T_NATIVE_FLOAT, mem_dataspace, my_dataspace, xfer_plist, my_sptprice);
    assert(ret != -1);
    ...
Hard to write manually!
28
Libraries
Benchmark                 Spark MLlib (s)   HPAT-DAAL (s)   Speedup
K-means (lib)                   59               21           2.8x
Linear Regression (lib)        176               26           6.7x
Naïve Bayes (lib)               43               23           1.9x

Execution time (s).
29
Benchmarks
• Cori at LBL/NERSC
  • Dual Haswell nodes
  • Cray Aries (Dragonfly) network
  • 64 nodes (2048 cores) used
  • Spark 1.6.0 (default Cori installation)
• Benchmarks
  • 1D_sum: sums an 8.5 billion element vector from file
  • Pi: 1 billion random points
  • Logistic regression: 2 billion samples, 10 features, single precision
  • K-Means: 320 million 20-feature points, double precision, 10 iterations, 5 centers
30
D = 10 # Number of dimensions
if __name__ == "__main__":
    sc = SparkContext(appName="PythonLR")
    points = sc.textFile(file).mapPartitions(readPointBatch).cache()
    w = 2 * np.random.ranf(size=D) - 1

    def gradient(matrix, w):
        Y = matrix[:, 0]   # point labels (first column of input file)
        X = matrix[:, 1:]  # point coordinates
        return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)

    def add(x, y):
        x += y
        return x

    for i in range(iterations):
        w -= points.map(lambda m: gradient(m, w)).reduce(add)
Logistic Regression (Spark)
https://github.com/apache/spark/blob/master/examples/src/main/python/logistic_regression.py
Scheduling, TCP/IP overheads
Python overheads
31
using HPAT

@acc hpat function logistic_regression(iterations, file)
    points = DataSource(Matrix{Float64},HDF5,"/points", file)
    responses = DataSource(Vector{Float64},HDF5,"/responses",file)
    D = size(points,1)
    N = size(points,2)
    labels = reshape(responses,1,N)
    w = reshape(2*rand(D)-1,1,D)
    for i in 1:iterations
        w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
    end
    return w
end
weights = logistic_regression(100, "mydata.hdf5")
$ mpirun -np 64 julia logistic_regression.jl
Logistic Regression (HPAT)
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
95x speedup!
32
from random import random
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonPi")
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    def add(x, y):
        x += y
        return x

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    sc.stop()
Monte Carlo Pi (Spark)
https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
Scheduling overheads
Extra array
map
reduce
33
double calcPi(int64_t N) {
    int mpi_rank, mpi_nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    int mystart = mpi_rank * (N/mpi_nprocs);
    int myend = (mpi_rank+1) * (N/mpi_nprocs);
    for (i = mystart; i < myend; i++) {
        x = rand(..);
        y = rand(..);
        sum_local += …
    }
    MPI_Reduce(…);
    return out;
}

using HPAT

@acc hpat function calcPi(n)
    x = rand(n) .* 2.0 .- 1.0
    y = rand(n) .* 2.0 .- 1.0
    return 4.0*sum(x.^2 .+ y.^2 .< 1.0)/n
end
myPi = calcPi(10^9)
$ mpirun -np 64 julia pi.jl
Monte Carlo Pi (HPAT)
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/pi.jl
1600x speedup!
Computation done in registers