1
HPAT.jl - Easy and Fast Big Data Analytics
Ehsan Totoni, Todd Anderson, Wajih Ul Hassan*, Tatiana Shpeisman
Parallel Computing Lab, Intel Labs
*Intern from UIUC
JuliaCon 2016
2
HPAT overview
• High Performance Analytics Toolkit (HPAT): a compiler-based framework for big data analytics and machine learning
• Goal: efficient large-scale analytics without sacrificing programmer productivity
  • Array-style programming
  • High performance
• Built on ParallelAccelerator
  • Domain-specific compiler heuristics for parallelization
  • Uses an efficient HPC stack (e.g. MPI)
• Bridges the enormous productivity-performance gap
HPAT.jl: https://github.com/IntelLabs/HPAT.jl
3
Let’s do Data Science
Logistic regression example:

function logistic_regression(iterations)
    points = …some small data…
    responses = …
    D = size(points,1)
    N = size(points,2)
    labels = reshape(responses,1,N)
    w = reshape(2*rand(D)-1,1,D)
    for i in 1:iterations
        w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
    end
    return w
end
4
What about large data?
• Challenges:
  • Long execution time
  • Data doesn’t fit in memory
• Solution:
  • Parallelism: cluster or cloud
• How:
  • MPI/C++ (”gold standard”), MPI/Julia, Spark (library)
  • HPAT (“smart compiler”)
5
MPI/C++ (“gold standard”)
• Message Passing Interface (MPI)
• Pros:
  • Best performance!
• Cons:
  • Need to understand parallelism, MPI, parallel I/O, C++
  • High effort, tedious, error-prone, not readable…
Example code (not readable):

    herr_t ret;
    // set up file access property list with parallel I/O access
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
    assert(plist_id != -1);
    // set parallel access with communicator
    ret = H5Pset_fapl_mpio(plist_id, comm, info);
    assert(ret != -1);
    // open file
    file_id = H5Fopen("/lsf/lsf09/sptprice.hdf5", H5F_ACC_RDONLY, plist_id);
    assert(file_id != -1);
    ret = H5Pclose(plist_id);
    assert(ret != -1);
    // open dataset
    dataset_id = H5Dopen2(file_id, "/sptprice", H5P_DEFAULT);
    assert(dataset_id != -1);
    hid_t space_id = H5Dget_space(dataset_id);
    assert(space_id != -1);
    int num_dims = 0;
    num_dims = H5Sget_simple_extent_ndims(space_id);
    assert(num_dims == 1);
    /* get data dimension info */
    hsize_t sptprice_size;
    H5Sget_simple_extent_dims(space_id, &sptprice_size, NULL);
    hsize_t my_sptprice_size = sptprice_size / mpi_nprocs;
    hsize_t my_start = my_sptprice_size * mpi_rank;
    my_sptprice = (double*)malloc(my_sptprice_size * sizeof(double));
    // create a file dataspace independently
    hid_t my_dataspace = H5Dget_space(dataset_id);
    assert(my_dataspace != -1);
    // stride and block are NULL for contiguous hyperslab
    ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, &my_start, NULL, &my_sptprice_size, NULL);
    assert(ret != -1);
    ...
6
MPI/Julia
• MPI.jl
• Pros:
  • Less effort than C++
• Cons:
  • Still need to understand parallelism, MPI
  • Parallel I/O
  • Needs high-performance Julia
  • Infrastructure challenges
Example code:

function main()
    MPI.Init()
    …
    MPI.Allreduce!(send_arr, recv_arr, MPI.SUM, MPI.COMM_WORLD)
    …
    MPI.Finalize()
end
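To make the skeleton above concrete, here is a minimal, self-contained MPI.jl sketch of the same pattern (local work on each rank followed by an allreduce); the data, sizes, and file name are illustrative additions, not from the talk:

using MPI

function main()
    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)
    nprocs = MPI.Comm_size(comm)

    # Each rank computes a partial result on its own (made-up) chunk of data.
    local_chunk = rand(1_000_000)
    send_arr = [sum(local_chunk)]
    recv_arr = zeros(1)

    # Combine the per-rank partial sums; every rank gets the global result.
    MPI.Allreduce!(send_arr, recv_arr, MPI.SUM, comm)

    if rank == 0
        println("global sum over ", nprocs, " ranks: ", recv_arr[1])
    end
    MPI.Finalize()
end

main()

Launched the same way as any MPI program, e.g. `mpirun -np 4 julia sum.jl` (script name hypothetical).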
7
Apache Spark
• Map/reduce library
• Master-executer model
• Pros:
  • Easier than MPI/C++
  • Lots of “system” features
• Cons:
  • Development effort
  • Very slow
    • 100x slower than MPI/C++
    • Host language overheads, loses locality, high library overheads, etc.
Example code (Python):
if __name__ == "__main__":
    sc = SparkContext(appName="PythonLR")
    points = sc.textFile(file).mapPartitions(readPointBatch).cache()
    w = 2 * np.random.ranf(size=D) - 1

    def gradient(matrix, w):
        Y = matrix[:, 0]
        X = matrix[:, 1:]
        return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)

    def add(x, y):
        x += y
        return x

    for i in range(iterations):
        w -= points.map(lambda m: gradient(m, w)).reduce(add)
8
HPAT (“smart compiler”)
• Julia code → MPI/C++ “gold standard”
• A “smart parallelizing compiler” doesn’t exist, but…
• Observations:
  • Array code is implicitly parallel
  • Parallelism is simple for data analytics
    • map/reduce pattern
    • 1D decomposition, allreduce communication
HPAT.jl: https://github.com/IntelLabs/HPAT.jl
[Figure: 1D decomposition of the points and labels arrays across nodes; w present on every node]
9
using HPAT

@acc hpat function logistic_regression(iterations, file)
    points = DataSource(Matrix{Float64},HDF5,"/points", file)
    responses = DataSource(Vector{Float64},HDF5,"/responses",file)
    D = size(points,1)
    N = size(points,2)
    labels = reshape(responses,1,N)
    w = reshape(2*rand(D)-1,1,D)
    for i in 1:iterations
        w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
    end
    return w
end
weights = logistic_regression(100, "mydata.hdf5")
$ mpirun -np 64 julia logistic_regression.jl
Logistic Regression (HPAT)
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
95x speedup over Spark
Parallel I/O
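As a usage note, the example expects an HDF5 file containing "/points" and "/responses" datasets. A hypothetical way to create a small test input with the HDF5.jl package (the package choice, sizes, and exact on-disk layout expected by HPAT’s reader are my assumptions, not from the talk):

using HDF5

D, N = 10, 1000               # features x samples, matching the D x N layout above
points = rand(D, N)
responses = 1.0 * rand(0:1, N)  # random 0/1 labels as Float64

h5write("mydata.hdf5", "/points", points)        # Matrix{Float64} dataset
h5write("mydata.hdf5", "/responses", responses)  # Vector{Float64} dataset

Whether this layout matches exactly what the generated HDF5 reader expects is worth checking against the data-generation scripts in the HPAT.jl repository.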
10
double* logistic_regression(int64_t iterations, char* file)
{
    int mpi_rank, mpi_nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    int mystart = mpi_rank * (N/mpi_nprocs);
    int myend = (mpi_rank+1) * (N/mpi_nprocs);
    double *points = (double*)malloc((myend-mystart)*D*sizeof(double));
    …
    ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, …);
    ret = H5Dread(dataset_id, H5T_NATIVE_FLOAT, …);

    for (i = mystart; i < myend; i++) {
        …
        w_local = …
    }

    MPI_Allreduce(w_local, w, D, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return w;
}
Logistic Regression (generated)
Initialization, partitioning, allocation
Parallel I/O
Computation
Parameter synchronization
“Bare-metal” MPI/C++ code with near-zero overhead!
11
HPAT usage
• Dependencies:
  • ParallelAccelerator, Parallel HDF5, MPI
• “@acc hpat” function annotation
• Use matrix/vector operations, comprehensions
  • ParallelAccelerator operations
  • No “bad” for loops (see the sketch below)
• Column-major matrices
• All of program inside HPAT
• I/O using DataSource
HPAT.jl: https://github.com/IntelLabs/HPAT.jl
[Figure: 1D decomposition of the points and labels arrays across nodes; w present on every node]
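To illustrate the coding-style point, a hypothetical comparison (mine, not from the talk) of the whole-array style the compiler can analyze versus the kind of explicit element-wise loop it treats as “bad”; it assumes a points matrix as in the earlier example:

# Array style: a whole-array operation that ParallelAccelerator/HPAT can
# recognize as a map over the partitioned matrix plus a reduction.
sum_sq(points) = sum(points.^2)

# "Bad" style for this pipeline: an element-by-element loop with scalar
# indexing and a sequential accumulator, which defeats the pattern matching.
function sum_sq_loop(points)
    total = 0.0
    for j in 1:size(points, 2), i in 1:size(points, 1)
        total += points[i, j]^2
    end
    return total
end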
12
HPAT limitations
• HPAT can fail to parallelize!
  • Limitations in compiler analysis
  • Needs good coding style
  • Fallback: explicit map/reduce, @parfor code
• Only map/reduce parallel pattern supported
  • Data analytics, machine learning, optimization, etc.
  • Others like stencils (PDEs) not supported yet
• No sparse matrices yet
• HDF5 and text file formats
HPAT.jl: https://github.com/IntelLabs/HPAT.jl
13
Array Distribution Inference

Init(state)  # assume all arrays are partitioned
while isChanged(state)
    inferArrayDistribution(state, node)
end

function inferArrayDistribution(state, node)
    if isAssignment(node)
        # lhs and rhs are sequential if either is sequential
        seq = isSeq(state, lhs) || isSeq(state, rhs)
        isSeq(state, lhs) = isSeq(state, rhs) = seq
    elseif isGEMM(node)
        # e.g. w = labels*points' - shared parameter synchronization heuristic
        isSeq(lhs) = !isSeq(in1) && !isSeq(in2) && !isTransposed(in1) && isTransposed(in2)
    elseif isHPATcall(node)
        handleHPATcall(node)
    else
        # unknown call, assume all arrays are sequential
        isSeq(state, nodeArrs) = true
    end
end
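As an informal walk-through (mine, not from the slides) of how these rules classify the arrays in the logistic regression example:

# points :: distributed (1D-partitioned)  -- comes from DataSource
# labels :: distributed                   -- reshape of a DataSource array
# w      :: sequential (replicated)       -- small parameter vector

# w*points        -> GEMM with a sequential input, so the heuristic leaves
#                    the result distributed (one column block per rank)
# (...).*labels   -> elementwise ops keep the distributed classification
# (...)*points'   -> GEMM of two distributed inputs with the second one
#                    transposed: matches the heuristic, so the result is
#                    sequential, i.e. the small update every rank needs
#                    (realized as an allreduce for parameter synchronization)
# w -= ...        -> assignment: w stays sequential on every rank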
14
HPAT Compiler Pipeline
[Pipeline diagram] Julia source → Julia Compiler → Macro-Pass → Domain-Pass → Domain-IR → Parallel-IR → Distributed-Pass → HPAT Code Generation (MPI) with CGen → MPI/C++ source → Backend Compiler (ICC/GCC)
15
HPAT Compiler Pipeline
• MacroPass
  • “desugar” extensions like DataSource
• DomainPass
  • Generate variables, allocations, and function calls
  • Enable ParallelAccelerator pipeline
• DistributedPass
  • Infer partitioned vs. sequential arrays
  • Divide allocations, parallelize I/O and computation
  • “hook” distributed-memory libraries
• Backend code generation
  • MPI code extension for CGen
16
HPAT vs. Spark
Benchmark               Spark (s)   HPAT (s)   Speedup
1D Sum                      47        1.55       30x
1D Sum Filter               43        0.69       62x
Monte Carlo Pi              84        0.05     1680x
Logistic Regression       1061       11.13       95x
K-Means                    182        7.95       23x

Execution time (s). Cori at NERSC/LBL, 64 nodes (2048 cores).
17
Strong Scaling - Logistic Regression

Scaling of Spark vs. HPAT, execution time (s):
Nodes (Cores)   Spark   HPAT
64 (2k)          1061   11.13
128 (4k)          800    5.89
256 (8k)          634    4.11

Speedup over Spark: 95x at 64 nodes, 154x at 256 nodes.
18
Machine Learning Libraries
Intel® DAAL as backend (not readable):

    ...
    kmeans::init::Distributed<step1Local, double, kmeans::init::randomDense>
        localInit(nClusters, nBlocks * nVectorsInBlock, rankId * nVectorsInBlock);
    localInit.input.set(kmeans::init::data, dataSource.getNumericTable());
    localInit.compute();
    services::SharedPtr<byte> serializedData;
    InputDataArchive dataArch;
    localInit.getPartialResult()->serialize(dataArch);
    size_t perNodeArchLength = dataArch.getSizeOfArchive();
    if (rankId == mpi_root) {
        serializedData = services::SharedPtr<byte>(new byte[perNodeArchLength * nBlocks]);
    }
    byte *nodeResults = new byte[perNodeArchLength];
    dataArch.copyArchiveToArray(nodeResults, perNodeArchLength);
    MPI_Gather(nodeResults, perNodeArchLength, MPI_CHAR, serializedData.get(),
               perNodeArchLength, MPI_CHAR, mpi_root, MPI_COMM_WORLD);
    ….

HPAT equivalent:

using HPAT

@acc hpat function calcKmeans(k, file)
    points = DataSource(Matrix{Float64},HDF5,"/points", file)
    clusters = HPAT.Kmeans(points, k)
    return clusters
end
19
Building HPAT in Julia
• Compiler development experience
• The good:
  • Built-in type inference
  • Introspection/full control
• The bad:
  • No detailed AST definition
  • Julia compiler surprises
  • Long HPAT/ParallelAccelerator compilation time!
    • Type inference every time
20
Ongoing HPAT development
• Structured data processing without SQL!
  • Complex analytics all in array syntax
  • Inspired by TPCx-BB examples
• Array syntax for table operations
  • Join, filter, aggregate
  • Similar to DataFrames.jl
• Interesting compiler challenges
  • Optimize general AST instead of SQL trees
• Other use cases
  • 2D decomposition

Example:

customer_i_class = aggregate(sale_items, :ss_customer_sk,
    :ss_item_count = length(:ss_item_sk),
    :id1 = sum(:i_class_id==1),
    :id2 = sum(:i_class_id==2),
    :id3 = sum(:i_class_id==3),
    :id4 = sum(:i_class_id==4))
21
Summary
• High Performance Analytics Toolkit (HPAT) provides scripting abstractions and “bare-metal” performance
  • Matrix/vector operations, extension for parallel I/O
  • Domain-specific compiler techniques
  • Generates efficient MPI/C++
  • Uses existing HPC libraries
• Much easier and faster than alternatives
• Get involved!
  • Any contributions welcome
  • Need more and more use cases
HPAT.jl: https://github.com/IntelLabs/HPAT.jl
22
Backup
23
Big Data Analytics is Slow
• Goal: gain new insight from large datasets by domain experts
• Productivity is 1st priority
  • Scripting languages most common, fast development
  • MPI/C++ not acceptable
• Apache Hadoop and Spark dominant
  • Intuitive user interface (MapReduce, Python)
  • Master-executer library approach
• Library approach is slow
  • Loses locality, high overheads
  • Orders of magnitude slower than handwritten MPI/C++
F. McSherry et al., “Scalability! But at what COST?”, HotOS 2015.
K. Brown et al., “Have abstraction and eat performance, too: optimized heterogeneous computing with parallel patterns”, CGO 2016.
25
DistributedPass: parallelization
• 1D partitioning of data per node (see the sketch below)
  • Domain-specific heuristic for machine learning, analytics
  • Not valid for other domains!
• Column partitioning for matrices
  • Julia is column major
  • Needs good coding style by user
• 1D partitioning of parfor iterations per node
• Handle distributed-memory libraries
  • Generate necessary input/output transformations
  • Intel® Data Analytics Acceleration Library (Intel® DAAL)
    • Machine learning algorithms similar to Spark’s MLlib
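To make the 1D block partitioning concrete, here is a minimal Julia sketch of the per-rank range computation (my illustration; it gives the remainder to the last rank, a detail the simplified generated snippets elsewhere in the deck omit):

# Split N iterations (or N columns of a column-major matrix) into
# contiguous blocks, one per rank; the last rank absorbs the remainder.
function block_range(N::Int, rank::Int, nprocs::Int)
    blocksize = div(N, nprocs)
    lo = rank * blocksize + 1
    hi = rank == nprocs - 1 ? N : (rank + 1) * blocksize
    return lo:hi
end

# e.g. 10 columns over 4 ranks:
[block_range(10, r, 4) for r in 0:3]   # -> [1:2, 3:4, 5:6, 7:10]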
26
Example: Distributed Data Source Transformation Flow

points = DataSource(Matrix{Float64},HDF5,"/points","data.hdf5")

# Macro-Pass: enable Julia type inference
points::Matrix{Float64} = HPAT_h5_source("/points","data.hdf5")

# Domain-Pass: enable ParallelAccelerator
h5_size1, h5_size2 = HPAT_h5_sizes("/points","data.hdf5")
points = alloc(h5_size1, h5_size2)
HPAT_h5_read(points, "/points","data.hdf5")

# Distributed-Pass: parallelize
points = alloc(h5_size1, h5_size2/nprocs)
HPAT_h5_read(points,"/points","data.hdf5",h5_size1,h5_size2/nprocs)

# HPAT-CGen: backend code
H5Sselect_hyperslab(h5_size1, h5_size2/nprocs,…)
H5Dread(points,"/points","data.hdf5",…)
27
Backend Code Complexity
• Parallel I/O example (HDF5, MPI/C++):

    herr_t ret;
    // set up file access property list with parallel I/O access
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
    assert(plist_id != -1);
    // set parallel access with communicator
    ret = H5Pset_fapl_mpio(plist_id, comm, info);
    assert(ret != -1);
    // open file
    file_id = H5Fopen("/lsf/lsf09/sptprice.hdf5", H5F_ACC_RDONLY, plist_id);
    assert(file_id != -1);
    ret = H5Pclose(plist_id);
    assert(ret != -1);
    // open dataset
    dataset_id = H5Dopen2(file_id, "/sptprice", H5P_DEFAULT);
    assert(dataset_id != -1);
    hid_t space_id = H5Dget_space(dataset_id);
    assert(space_id != -1);
    int num_dims = 0;
    num_dims = H5Sget_simple_extent_ndims(space_id);
    assert(num_dims == 1);
    /* get data dimension info */
    hsize_t sptprice_size;
    H5Sget_simple_extent_dims(space_id, &sptprice_size, NULL);
    hsize_t my_sptprice_size = sptprice_size / mpi_nprocs;
    hsize_t my_start = my_sptprice_size * mpi_rank;
    my_sptprice = (double*)malloc(my_sptprice_size * sizeof(double));
    // create a file dataspace independently
    hid_t my_dataspace = H5Dget_space(dataset_id);
    assert(my_dataspace != -1);
    // stride and block are NULL for contiguous hyperslab
    ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, &my_start, NULL, &my_sptprice_size, NULL);
    assert(ret != -1);
    /* create a memory dataspace independently */
    hid_t mem_dataspace = H5Screate_simple(1, &my_sptprice_size, NULL);
    assert(mem_dataspace != -1);
    /* set up the collective transfer properties list */
    hid_t xfer_plist = H5Pcreate(H5P_DATASET_XFER);
    assert(xfer_plist != -1);
    ret = H5Pset_dxpl_mpio(xfer_plist, H5FD_MPIO_COLLECTIVE);
    assert(ret != -1);
    /* read data collectively */
    ret = H5Dread(dataset_id, H5T_NATIVE_FLOAT, mem_dataspace, my_dataspace, xfer_plist, my_sptprice);
    assert(ret != -1);
    ...
Hard to write manually!
28
Libraries
Benchmark                 Spark MLlib (s)   HPAT-DAAL (s)   Speedup
K-means (lib)                   59               21           2.8x
Linear Regression (lib)        176               26           6.7x
Naïve Bayes (lib)               43               23           1.9x

Execution time (s).
29
Benchmarks
• Cori at LBL/NERSC
  • Dual Haswell nodes
  • Cray Aries (Dragonfly) network
  • 64 nodes (2048 cores) used
  • Spark 1.6.0 (default Cori installation)
• Benchmarks
  • 1D_sum: sums an 8.5 billion element vector from file
  • Pi: 1 billion random points
  • Logistic regression: 2 billion samples, 10 features, single precision
  • K-Means: 320 million 20-feature points, double precision, 10 iterations, 5 centers
30
D = 10 # Number of dimensions
if __name__ == "__main__":
    sc = SparkContext(appName="PythonLR")
    points = sc.textFile(file).mapPartitions(readPointBatch).cache()
    w = 2 * np.random.ranf(size=D) - 1

    def gradient(matrix, w):
        Y = matrix[:, 0]   # point labels (first column of input file)
        X = matrix[:, 1:]  # point coordinates
        return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)

    def add(x, y):
        x += y
        return x

    for i in range(iterations):
        w -= points.map(lambda m: gradient(m, w)).reduce(add)
Logistic Regression (Spark)
https://github.com/apache/spark/blob/master/examples/src/main/python/logistic_regression.py
Scheduling, TCP/IP overheads
Python overheads
31
using HPAT

@acc hpat function logistic_regression(iterations, file)
    points = DataSource(Matrix{Float64},HDF5,"/points", file)
    responses = DataSource(Vector{Float64},HDF5,"/responses",file)
    D = size(points,1)
    N = size(points,2)
    labels = reshape(responses,1,N)
    w = reshape(2*rand(D)-1,1,D)
    for i in 1:iterations
        w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
    end
    return w
end
weights = logistic_regression(100, "mydata.hdf5")
$ mpirun -np 64 julia logistic_regression.jl
Logistic Regression (HPAT)
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
95x speedup!
32
from random import random
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonPi")
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    def add(x, y):
        x += y
        return x

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    sc.stop()
Monte Carlo Pi (Spark)
https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
Scheduling overheads
Extra array
map
reduce
33
double calcPi(int64_t N) {
    int mpi_rank, mpi_nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    int mystart = mpi_rank * (N/mpi_nprocs);
    int myend = (mpi_rank+1) * (N/mpi_nprocs);
    for (i = mystart; i < myend; i++) {
        x = rand(..);
        y = rand(..);
        sum_local += …
    }
    MPI_Reduce(…);
    return out;
}

using HPAT

@acc hpat function calcPi(n)
    x = rand(n) .* 2.0 .- 1.0
    y = rand(n) .* 2.0 .- 1.0
    return 4.0*sum(x.^2 .+ y.^2 .< 1.0)/n
end
myPi = calcPi(10^9)
$ mpirun -np 64 julia pi.jl
Monte Carlo Pi (HPAT)
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/pi.jl
1600x speedup!
Computation done in registers