Large-Scale Geographically Weighted Regression on Spark
Hung Tien Tran, Hiep Tuan Nguyen, Viet-Trung Tran
Hanoi University of Science and Technology
Introduction
- What is Geographically Weighted Regression?
- What is our work?
Source: http://desktop.arcgis.com
- GWR += large-scale spatial data, improved performance, distributed computation
Outline
- Background
- Problem
- Scalable GWR on Spark
- Experiments
- Discussion
- Conclusion
Background
- First Law of Geography (Waldo Tobler): everything is related to everything else, but near things are more related than distant things.
- GWR model:
  $y_i(u) = \beta_{0i}(u) + \beta_{1i}(u) x_{1i} + \beta_{2i}(u) x_{2i} + \dots + \beta_{mi}(u) x_{mi}$
- The estimator (weighted least squares at location $u$) takes the form:
  $\hat{\beta}(u) = (X^T W(u) X)^{-1} X^T W(u) Y$
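The estimator can be made concrete with a minimal sketch in plain Python. It assumes W(u) is diagonal and passed as a list of per-point weights, prepends an intercept column, and solves the normal equations by Gaussian elimination instead of forming the inverse. The names `local_wls` and `solve_linear_system` are illustrative and do not come from the slides.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T W(u) X is singular at this location")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def local_wls(features: Sequence[Sequence[float]],
              targets: Sequence[float],
              weights: Sequence[float]) -> List[float]:
    """Return beta(u) = (X^T W(u) X)^{-1} X^T W(u) Y for one location u.

    weights[i] is the diagonal entry of W(u) for training point i.
    """
    rows = [[1.0] + list(f) for f in features]  # intercept column (assumption)
    p = len(rows[0])
    xtwx = [[0.0] * p for _ in range(p)]
    xtwy = [0.0] * p
    for row, y, w in zip(rows, targets, weights):
        for j in range(p):
            xtwy[j] += w * row[j] * y
            for k in range(p):
                xtwx[j][k] += w * row[j] * row[k]
    return solve_linear_system(xtwx, xtwy)


# Toy data generated from y = 1 + 2x, with arbitrary positive weights.
features = [[0.0], [1.0], [2.0], [3.0]]
targets = [1.0, 3.0, 5.0, 7.0]
weights = [1.0, 0.5, 0.25, 0.125]
beta = local_wls(features, targets, weights)
```

Because the toy data is exactly linear, any choice of positive weights recovers the same intercept and slope. With noisy data the weights would change the estimate, which is the point of fitting a separate model at each location.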
Background
- Kernel function: Gaussian function
- Bandwidth: fixed bandwidth or adaptive bandwidth
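The two bandwidth styles can be sketched as follows. The slide only names the Gaussian function, so the exact form used here, exp(-0.5 (d/h)^2), is an assumption (it is the commonly used one). The adaptive rule, which sets the bandwidth to the distance of the k-th nearest training point, is also an assumption, and the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def gaussian_weight(distance: float, bandwidth: float) -> float:
    """Gaussian kernel: weight decays smoothly as distance grows."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def euclidean(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def fixed_weights(reg_point: Point, train_points: Sequence[Point],
                  bandwidth: float) -> List[float]:
    """Fixed bandwidth: the same h is used at every regression point."""
    return [gaussian_weight(euclidean(reg_point, t), bandwidth)
            for t in train_points]


def adaptive_weights(reg_point: Point, train_points: Sequence[Point],
                     k: int) -> List[float]:
    """Adaptive bandwidth: h is the distance to the k-th nearest point."""
    distances = [euclidean(reg_point, t) for t in train_points]
    bandwidth = sorted(distances)[k - 1]
    if bandwidth <= 0.0:
        raise ValueError("k-th nearest distance is zero; choose a larger k")
    return [gaussian_weight(d, bandwidth) for d in distances]


train_points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # distances 0, 5, 10
fixed = fixed_weights((0.0, 0.0), train_points, bandwidth=5.0)
adaptive = adaptive_weights((0.0, 0.0), train_points, k=2)
```

A fixed bandwidth is simple but can leave too few effective neighbours in sparse regions. An adaptive bandwidth widens the kernel there at the cost of a nearest-neighbour search per regression point.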
Problem
- Estimating a local model: $\hat{\beta}(u) = (X^T W(u) X)^{-1} X^T W(u) Y$
- Computational cost: $O(n^3)$
- Choose kernel function
- Bandwidth selection: which bandwidth is good?
- Evaluation model
Source: http://rose.bris.ac.uk
Problem
- How to apply the model to large-scale data?
- Data points
- Features
- Regression points
Large-Scale GWR on Spark
Why Spark?
- In-memory cluster-computing platform
- Supports parallel programming
- Applications are developed with high-level APIs
- Provides resilient distributed datasets and parallel operations
- Integrates with other components on Spark
Scalability, performance, user-friendly APIs
Large-Scale GWR on Spark
We propose three approaches to scaling GWR:
- Scaling Weighted Linear Regression
- Parallel Multiple WLR models
- Parallel Geographically Weighted Regression (combines the first two approaches)
Scalable GWR on Spark
Naive approach: Scaling Weighted Linear Regression
- For each regPoint:
  - Compute weight (computed in parallel)
  - Fit Weighted Linear Regression (WLR model computed in parallel)
- Summary model
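The loop structure of the naive approach can be sketched in plain Python. This is a single-machine stand-in, not the Spark implementation: in the slides the weight computation and the WLR fit inside the loop run in parallel on the cluster, while here they are ordinary local calls. The sketch assumes one feature per training point, a fixed-bandwidth Gaussian kernel, and hypothetical names (`naive_gwr`, `fit_wlr`).

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
TrainRow = Tuple[Point, float, float]  # (location, feature, label)


def gaussian_weight(distance: float, bandwidth: float) -> float:
    # Assumed kernel form; the slides only name "Gaussian function".
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def fit_wlr(xs: List[float], ys: List[float],
            ws: List[float]) -> Tuple[float, float]:
    """Weighted simple linear regression y = b0 + b1 * x (closed form)."""
    sw = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / sw
    y_mean = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for w, x, y in zip(ws, xs, ys))
    b1 = sxy / sxx
    return (y_mean - b1 * x_mean, b1)


def naive_gwr(reg_points: Sequence[Point], train_rows: Sequence[TrainRow],
              bandwidth: float) -> Dict[Point, Tuple[float, float]]:
    xs = [x for _, x, _ in train_rows]
    ys = [y for _, _, y in train_rows]
    models = {}
    for u in reg_points:  # sequential outer loop over regression points
        # Step 1: compute the weight of every training point relative to u.
        ws = [gaussian_weight(math.dist(u, loc), bandwidth)
              for loc, _, _ in train_rows]
        # Step 2: fit one weighted linear regression for this location.
        models[u] = fit_wlr(xs, ys, ws)
    return models  # step 3: the collected local models


train_rows = [((0.0, 0.0), 0.0, 1.0), ((1.0, 0.0), 1.0, 3.0),
              ((0.0, 1.0), 2.0, 5.0), ((1.0, 1.0), 3.0, 7.0)]
models = naive_gwr([(0.0, 0.0), (1.0, 1.0)], train_rows, bandwidth=1.0)
```

The outer loop is the limiting factor in this design: each regression point triggers a full pass over the training data, so the running time grows with the number of regression points even when each pass is distributed.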
Scalable GWR on Spark
Parallel Multiple WLR models
- Inputs: regression dataset, training dataset
- For each regression point: compute weight, then fit WLR
- Compute multiple WLR models in parallel
- Summary
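The second approach turns the parallelism around: each local model is fitted on its own, and many regression points are processed at once. The sketch below illustrates that shape with a thread pool from the Python standard library as a stand-in for Spark tasks. It assumes a single feature, a fixed-bandwidth Gaussian kernel, and training data small enough to be shared with every worker. All names are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]
TrainRow = Tuple[Point, float, float]  # (location, feature, label)
Model = Tuple[float, float]            # (intercept, slope)


def fit_local_model(u: Point, train_rows: Sequence[TrainRow],
                    bandwidth: float) -> Tuple[Point, Model]:
    """Compute weights for u, then fit one weighted linear regression."""
    ws = [math.exp(-0.5 * (math.dist(u, loc) / bandwidth) ** 2)
          for loc, _, _ in train_rows]
    sw = sum(ws)
    x_mean = sum(w * x for w, (_, x, _) in zip(ws, train_rows)) / sw
    y_mean = sum(w * y for w, (_, _, y) in zip(ws, train_rows)) / sw
    sxx = sum(w * (x - x_mean) ** 2 for w, (_, x, _) in zip(ws, train_rows))
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for w, (_, x, y) in zip(ws, train_rows))
    b1 = sxy / sxx
    return u, (y_mean - b1 * x_mean, b1)


def parallel_multiple_wlr(reg_points: Sequence[Point],
                          train_rows: Sequence[TrainRow],
                          bandwidth: float,
                          workers: int = 4) -> Dict[Point, Model]:
    # Every worker sees the full training data; only the regression
    # points are split across workers.
    fit = partial(fit_local_model, train_rows=train_rows, bandwidth=bandwidth)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fit, reg_points))


train_rows = [((0.0, 0.0), 0.0, 1.0), ((1.0, 0.0), 1.0, 3.0),
              ((0.0, 1.0), 2.0, 5.0), ((1.0, 1.0), 3.0, 7.0)]
reg_points = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
models = parallel_multiple_wlr(reg_points, train_rows, bandwidth=1.0)
```

Threads in CPython do not speed up CPU-bound work, so this only shows the structure. The trade-off it illustrates is that the training dataset must fit in each worker's memory.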
Scalable GWR on Spark
Parallel Geographically Weighted Regression
- Regression dataset (R)
- Training dataset (T)
- Combine dataset (R-T pairs)
- Distributed GWR computation
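One possible reading of the diagram labels is that every regression point is paired with every training point and the paired records are then reduced per regression point. The slides do not show the actual Spark operations, so the sketch below is an assumption about the shape of the computation, written in plain Python with a single feature. Each pair contributes a tuple of weighted sums, and because those sums are additive they can be merged in any order, which is the property that lets such a reduction be distributed.

```python
import math
from itertools import product
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]
TrainRow = Tuple[Point, float, float]  # (location, feature, label)
Stats = Tuple[float, float, float, float, float]


def pair_to_stats(u: Point, row: TrainRow, bandwidth: float) -> Stats:
    """Map one (regression point, training point) pair to weighted sums."""
    loc, x, y = row
    w = math.exp(-0.5 * (math.dist(u, loc) / bandwidth) ** 2)
    return (w, w * x, w * y, w * x * x, w * x * y)


def add_stats(a: Stats, b: Stats) -> Stats:
    """Associative, commutative merge, as a reduce-by-key step needs."""
    return tuple(p + q for p, q in zip(a, b))


def solve_from_stats(stats: Stats) -> Tuple[float, float]:
    """Solve the 2x2 weighted normal equations for (intercept, slope)."""
    sw, swx, swy, swxx, swxy = stats
    det = sw * swxx - swx * swx
    b1 = (sw * swxy - swx * swy) / det
    b0 = (swy - b1 * swx) / sw
    return (b0, b1)


def combined_gwr(reg_points: Sequence[Point], train_rows: Sequence[TrainRow],
                 bandwidth: float) -> Dict[Point, Tuple[float, float]]:
    reduced: Dict[Point, Stats] = {}
    # "Combine dataset": all R x T pairs, keyed by regression point.
    for u, row in product(reg_points, train_rows):
        stats = pair_to_stats(u, row, bandwidth)
        reduced[u] = add_stats(reduced[u], stats) if u in reduced else stats
    return {u: solve_from_stats(s) for u, s in reduced.items()}


train_rows = [((0.0, 0.0), 0.0, 1.0), ((1.0, 0.0), 1.0, 3.0),
              ((0.0, 1.0), 2.0, 5.0), ((1.0, 1.0), 3.0, 7.0)]
reg_points = [(0.0, 0.0), (1.0, 1.0)]
models = combined_gwr(reg_points, train_rows, bandwidth=1.0)
```

The combined dataset has |R| x |T| records, so this design trades memory and shuffle volume for parallelism over both datasets at once.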
Experiments
Environment
- Cluster: 8 nodes on Amazon Web Services
- 4 cores Intel Xeon E5-2670 v2 2.5 GHz
- 16 GB RAM, 2x40 GB SSD
- Hadoop 2.7.2 and Spark 1.6.1
Dataset
 |-- x: double (nullable = false)
 |-- y: double (nullable = false)
 |-- label: double (nullable = false)
 |-- features: vector (nullable = false)
Experiments
Testing large training dataset
Figure: time (sec) vs. number of training points

Experiments
Testing large regression dataset
Figure: time (sec) vs. number of regression points

Experiments
Testing large dataset with increasing number of features
Figure: time (sec) vs. number of regression points

Experiments
Cluster
Figure: time (sec) vs. number of nodes
Discussion
Related work
- Many GWR libraries run on a single machine
- spgwr (multiR on GRID)
- Using GPU
Our work
- First study of distributed GWR on Spark
- Easy deployment and the advantages of Spark
- Scalable and works well on a cluster
Conclusion
We have
- Proposed three approaches
- Implemented four algorithms based on Spark
- Evaluated our implementation
Future work
- Improve performance by using Pipeline and Partitions
- Release as an open-source library
THANK YOU