Large-Scale Geographically Weighted Regression on Spark
Hung Tien Tran, Hiep Tuan Nguyen, Viet-Trung Tran
Hanoi University of Science and Technology


Introduction
What is Geographically Weighted Regression?

What is our work?

Source: http://desktop.arcgis.com

GWR += large-scale spatial data, improved performance, distributed computation

Outline
Background
Problem
Scalable GWR on Spark
Experiments
Discussion
Conclusion

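Background
First Law of Geography (Waldo Tobler): "Everything is related to everything else, but near things are more related than distant things."
GWR model:
$$y_i(u) = \beta_{0i}(u) + \beta_{1i}(u) x_{1i} + \beta_{2i}(u) x_{2i} + \dots + \beta_{mi}(u) x_{mi}$$
The weighted least squares estimator takes the form
$$\hat{\beta}(u) = (X^T W(u) X)^{-1} X^T W(u) Y$$
where $W(u)$ is the diagonal matrix of kernel weights at location $u$.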
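The estimator for a single location can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names `solve_linear` and `weighted_least_squares` and the sample weights are hypothetical, and the weights stand in for the diagonal of W(u) at one fixed location u.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def weighted_least_squares(rows, y, w):
    """Return beta = (X^T W X)^{-1} X^T W y for diagonal weights w."""
    p = len(rows[0])
    xtwx = [[0.0] * p for _ in range(p)]
    xtwy = [0.0] * p
    for x_i, y_i, w_i in zip(rows, y, w):
        for j in range(p):
            xtwy[j] += w_i * x_i[j] * y_i
            for k in range(p):
                xtwx[j][k] += w_i * x_i[j] * x_i[k]
    return solve_linear(xtwx, xtwy)


# Design matrix with a leading 1 for the intercept; y = 2 + 3 * x exactly.
rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 5.0, 8.0, 11.0]
w = [1.0, 0.5, 0.25, 0.125]  # hypothetical kernel weights for one location u
beta = weighted_least_squares(rows, y, w)
```

Because the sample labels are exactly linear in the feature, any positive weights recover the same intercept and slope; with real spatial data the coefficients change with the weights, which is the point of GWR.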

Background
Kernel function: Gaussian function
Bandwidth: fixed bandwidth, adaptive bandwidth
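The kernel formula did not survive in this transcript, so the sketch below assumes one common Gaussian form, w = exp(-0.5 (d / b)^2), where d is the distance to the regression point and b is the bandwidth. The adaptive variant assumes the bandwidth is the distance to the k-th nearest data point, which is one usual convention and not necessarily the one used in the talk. All function names are hypothetical.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Assumed Gaussian kernel: weight decays smoothly with distance."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def fixed_weights(reg_point, points, bandwidth):
    """Fixed bandwidth: the same bandwidth is used at every regression point."""
    return [gaussian_weight(math.dist(reg_point, p), bandwidth) for p in points]


def adaptive_weights(reg_point, points, k):
    """Adaptive bandwidth: distance to the k-th nearest data point (assumed rule).

    The bandwidth must be positive, so k should reach past any data point
    that coincides with the regression point.
    """
    distances = [math.dist(reg_point, p) for p in points]
    bandwidth = sorted(distances)[k - 1]
    return [gaussian_weight(d, bandwidth) for d in distances]


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
fixed = fixed_weights((0.0, 0.0), points, 2.0)
adaptive = adaptive_weights((0.0, 0.0), points, 3)
```

A fixed bandwidth is simpler, while an adaptive one widens the kernel where data points are sparse and narrows it where they are dense.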

Problem
Estimating a local model
Bandwidth selection
Model evaluation
Choose a kernel function
$$\hat{\beta}(u) = (X^T W(u) X)^{-1} X^T W(u) Y$$
Cost: $O(n^3)$
Which bandwidth is good?
Source: http://rose.bris.ac.uk
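The slide does not say how a bandwidth is judged. One widely used criterion in the GWR literature is leave-one-out cross-validation, sketched below as an assumption rather than as the authors' procedure. The example uses a one-dimensional coordinate, a single feature, the assumed Gaussian kernel exp(-0.5 (d / b)^2), and hypothetical names and data throughout.

```python
import math


def local_fit(u, coords, xs, ys, bandwidth, skip=None):
    """Fit y = b0 + b1 * x at location u with Gaussian weights, ignoring row `skip`."""
    sw = swx = swy = swxx = swxy = 0.0
    for i, (c, x, y) in enumerate(zip(coords, xs, ys)):
        if i == skip:
            continue
        w = math.exp(-0.5 * ((u - c) / bandwidth) ** 2)
        sw += w
        swx += w * x
        swy += w * y
        swxx += w * x * x
        swxy += w * x * y
    # Closed-form solution of the 2x2 weighted normal equations.
    b1 = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    b0 = (swy - b1 * swx) / sw
    return b0, b1


def cv_score(coords, xs, ys, bandwidth):
    """Sum of squared leave-one-out prediction errors for one bandwidth."""
    total = 0.0
    for i, (c, x, y) in enumerate(zip(coords, xs, ys)):
        b0, b1 = local_fit(c, coords, xs, ys, bandwidth, skip=i)
        total += (y - (b0 + b1 * x)) ** 2
    return total


# Hypothetical data: the slope drifts from 1.0 to 2.8 along the coordinate.
coords = [float(i) for i in range(10)]
xs = [1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 6.0, 2.0, 5.0, 3.0]
ys = [(1.0 + 0.2 * c) * x for c, x in zip(coords, xs)]
candidates = [1.0, 2.0, 4.0, 8.0]
scores = {b: cv_score(coords, xs, ys, b) for b in candidates}
best_bandwidth = min(scores, key=scores.get)
```

Each candidate bandwidth requires refitting a local model at every data point, which is part of why bandwidth selection becomes expensive on large datasets.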

Problem
How to apply the model to large-scale data?
Data points
Features
Regression points

Large-Scale GWR on Spark
Why Spark?
In-memory cluster-computing platform
Supports parallel programming
Applications are developed with high-level APIs
Provides resilient distributed datasets and parallel operations
Integrates with other components on Spark

Scalability, performance, user-friendly APIs

Large-Scale GWR on Spark
We propose three approaches to scaling GWR:
Scaling Weighted Linear Regression
Parallel Multiple WLR models
Parallel Geographically Weighted Regression (combines the first two approaches)

Scalable GWR on Spark
Naive approach: Scaling Weighted Linear Regression
For each regPoint:
Compute weight
Fit Weighted Linear Regression
Summarize the model
Compute weights in parallel
Compute the WLR model in parallel
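The loop on this slide can be sketched in plain Python. This is a hedged, single-machine illustration: `map` and `reduce` stand in for the data-parallel steps that Spark would run over a distributed training set, the model has one feature plus an intercept, the Gaussian kernel form is assumed, and every name here is hypothetical.

```python
import math
from functools import reduce


def contribution(reg_point, row, bandwidth):
    """Weighted sufficient statistics of one training row for one regression point."""
    coord, x, y = row
    w = math.exp(-0.5 * (math.dist(reg_point, coord) / bandwidth) ** 2)
    return (w, w * x, w * y, w * x * x, w * x * y)


def add_stats(a, b):
    """Element-wise sum, used as the reduce step."""
    return tuple(p + q for p, q in zip(a, b))


def solve(stats):
    """Solve the 2x2 weighted normal equations for (intercept, slope)."""
    sw, swx, swy, swxx, swxy = stats
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    intercept = (swy - slope * swx) / sw
    return intercept, slope


def naive_gwr(reg_points, training, bandwidth):
    models = {}
    for u in reg_points:  # sequential loop over regression points
        # These two steps are the ones the slide marks as parallel.
        per_row = map(lambda row: contribution(u, row, bandwidth), training)
        models[u] = solve(reduce(add_stats, per_row))
    return models


# Training rows: (coordinate, feature, label), with label = 1 + 2 * feature.
training = [((0.0, 0.0), 1.0, 3.0), ((1.0, 0.0), 2.0, 5.0),
            ((0.0, 1.0), 4.0, 9.0), ((1.0, 1.0), 7.0, 15.0)]
models = naive_gwr([(0.0, 0.0), (1.0, 1.0)], training, bandwidth=1.0)
```

The outer loop stays sequential in this variant, so each regression point pays the overhead of launching a separate distributed job.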

Scalable GWR on Spark
Naive approach

Scalable GWR on Spark
Parallel Multiple WLR models
Regression dataset
Training dataset
Compute weight
WLR
Compute multiple WLR models in parallel
Summary
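One way to read this slide is that every regression point gets its own independent WLR fit, and the fits run in parallel while each worker sees the whole training dataset. The sketch below follows that reading as an assumption. A thread pool is only a local stand-in for mapping over a distributed regression dataset, and the function names, the Gaussian kernel form, and the sample data are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def fit_local_model(reg_point, training, bandwidth):
    """Compute Gaussian weights for one regression point and fit its WLR model."""
    sw = swx = swy = swxx = swxy = 0.0
    for coord, x, y in training:
        w = math.exp(-0.5 * (math.dist(reg_point, coord) / bandwidth) ** 2)
        sw += w
        swx += w * x
        swy += w * y
        swxx += w * x * x
        swxy += w * x * y
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    intercept = (swy - slope * swx) / sw
    return reg_point, (intercept, slope)


def parallel_wlr_models(reg_points, training, bandwidth, workers=4):
    """Fit one model per regression point, with the fits running concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        fitted = pool.map(lambda u: fit_local_model(u, training, bandwidth), reg_points)
        return dict(fitted)


# Two spatial clusters with different slopes (1 near x = 0, 3 near x = 10).
training = [((0.0, 0.0), 1.0, 1.0), ((0.0, 0.5), 2.0, 2.0), ((0.5, 0.0), 3.0, 3.0),
            ((10.0, 0.0), 1.0, 3.0), ((10.0, 0.5), 2.0, 6.0), ((10.5, 0.0), 3.0, 9.0)]
models = parallel_wlr_models([(0.0, 0.0), (10.0, 0.0)], training, bandwidth=1.0)
```

This layout suits many regression points and a training set small enough to copy to every worker; it does not help when the training set itself is too large for one machine.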

Scalable GWR on Spark
Parallel Multiple WLR models

Scalable GWR on Spark
Parallel Geographically Weighted Regression
Regression dataset (R)
Training dataset (T)
Combine dataset (RT)
Distributed GWR Computation
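The diagram labels suggest that the regression and training datasets are combined into (regression point, training row) pairs before the distributed computation. Under that assumption, the sketch below builds the pairs with a Cartesian product, turns each pair into weighted sufficient statistics, sums them per regression point, and solves each small system. It is a single-machine illustration with hypothetical names, an assumed Gaussian kernel, and one feature plus an intercept; on Spark the corresponding RDD operations would be along the lines of `cartesian`, `map`, and `reduceByKey`.

```python
import math
from itertools import product


def pair_stats(reg_point, row, bandwidth):
    """Weighted sufficient statistics contributed by one (R, T) pair."""
    coord, x, y = row
    w = math.exp(-0.5 * (math.dist(reg_point, coord) / bandwidth) ** 2)
    return (w, w * x, w * y, w * x * x, w * x * y)


def reduce_by_key(keyed):
    """Sum the statistics of all pairs that share a regression point."""
    acc = {}
    for key, stats in keyed:
        if key in acc:
            acc[key] = tuple(a + b for a, b in zip(acc[key], stats))
        else:
            acc[key] = stats
    return acc


def solve(stats):
    """Solve the 2x2 weighted normal equations for (intercept, slope)."""
    sw, swx, swy, swxx, swxy = stats
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    intercept = (swy - slope * swx) / sw
    return intercept, slope


def combined_gwr(reg_points, training, bandwidth):
    combined = product(reg_points, training)  # the combined RT dataset
    keyed = ((u, pair_stats(u, row, bandwidth)) for u, row in combined)
    return {u: solve(s) for u, s in reduce_by_key(keyed).items()}


# Training rows: (coordinate, feature, label), with label = 4 - 0.5 * feature.
training = [((0.0, 0.0), 2.0, 3.0), ((2.0, 0.0), 4.0, 2.0),
            ((0.0, 2.0), 6.0, 1.0), ((2.0, 2.0), 10.0, -1.0)]
reg_points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
models = combined_gwr(reg_points, training, bandwidth=1.5)
```

The combined dataset has one record per pair, so its size is the product of the two dataset sizes; the gain is that both the regression points and the training rows are spread across workers.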

Scalable GWR on Spark
Parallel Geographically Weighted Regression

Experiments
Environment
Cluster: 8 nodes on Amazon Web Services
4 cores, Intel Xeon E5-2670 v2, 2.5 GHz
16 GB RAM, 2 x 40 GB SSD
Hadoop 2.7.2 and Spark 1.6.1
Dataset
 |-- x: double (nullable = false)
 |-- y: double (nullable = false)
 |-- label: double (nullable = false)
 |-- features: vector (nullable = false)

Experiments
Testing large training dataset
[Figure: time (sec) vs. number of training points]

Experiments
Testing large regression dataset
[Figure: time (sec) vs. number of regression points]

Experiments
Testing large dataset with increasing number of features
[Figure: time (sec) vs. number of regression points]

Experiments
Cluster
[Figure: time (sec) vs. number of nodes]

Discussion
Related work:
Many GWR libraries run on a single machine
spgwr (multiR on GRID)
GPU-based implementations
Our work:
First study of distributed GWR on Spark
Easy deployment, with the advantages of Spark
Scalable and works well on a cluster

Conclusion
We have:
Proposed three approaches
Implemented four algorithms based on Spark
Evaluated our implementation
Future work:
Improve performance by using Pipeline and Partitions
Release as an open-source library

THANK YOU