SATSCAN VERSUS DBSCAN


Presented by: Group 7
Gayathri Gandhamuneni & Yumeng Wang

AGENDA
- Problem Statement
- Motivation / Novelty
- Related Work & Our Contributions
- Proposed Approach
- Key Concepts
- Validation
- Results
- Conclusions
- Future Work

PROBLEM STATEMENT
Input:
- Two different clustering algorithms (DBSCAN, SaTScan)
- The same input dataset
- Criteria of comparison
Output:
- Result of the comparison: data / graph
Constraints:
- DBSCAN: no data about efficiency
- SaTScan software: one predefined shape
Objective:
- Usage scenarios: which algorithm can be used where?

MOTIVATION / NOVELTY
- Different clustering algorithms exist, categorized into different types
- Existing comparisons: algorithms within the same category
- No systematic way of comparison; biased comparisons
- No situation-based comparison: which to use where?
- No comparison between DBSCAN and SaTScan

RELATED WORK
Comparison of clustering algorithms:
- Same type of algorithms
  - Density based: DBSCAN & OPTICS
  - Density based: DBSCAN & SNN
- Different types of algorithms
  - DBSCAN vs. K-means
  - K-means (centroid based) vs. hierarchical and Expectation-Maximization (distance based)
  - Our work: DBSCAN vs. SaTScan

PROPOSED APPROACH
Our approach:
- Two different types of clustering algorithms
  - DBSCAN
  - SaTScan
- Unbiased comparison
- Systematic: 3 factors and the same input datasets
  - Shape of the cluster
  - Statistical significance
  - Scalability

KEY CONCEPTS - 1
Clustering:
- The task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups
- Used in data mining, statistical analysis, and many more fields

Real-world applications:
- Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones
- Field robotics: robotic situational awareness, to track objects and detect outliers in sensor data

KEY CONCEPTS - 2

Types of clustering algorithms:
- Connectivity based / hierarchical
  Core idea: objects are more related to nearby objects (distance) than to objects farther away
- Centroid based
  Core idea: clusters are represented by a central vector, which may not necessarily be a member of the data set
  Example: K-means
- Distribution based
  Core idea: clusters can be defined as objects belonging most likely to the same distribution
- Density based
  Core idea: clusters are areas of higher density than the remainder of the data set

KEY CONCEPTS - DBSCAN
Density-based clustering
Arguments:
- Minimum number of points: MinPts
- Radius: Eps
Density = number of points within the specified radius (Eps)

Three types of points:
- Core point: at least MinPts points within Eps
- Border point: fewer than MinPts points within Eps, but in the neighborhood of a core point
- Noise point: neither a core point nor a border point
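The three point types can be illustrated with a minimal Python sketch. It is not taken from the project's code: the function name `classify_points` and the sample coordinates are hypothetical, and it assumes that a point counts as its own neighbor and that a core point needs at least `min_pts` points within `eps` (the "greater than or equal to" convention used on Backup Slide 1).

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    Assumptions: Euclidean distance, a point is its own neighbor, and a
    core point has at least min_pts points within eps.
    """
    # Indices of all points within eps of each point (brute force, O(n^2)).
    neighborhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nb) >= min_pts for nb in neighborhoods]

    labels = []
    for i, nb in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in nb):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical sample data: a small group of points plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.0, min_pts=4)
print(labels)
```

With these values only (1, 1) has four points within the radius, so it is the single core point; its three direct neighbors become border points and the remaining two points are noise.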

EXAMPLE - DBSCAN

[Figure: Dataset 1]

DBSCAN RESULTS - 1

DBSCAN output on dataset 1: Min-Neighbors = 3, Radius = 5
Number of clusters = 36

DBSCAN RESULTS - 2

DBSCAN output on dataset 1: Min-Neighbors = 7, Radius = 1
Number of clusters = 0

DBSCAN RESULTS - 3

DBSCAN output on dataset 1: Min-Neighbors = 20, Radius = 20
Number of clusters = 4

KEY CONCEPTS - 3
SaTScan: spatial scan statistics
Input:
- Dataset
- Null hypothesis model
Procedure:
- Scanning window with a predefined shape
- Vary the size of the window
- Calculate the likelihood ratio => most likely clusters
- Test statistical significance (Monte Carlo sampling, 1000 runs)
Output:
- Clusters with p-values
  - Significant / primary
  - Insignificant / secondary

CHALLENGES
Tuning parameters (DBSCAN):
- Manual tuning to detect clusters
- Hard to set correct parameters
Design of appropriate datasets:
- To demonstrate the criteria of comparison

VALIDATION
Experiment:
- Assumptions based on theory
- Designing datasets and running experiments: able to validate the assumptions with results

CLUSTER SHAPE - DBSCAN vs. SaTScan

[Figure slides: cluster shape results for DBSCAN and SaTScan]

STATISTICAL SIGNIFICANCE
- CSR (complete spatial randomness) dataset: 1000 points
- CSR dataset: 2000 points
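The idea behind testing significance on a CSR dataset (scan a window, take the largest likelihood ratio, then rank it against Monte Carlo replicas drawn under the null hypothesis) can be sketched in Python. This is a simplified illustration, not the SaTScan software and not the project's code. It makes several assumptions that the slides do not state: points lie in the unit square, which is split into a 10 x 10 grid; windows are circles of 1 to 3 cells around each cell; the Poisson-model log likelihood ratio is used; and only 99 replications are run (instead of the 1000 runs on the slide) to keep the example fast. All function names are hypothetical.

```python
import math
import random

GRID = 10  # assumption: the unit square is split into GRID x GRID equal cells

def cell_counts(points):
    """Count the points that fall into each grid cell."""
    counts = [[0] * GRID for _ in range(GRID)]
    for x, y in points:
        counts[min(int(y * GRID), GRID - 1)][min(int(x * GRID), GRID - 1)] += 1
    return counts

def log_likelihood_ratio(c, e, n):
    """Poisson-model LLR for a window with c observed and e expected of n points."""
    if c <= e:
        return 0.0  # only windows with an excess of points are candidates
    outside = (n - c) * math.log((n - c) / (n - e)) if c < n else 0.0
    return c * math.log(c / e) + outside

# Circular windows: every cell as a center, radii of 1 to 3 cells.
WINDOWS = [
    [(r, c) for r in range(GRID) for c in range(GRID)
     if (r - r0) ** 2 + (c - c0) ** 2 <= radius ** 2]
    for r0 in range(GRID) for c0 in range(GRID) for radius in (1, 2, 3)
]

def max_llr(points):
    """Scan all windows and return the largest log likelihood ratio."""
    counts, n = cell_counts(points), len(points)
    best = 0.0
    for cells in WINDOWS:
        observed = sum(counts[r][c] for r, c in cells)
        expected = n * len(cells) / (GRID * GRID)  # uniform null hypothesis
        best = max(best, log_likelihood_ratio(observed, expected, n))
    return best

def csr_points(n, rng):
    """Draw n points uniformly at random in the unit square (CSR)."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def monte_carlo_p_value(points, replications, rng):
    """Rank the observed statistic among replicas simulated under CSR."""
    observed = max_llr(points)
    exceed = sum(max_llr(csr_points(len(points), rng)) >= observed
                 for _ in range(replications))
    return (exceed + 1) / (replications + 1)

rng = random.Random(7)
csr = csr_points(300, rng)
# Hypothetical clustered data: CSR background plus 50 points in one small patch.
clustered = csr_points(250, rng) + [
    (0.5 + 0.05 * rng.random(), 0.5 + 0.05 * rng.random()) for _ in range(50)
]
p_csr = monte_carlo_p_value(csr, 99, rng)
p_clustered = monte_carlo_p_value(clustered, 99, rng)
print(p_csr, p_clustered)
```

With R replications the smallest p-value this procedure can report is 1 / (R + 1), so a strongly clustered dataset bottoms out at that value while a pure CSR dataset can land anywhere in (0, 1].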

RUNTIME - Number of Points - DBSCAN

RUNTIME - Number of Points - SaTScan

RUNTIME - Number of Clusters - DBSCAN vs. SaTScan

Datasets: 3000 points
Same clusters!

RUNTIME - Number of Clusters - DBSCAN

RUNTIME - Number of Clusters - SaTScan

CONCLUSIONS

S.No | Factor of comparison | DBSCAN | SaTScan
1 | Number of clusters not known beforehand | Yes | Yes
2 | Shape: data has different shaped clusters | Yes | No: only 1 shape of clusters (circle, ellipse, rectangle, ...)
3 | Runtime: how much time to form clusters? | Less runtime | More runtime: iterative approach to detect clusters, plus Monte Carlo sampling
4 | Scalability: how well it scales when data size is increased | Still manageable runtime; curse of dimensionality | Runtime grows with data size and number of clusters
5 | Statistical significance: how significant are the clusters detected? | No significance factor | Significance is at the core
6 | Noise: is noise allowed or should all points be in a cluster? | Yes | Yes

FUTURE WORK
- Same project on real-world datasets
- Run more instances of the experiments
- Control over parameters
- Compare with other types of clustering algorithms

QUESTIONS?

BACKUP SLIDE - 1

DBSCAN requires two parameters: epsilon (eps) and minimum points (minPts). It starts with an arbitrary starting point that has not been visited. It then finds all the neighbor points within distance eps of the starting point.

If the number of neighbors is greater than or equal to minPts, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.

If the number of neighbors is less than minPts, the point is marked as noise.

If a cluster is fully expanded (all points within reach are visited), the algorithm proceeds to iterate through the remaining unvisited points in the dataset.

BACKUP SLIDE 2 - CONCLUSIONS

DBSCAN works:
- Same-density clusters
- Number of clusters not known beforehand
- Different shaped clusters
- All points need not be in clusters: the noise concept is present

DBSCAN doesn't work:
- Varying-density clusters
- Quality of DBSCAN depends on epsilon
- With Euclidean distance, high-dimensional data suffers from the curse of dimensionality
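The procedure described on Backup Slide 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: it uses brute-force Euclidean neighbor search, counts a point as its own neighbor, and replaces the recursion with an explicit work list. The names `dbscan`, `region_query`, and `count_clusters` and the sample points are hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster id per point; NOISE (-1) marks noise points."""
    def region_query(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster_id += 1  # i is a core point: start a new cluster
        labels[i] = cluster_id
        pending = list(neighbors)
        while pending:
            j = pending.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                pending.extend(j_neighbors)  # j is also core: keep expanding
    return labels

def count_clusters(labels):
    return len(set(labels) - {NOISE})

# Hypothetical sample data: two small squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9), (4, 5)]
for eps in (0.5, 1.5, 10):
    print(eps, count_clusters(dbscan(points, eps=eps, min_pts=3)))
```

On this sample the same `min_pts` yields zero, two, or one cluster depending on `eps`, which mirrors the point that the quality of the result depends on epsilon.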

TO DO

CLUSTER SHAPE - DBSCAN RESULTS

SHAPE - SATSCAN RESULTS

p-value: 0.001
