STING: A Statistical Information Grid Approach to Spatial Data


STING: A Statistical Information Grid Approach to Spatial Data Mining

Presentation 2 (Group 14)

CSE 590 Data Mining

Prof. Anita Wasilewska

SUNY Stony Brook

Presented by: Tejas Somani, Nikhil Pujari

STING: A Statistical Information Grid Approach to Spatial Data Mining

Paper by:

Wei Wang

Department of Computer Science

University of California, Los Angeles

CA 90095, U.S.A.

weiwang@cs.ucla.edu

Jiong Yang

Department of Computer Science

University of California, Los Angeles

CA 90095, U.S.A.

jyang@cs.ucla.edu

Richard Muntz

Department of Computer Science

University of California, Los Angeles

CA 90095, U.S.A.

muntz@cs.ucla.edu

VLDB Conference, Athens, Greece, 1997

References

http://georges.gardarin.free.fr/Cours_XMLDM_Master2/Sting.PDF

http://www.webopedia.com/TERM/S/spatial_data.html

Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). Morgan Kaufmann, 2002.

Peter Grabusts and Arkady Borisov. Using Grid-clustering Methods in Data Classification. Riga Technical University.

What is Spatial Data?

Spatial data may be thought of as features located on or referenced to the Earth's surface, such as roads, streams, political boundaries, schools, land use classifications, property ownership parcels, drinking water intakes, pollution discharge sites - in short, anything that can be mapped.

Spatial Area: The area that encompasses the locations of all the spatial data is called the spatial area.

http://www.webopedia.com/TERM/S/spatial_data.html

STING: The Overview

• STING is a grid-based method for efficiently processing many common region-oriented queries on a set of points.

• A set of points satisfying some criterion defines a region.

• It is a hierarchical method. The idea is to capture statistical information associated with spatial cells in such a manner that whole classes of queries can be answered without referring to the individual objects.

We want to cluster the records that are in a spatial table in terms of location.

Placement of a record in a grid cell is completely determined by its physical location.

http://georges.gardarin.free.fr/Cours_XMLDM_Master2/Sting.PDF
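A minimal sketch of how a record's location alone can determine its grid cell. The function names, the `bounds` tuple, and the square `cells_per_side` grid are illustrative assumptions, not taken from the paper; points are assumed to lie inside the spatial area.

```python
from collections import defaultdict


def cell_index(x, y, bounds, cells_per_side):
    """Return the (row, col) of the grid cell containing point (x, y).

    bounds is (x_min, y_min, x_max, y_max) of the spatial area (assumed layout).
    """
    x_min, y_min, x_max, y_max = bounds
    col = int((x - x_min) / (x_max - x_min) * cells_per_side)
    row = int((y - y_min) / (y_max - y_min) * cells_per_side)
    # Points lying exactly on the upper boundary go into the last cell.
    col = min(col, cells_per_side - 1)
    row = min(row, cells_per_side - 1)
    return row, col


def assign_to_cells(points, bounds, cells_per_side):
    """Group (x, y) points by the cell they fall into."""
    cells = defaultdict(list)
    for x, y in points:
        cells[cell_index(x, y, bounds, cells_per_side)].append((x, y))
    return dict(cells)


grid = assign_to_cells([(0.5, 0.5), (5.0, 1.0), (8.0, 8.0)], (0, 0, 8, 8), 4)
```

No attribute value is consulted here: only the coordinates decide the cell, which is the property the slide describes.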

Grid Cell Hierarchy

• The spatial area is divided into rectangular cells.

• The cells form a hierarchical structure with several levels.

• Each cell at a higher level is partitioned into a number of cells at the next lower level (here, 4); i.e., a cell at level i corresponds to the union of the areas of its children at level i + 1.

• The size of the leaf-level cells depends on the density of objects.

http://georges.gardarin.free.fr/Cours_XMLDM_Master2/Sting.PDF
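The parent/child relation can be sketched with simple index arithmetic. This assumes the root is level 1, every cell splits into a 2 x 2 block of children, and cells are addressed as (level, row, col); these conventions and function names are illustrative choices.

```python
def children(level, row, col):
    """Return the four children at level + 1 of the cell (level, row, col)."""
    return [(level + 1, 2 * row + dr, 2 * col + dc)
            for dr in (0, 1) for dc in (0, 1)]


def parent(level, row, col):
    """Return the cell at level - 1 whose area contains this cell."""
    if level <= 1:
        raise ValueError("the root cell has no parent")
    return (level - 1, row // 2, col // 2)


def cells_at_level(level):
    """Number of cells at a level when each cell has 4 children."""
    return 4 ** (level - 1)
```

Under these assumptions, level 3 holds 16 cells, and the area of a cell at level i is exactly the union of its four children at level i + 1.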

Hierarchical Structure for STING Clustering

Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber

Statistical Parameters

• For each cell we have attribute-dependent and attribute-independent parameters.

• The attribute-independent parameter is the number of objects in a cell, n.

• For attribute-dependent parameters, it is assumed that the attributes of each object have numerical values.

• For each numerical attribute we have the following five parameters:

Statistical Parameters..

• m: the mean of all values in this cell

• s: the standard deviation of all values in this cell

• min: the minimum value of the attribute in this cell

• max: the maximum value of the attribute in this cell

• distribution: the type of distribution that the attribute values in this cell follow

Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber

Statistical Parameters..

• Statistical information regarding the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.

• The statistical parameters for the cells in the lowest layer are computed directly from the values present in the table when data are loaded into the database.

• The statistical parameters for the cells in all the other levels are computed from their respective children cells in the lower level.
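A sketch of the bottom-up computation for n, m, s, min and max. The `CellStats` class and function names are illustrative; the standard deviation is the population form, every cell is assumed non-empty, and the distribution type is left out because determining it for a parent needs a separate test not covered on these slides.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    n: int          # number of objects
    m: float        # mean
    s: float        # standard deviation (population form)
    minimum: float
    maximum: float


def leaf_stats(values):
    """Compute the parameters of a bottom-layer cell directly from its values."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return CellStats(n, m, s, min(values), max(values))


def merge_stats(children):
    """Compute a parent's parameters from its children's parameters only."""
    n = sum(c.n for c in children)
    m = sum(c.m * c.n for c in children) / n
    # E[x^2] of the parent is the count-weighted average of each child's s^2 + m^2.
    second_moment = sum((c.s ** 2 + c.m ** 2) * c.n for c in children) / n
    s = math.sqrt(max(second_moment - m ** 2, 0.0))  # clamp rounding noise
    return CellStats(n, m, s,
                     min(c.minimum for c in children),
                     max(c.maximum for c in children))


parent = merge_stats([leaf_stats([1, 2, 3]), leaf_stats([4, 5, 6, 7])])
```

Merging the two leaves gives the same n, m, s, min and max as computing them over all seven values at once, so upper layers never need to revisit the raw table.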

Query Types and Query Processing

1) Query Types

• An SQL-like language is used to describe queries.

• Two types of common queries are found: one finds regions satisfying certain constraints, and the other takes in a region and returns some attribute of the region.

2) Query Processing

• We use a top-down approach to answer spatial data queries.

• Start from a pre-selected layer, typically one with a small number of cells.

Query Processing..

• The pre-selected layer does not have to be the topmost layer.

• For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.

• The confidence interval is calculated using the statistical parameters of each cell.

• From the calculated interval, label each cell as relevant or irrelevant for this query.

• Remove irrelevant cells from further consideration.
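One way such a relevance test could look, as a rough sketch rather than the paper's exact procedure. It assumes the attribute is normally distributed within the cell, uses the normal approximation to a binomial proportion for the interval, and takes a hypothetical query of the form "at least `min_fraction` of the objects have a value in [lo, hi]"; all names and the default z = 1.96 are illustrative.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def proportion_in_range(m, s, lo, hi):
    """Estimated fraction of values in [lo, hi], assuming a normal distribution."""
    if s == 0:
        return 1.0 if lo <= m <= hi else 0.0
    p = normal_cdf(hi, m, s) - normal_cdf(lo, m, s)
    return min(max(p, 0.0), 1.0)


def confidence_interval(p, n, z=1.96):
    """Approximate interval for the true proportion given estimate p over n objects."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def is_relevant(n, m, s, lo, hi, min_fraction, z=1.96):
    """Label a cell relevant if the interval's upper end reaches min_fraction."""
    if n == 0:
        return False
    p = proportion_in_range(m, s, lo, hi)
    _, upper = confidence_interval(p, n, z)
    return upper >= min_fraction
```

Only n, m and s of the cell are read, so a cell can be labeled without touching its individual objects.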

Query Processing..

• When finished with the current layer, proceed to the next lower level.

• Processing of the next lower level examines only the remaining relevant cells.

• Repeat this process until the bottom layer is reached.

• At this time, if the query specifications are met, the regions of relevant cells that satisfy the query are returned.

• Otherwise, the data that fall into the relevant cells are retrieved and further processed until they meet the requirements of the query.
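The top-down procedure can be sketched as a layer-by-layer loop. The `Cell` class, the `top_down_query` name, and the count threshold used as the relevance test are illustrative stand-ins (the real test would be the confidence-interval check); the hierarchy is assumed to have uniform depth, and the final step of retrieving raw data is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    n: int                                   # number of objects in the cell
    children: list = field(default_factory=list)


def top_down_query(start_layer, is_relevant):
    """Return the bottom-layer cells that remain relevant after pruning."""
    current = list(start_layer)
    while True:
        # Label cells in the current layer and drop the irrelevant ones.
        relevant = [c for c in current if is_relevant(c)]
        # Only children of relevant cells are examined at the next lower level.
        next_layer = [child for c in relevant for child in c.children]
        if not next_layer:                   # bottom layer reached
            return relevant
        current = next_layer


a, b, c, d = Cell("a", 50), Cell("b", 5), Cell("c", 5), Cell("d", 2)
layer = [Cell("left", 55, [a, b]), Cell("right", 7, [c, d])]
result = top_down_query(layer, lambda cell: cell.n >= 10)
```

In this toy run the "right" cell is pruned at the starting layer, so its children c and d are never examined, and only leaf a is returned.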

Different Grid Levels during Query Processing

http://georges.gardarin.free.fr/Cours_XMLDM_Master2/Sting.PDF

Finally..

Strength and Weakness of STING

Strength:

• The grid structure facilitates parallel processing and incremental updating.

• Query processing is very efficient: the computational cost is O(g), where g is the total number of grid cells at the lowest level (usually much smaller than n, the total number of objects).

• It is query-independent, as the statistical information stored in the cells is summary information of the data.

Weakness:

• All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.

Thank You

All the BEST for FINALS!!!
