
Frameworks and Algorithms for Regional Knowledge Discovery


Page 1: Frameworks and Algorithms for  Regional Knowledge Discovery

Frameworks and Algorithms for Regional Knowledge Discovery

Christoph F. Eick

Department of Computer Science, University of Houston

1. Motivation: Why is Regional Knowledge Important?

2. Region Discovery Framework

3. A Family of Clustering Algorithms for Region Discovery

4. Case Studies: Extracting Regional Knowledge
   • Regional Regression
   • Regional Association Rule Mining
   • Regional Models of User Behaviour on the Internet
   • [Co-location Mining]

5. [Analyzing Related Datasets]

6. Summary

1

Page 2: Frameworks and Algorithms for  Regional Knowledge Discovery

Ch. Eick: Regional Knowledge Discovery

Spatial Data Mining

• Definition: Spatial data mining is the process of discovering interesting patterns from large spatial datasets; it organizes by location what is interesting.

• Challenges:
  – Information is not uniformly distributed
  – Autocorrelation
  – Space is continuous
  – Complex spatial data types
  – Large dataset sizes and many possible patterns
  – Patterns exist at different levels of resolution
  – Importance of maps as summaries
  – Importance of regional knowledge

2

Page 3: Frameworks and Algorithms for  Regional Knowledge Discovery

Ch. Eick: Regional Knowledge Discovery

Why Is Regional Knowledge Important in Spatial Data Mining?

• It has been pointed out in the literature that “whole map statistics are seldom useful”, that “most relationships in spatial data sets are geographically regional, rather than global”, and that “there is no average place on the Earth’s surface” [Goodchild03, Openshaw99].

• Simpson’s Paradox – global models may be inconsistent with regional models [Simpson1951].

• Therefore, it is not surprising that domain experts are mostly interested in discovering hidden patterns at a regional scale rather than a global scale.

3

Page 4: Frameworks and Algorithms for  Regional Knowledge Discovery

Ch. Eick: Regional Knowledge Discovery

Example: Regional Association Rules

Rule 1

Rule 3

Rule 2

Rule 4

Scopes of the 4 Rules in

4

Page 5: Frameworks and Algorithms for  Regional Knowledge Discovery

Ch. Eick: Regional Knowledge Discovery

Goal of the Presented Research

Develop and implement an integrated computational framework useful for data analysts and scientists from diverse disciplines for extracting regional knowledge in spatial datasets in a highly automated fashion.

5

Page 6: Frameworks and Algorithms for  Regional Knowledge Discovery

Ch. Eick: Regional Knowledge Discovery

Related Work

• Spatial co-location pattern discovery [Shekhar et al.]
• Spatial association rule mining [Han et al.]
• Localized associations in segments of the basket data [Yu et al.]
• Spatial statistics on hot spot detection [Tay and Brimicombe et al.]
• There is some work on geo-regression techniques (to be discussed later)…

6

Comment: Most work centers on extracting global knowledge from spatial datasets.

Page 7: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science

Preview: A Framework for Extracting Regional Knowledge from Spatial Datasets

RD-Algorithm

Application 1: Supervised Clustering [EVJW07]
Application 2: Regional Association Rule Mining and Scoping [DEWY06, DEYWN07]
Application 3: Find Interesting Regions with Respect to a Continuous Variable [CRET08]
Application 4: Regional Co-location Mining Involving Continuous Variables [EPWSN08]
Application 5: Find "representative" regions (Sampling)
Application 6: Regional Regression [CE09]
Application 7: Multi-Objective Clustering [JEV09]
Application 8: Change Analysis in Related Datasets [RE09]

[Figure: wells in Texas; green: safe well with respect to arsenic, red: unsafe well]

=1.01

=1.04

UH-DMML

7

Page 8: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

2. Region Discovery Framework

8

Page 9: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Region Discovery Framework (2)

We assume we have spatial or spatio-temporal datasets that have the following structure: (<spatial attributes>; <non-spatial attributes>), e.g. (longitude, latitude, class_variable) or (longitude, latitude, continuous_variable).

Clustering occurs in the space of the spatial attributes; regions are found in this space.

The non-spatial attributes are used by the fitness function, but neither in distance computations nor by the clustering algorithm itself.

For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same.

9

Page 10: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Region Discovery Framework (3)

The algorithms we currently investigate solve the following problem:

Given:
• A dataset O with a schema R
• A distance function d defined on instances of R
• A fitness function q(X) that evaluates clusterings X = {c_1, …, c_k} as follows:

$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} i(c) \cdot \mathrm{size}(c)^{\beta} \quad \text{with } \beta \geq 1$$

Objective: Find c_1, …, c_k ⊆ O such that:
1. c_i ∩ c_j = ∅ if i ≠ j
2. X = {c_1, …, c_k} maximizes q(X)
3. All clusters c_i ∈ X are contiguous (each pair of objects belonging to c_i has to be Delaunay-connected with respect to c_i and to d)
4. c_1 ∪ … ∪ c_k ⊆ O
5. c_1, …, c_k are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
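As a rough illustration of the reward-based fitness function above, the following Python sketch computes q(X) for a toy clustering and ranks clusters by reward. The interestingness measure `unsafe_share`, the helper names, and the sample data are hypothetical stand-ins for a domain-specific i(c); they are not taken from the talk.

```python
def fitness(clusters, interestingness, beta=1.1):
    """q(X): sum of i(c) * size(c)**beta over all clusters c in X."""
    return sum(interestingness(c) * len(c) ** beta for c in clusters)


def rank_by_reward(clusters, interestingness, beta=1.1):
    """Return (reward, cluster) pairs, highest reward first."""
    rewards = [(interestingness(c) * len(c) ** beta, c) for c in clusters]
    return sorted(rewards, key=lambda pair: pair[0], reverse=True)


def unsafe_share(cluster):
    """Toy interestingness (an assumption): share of label-1 objects above 0.5."""
    share = sum(label for _, _, label in cluster) / len(cluster)
    return max(0.0, share - 0.5)


# Objects are (longitude, latitude, class_variable) triples.
clustering = [
    [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 0)],  # mostly unsafe wells
    [(5, 5, 0), (5, 6, 0)],                        # only safe wells
]
print(fitness(clustering, unsafe_share, beta=2.0))
```

Because the reward grows with size(c) raised to β, a larger β puts a higher premium on cluster size relative to interestingness.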

10

Page 11: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Measure of Interestingness i(c)

The function i(c) is an interestingness measure for a region c, a quantity based on domain interest that reflects how "newsworthy" the region is.

In our past work, we have designed a suite of measures of interestingness for:
• Supervised Clustering [PKDD06]
• Hot spots and cool spots [ICDM06]
• Scope of regional patterns [SSTDM07, GE011]
• Co-location patterns involving continuous variables [PAKDD08, ACM-GIS08]
• High-variance regions involving a continuous variable [PAKDD09]
• Regional Regression [ACM-GIS09]

11

Page 12: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Example 1: Finding Regional Co-location Patterns in Spatial Data

Objective: Find co-location regions using various clustering algorithms and novel fitness functions.

Applications:

1. Finding regions on planet Mars where shallow and deep ice are co-located, using point and raster datasets. In Figure 1, regions in red have very high co-location and regions in blue have anti co-location.

2. Finding co-location patterns involving chemical concentrations with values on the wings of their statistical distribution in Texas' ground water supply. Figure 2 indicates discovered regions and their associated chemical patterns.

Figure 1: Co-location regions involving deep and shallow ice on Mars

Figure 2: Chemical co-location patterns in Texas Water Supply

12

Page 13: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Example 2: Regional Regression

Geo-regression approaches: Multiple regression functions are used that vary depending on location.

Regional Regression:

I. Discover regions with strong relationships between dependent and independent variables

II. Construct regional regression functions for each region

III. When predicting the dependent variable of an object, use the regression function associated with the location of the object
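Steps II and III above can be sketched in a few lines of Python. This is a minimal illustration that assumes the regions have already been discovered (step I) and, for brevity, uses a single independent variable fitted by ordinary least squares; the function names and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_regional_models(samples):
    """Step II: one regression function per region.

    samples is an iterable of (region_id, x, y) triples.
    """
    by_region = {}
    for region_id, x, y in samples:
        xs, ys = by_region.setdefault(region_id, ([], []))
        xs.append(x)
        ys.append(y)
    return {r: fit_line(xs, ys) for r, (xs, ys) in by_region.items()}


def predict(models, region_id, x):
    """Step III: use the regression function of the object's region."""
    intercept, slope = models[region_id]
    return intercept + slope * x


samples = [
    ("north", 0, 1), ("north", 1, 3), ("north", 2, 5),   # y rises with x
    ("south", 0, 10), ("south", 1, 7), ("south", 2, 4),  # y falls with x
]
models = fit_regional_models(samples)
print(predict(models, "north", 3), predict(models, "south", 4))
```

The two toy regions have opposite slopes, which is exactly the situation a single global regression function cannot represent.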

13

Page 14: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Challenges for Region Discovery

1. Recall and precision with respect to the discovered regions should be high

2. Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture “what domain experts find interesting in spatial datasets”

3. Detection of regions at different levels of granularities (from very local to almost global patterns)

4. Detection of regions of arbitrary shapes

5. Necessity to cope with very large datasets

6. Regions should be properly ranked by relevance (reward)

7. Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5 and 6.

14

Page 15: Frameworks and Algorithms for  Regional Knowledge Discovery

Clustering with Plug-in Fitness Functions

In the last 5 years, my research group has developed families of clustering algorithms that find contiguous spatial clusters by maximizing a plug-in fitness function.

This work is motivated by a mismatch between evaluation measures of traditional clustering algorithms (such as cluster compactness) and what domain experts are actually looking for.

More recently, hotspot discovery techniques have additionally been developed that find interesting regions in polygonal datasets, such as zip-code-based datasets.

15

Page 16: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

3. Current Suite of Clustering Algorithms

• Representative-based: SCEC, SRIDHCR, SPAM, CLEVER
• Grid-based: SCMRG, SCHG
• Agglomerative: MOSAIC, SCAH
• Density-based: SCDE, DCONTOUR

[Figure: clustering algorithms grouped into four families: density-based, agglomerative, representative-based, grid-based]

16

Page 17: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

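Representative-based Clustering

[Figure: objects in a two-dimensional space (Attribute1, Attribute2) with labels 1 to 4]

Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives optimizes q(X) (a minimum for distance-based objectives such as the one used by K-means, a maximum for the reward-based fitness functions of the region discovery framework).

Characteristic: Clusters are formed by assigning objects to the closest representative.

Popular algorithms: K-means, K-medoids, CLEVER, …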
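The assignment step that all representative-based algorithms share can be sketched as follows. This is a minimal Python illustration using Euclidean distance; the function name and the sample points are assumptions made for the example.

```python
import math


def assign_to_representatives(objects, representatives):
    """Form one cluster per representative by nearest-representative assignment."""
    clusters = [[] for _ in representatives]
    for obj in objects:
        # Index of the representative closest to obj (Euclidean distance).
        nearest = min(
            range(len(representatives)),
            key=lambda i: math.dist(obj, representatives[i]),
        )
        clusters[nearest].append(obj)
    return clusters


objects = [(0, 0), (1, 0), (9, 9), (10, 10)]
representatives = [(0, 0), (10, 10)]
print(assign_to_representatives(objects, representatives))
```

A search procedure such as the one used by K-medoids or CLEVER would call this step repeatedly, each time with a modified set of representatives, and keep the set whose clustering scores best under q(X).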

17

Page 18: Frameworks and Algorithms for  Regional Knowledge Discovery

CLEVER [ACM-GIS08]

• Is a representative-based clustering algorithm, similar to PAM.
• Searches a variable number of clusters and uses larger neighborhood sizes to battle premature termination; it uses randomized hill climbing and adaptive sampling to reduce complexity.
• In general, new clusterings are generated in the neighborhood of the current solution by replacing, inserting, and deleting representatives.
• Searches for the optimal number of clusters.

18

Page 19: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Advantages of Grid-based Clustering Algorithms

• Fast: no distance computations
• Clustering is performed on summaries and not on individual objects; complexity is usually O(#populated-grid-cells) and not O(#objects)
• Easy to determine which clusters are neighboring
• Shapes are limited to unions of grid cells

19

Page 20: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science

Ideas SCMRG (Divisive, Multi-Resolution Grids)

Cell Processing Strategy

1. If a cell receives a reward that is larger than the sum of the rewards of its descendants: return that cell.

2. If a cell and its descendants do not receive any reward: prune.

3. Otherwise, process the children of the cell (drill down).
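The drill-down idea behind the cell processing strategy can be illustrated with a small recursive Python sketch. It is not the SCMRG implementation: it assumes that each cell is split into four equal quadrants, that a cell is compared only with its direct children (one level of look-ahead), and that recursion stops at a fixed `max_depth`. The toy interestingness measure and all names are hypothetical.

```python
def unsafe_share(points):
    """Toy interestingness (an assumption): share of label-1 points above 0.5."""
    share = sum(label for _, _, label in points) / len(points)
    return max(0.0, share - 0.5)


def reward(cell, interestingness, beta):
    """Reward of a cell: i(cell) * size(cell)**beta, or 0 for an empty cell."""
    points = cell[4]
    return interestingness(points) * len(points) ** beta if points else 0.0


def split(cell):
    """Split a cell (x0, y0, x1, y1, points) into four half-open quadrants."""
    x0, y0, x1, y1, points = cell
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    boxes = [(x0, y0, xm, ym), (xm, y0, x1, ym), (x0, ym, xm, y1), (xm, ym, x1, y1)]
    return [
        (bx0, by0, bx1, by1,
         [p for p in points if bx0 <= p[0] < bx1 and by0 <= p[1] < by1])
        for bx0, by0, bx1, by1 in boxes
    ]


def scmrg(cell, interestingness, beta=1.1, max_depth=4):
    """Return the cells reported as regions below (and including) this cell."""
    own = reward(cell, interestingness, beta)
    if max_depth == 0:
        return [cell] if own > 0 else []
    children = split(cell)
    child_sum = sum(reward(c, interestingness, beta) for c in children)
    if own > child_sum:                 # rule 1: the cell beats its children
        return [cell]
    if own == 0 and child_sum == 0:     # rule 2: no reward anywhere, prune
        return []
    regions = []                        # rule 3: drill down
    for child in children:
        regions.extend(scmrg(child, interestingness, beta, max_depth - 1))
    return regions


# Points are (x, y, label) triples inside the half-open square [0, 8) x [0, 8).
root = (0, 0, 8, 8, [
    (1, 1, 1), (2, 2, 1), (1, 3, 1), (3, 1, 1),
    (5, 5, 0), (6, 6, 0), (5, 1, 0), (1, 5, 0), (6, 2, 0), (2, 6, 0),
])
for x0, y0, x1, y1, points in scmrg(root, unsafe_share):
    print((x0, y0, x1, y1), len(points))
```

In the toy data the root cell earns no reward, but its lower-left quadrant does and outscores its own children, so that quadrant is returned while the three remaining quadrants are pruned.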

20

Page 21: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science

Code SCMRG

21

Page 22: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

4. Case Studies Regional Knowledge Extraction

4.1 Regional Regression
4.2 Regional Association Rule Mining & Scoping
4.3 Association-List Based Discrepancy Mining of User Behavior
4.4 Co-location Mining (to be skipped!)

22

Page 23: Frameworks and Algorithms for  Regional Knowledge Discovery

4.1 Regional Regression

Motivation

• 1st law of geography: “Everything is related to everything else, but nearby things are more related than distant things” (Tobler).
• Frequently, coefficient estimates in spatial datasets vary spatially.
• Question: How do we capture the regional variation of regression coefficients?

24

Page 24: Frameworks and Algorithms for  Regional Knowledge Discovery

Motivation

Other Geo-Regression Analysis Methods

Regression Trees
• Data is split in a top-down approach using a greedy algorithm
• Discovers only rectangular shapes

Geographically Weighted Regression (GWR)
• An instance-based, local spatial statistical technique used to analyze spatial non-stationarity
• Generates a separate regression equation for a set of observation points determined using a grid or kernel
• The weight assigned to each observation is based on a distance decay function centered on the observation

25

Page 25: Frameworks and Algorithms for  Regional Knowledge Discovery

Motivation

Example 1: Why Do We Need Regional Knowledge?

Regression result: a positive linear regression line (Arsenic increases with increasing Fluoride concentration).

[Figure: Arsenic (y-axis) vs. Fluoride (x-axis)]

26

Page 26: Frameworks and Algorithms for  Regional Knowledge Discovery

Motivation

Example 1: Why Do We Need Regional Knowledge?

A negative linear regression line in both locations (Arsenic decreases with increasing Fluoride concentration): a reflection of Simpson's paradox.

[Figure: Arsenic (y-axis) vs. Fluoride (x-axis), shown for Location 1 and Location 2]

27

Page 27: Frameworks and Algorithms for  Regional Knowledge Discovery

Motivation

Example 2: Houston House Price Estimate

Dependent variable: House_Price Independent variables: noOfRooms, squareFootage, yearBuilt, havePool, attachedGarage, etc..

28

Page 28: Frameworks and Algorithms for  Regional Knowledge Discovery

Example 2: Houston House Price Estimate

• Global regression (OLS) produces a single global model: the coefficient estimates, R² value, error, etc.
• This model assumes that all areas have the same coefficients.
• E.g. the attribute havePool has a coefficient of +9,000 (having a pool adds about $9,000 to a house price).
• In reality this varies, for instance between a $100K house and a $500K house, or between different zip codes or locations.
• Having a pool in a house in a luxury area (~$40K) is very different from having a pool in a house in the suburbs (~$5K).

Motivation

29

Page 29: Frameworks and Algorithms for  Regional Knowledge Discovery

Motivation

Example 2: Houston House Price Estimate

$180,000

$350,000

• Houses A and B have very similar characteristics.
• OLS produces single parameter estimates for predictor variables like noOfRooms, squareFootage, yearBuilt, etc.

31

Page 30: Frameworks and Algorithms for  Regional Knowledge Discovery

Motivation

Example 2: Houston House Price Estimate

• If we use zip codes as regions, houses A and B are in the same region.
• If we use a grid structure, they are in different regions, but some houses similar to B (lake view) are in the same region as A, and this will affect the coefficient estimates.
• More importantly, the houses around the U-shaped lake show a similar pattern and should be in the same region; otherwise we miss important information.

32

Page 31: Frameworks and Algorithms for  Regional Knowledge Discovery

Our Approach: Capture the True Pattern Structure!

We need to discover arbitrarily shaped regions, and not rely on a priori defined artificial boundaries.

Problems to be solved:
1. Find regions whose objects have a strong relationship between the dependent variable and the independent variables
2. Extract regional regression functions
3. Develop a method to select which regression function to use for a new object to be predicted

Motivation

33

Page 32: Frameworks and Algorithms for  Regional Knowledge Discovery

So, What Can We Use as Interestingness?

• The natural first candidate is adjusted R². R² is a measure of the extent to which the total variation of the dependent variable is explained by the model.
• R² alone is not a good measure of goodness of fit: it only deals with the bias of the model and ignores model complexity, which leads to overfitting.
• There are better model selection criteria that balance the tradeoff between bias and variance.

Methodology

35

Page 33: Frameworks and Algorithms for  Regional Knowledge Discovery

Fitness Function Candidates

• R²-based fitness functions
• Fitness functions that consider model complexity in addition to goodness of fit, such as AIC or BIC
• Regularization approaches that penalize large coefficients
• Fitness functions that employ validation sets, which provide a better measure of the generalization error (the model's performance on unseen examples)
• An improvement of the previous approach that additionally considers training set/test set similarity
• Combinations of the approaches mentioned above

Methodology

36

Page 34: Frameworks and Algorithms for  Regional Knowledge Discovery

R-sq Based Fitness Function

Given:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{and} \quad SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

The interestingness of a region r with n objects is:

$$i_{Rsq}(r) = \begin{cases} 1 - \dfrac{SSE}{SST} & \text{if } n \geq minRegSize \\ 0 & \text{if } n < minRegSize \end{cases}$$

To battle the tendency towards small-size regions with high correlation (false correlation), we:
• used a scaled version of the fitness function
• employed a parameter to limit the minimum size of a region

The Rsq-based fitness function then becomes:

$$q_{Rsq}(R) = \sum_{j=1}^{k} i_{Rsq}(r_j) \cdot \mathrm{size}(r_j)^{\beta}$$
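A minimal Python sketch of the R-sq based interestingness and fitness function may help to make the definitions concrete. It assumes that each region is given as a pair of observed and predicted values, that a region of exactly `min_reg_size` objects still qualifies, and that the region size is scaled by the exponent β of the region discovery framework; names and data are hypothetical.

```python
def rsq_interestingness(y, y_hat, min_reg_size):
    """1 - SSE/SST for a region, or 0 if the region is too small."""
    n = len(y)
    if n < min_reg_size:
        return 0.0
    mean_y = sum(y) / n
    sse = sum((obs - pred) ** 2 for obs, pred in zip(y, y_hat))
    sst = sum((obs - mean_y) ** 2 for obs in y)
    # A region with constant y has SST = 0; treat it as uninteresting.
    return 1.0 - sse / sst if sst > 0 else 0.0


def rsq_fitness(regions, beta, min_reg_size):
    """Sum of i_Rsq(r) * size(r)**beta over all regions r."""
    return sum(
        rsq_interestingness(y, y_hat, min_reg_size) * len(y) ** beta
        for y, y_hat in regions
    )


regions = [
    ([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5]),  # SSE = 1, SST = 5, R-sq = 0.8
    ([1, 2], [1, 2]),                      # perfect fit, but too small
]
print(rsq_fitness(regions, beta=2.0, min_reg_size=3))
```

The second region fits perfectly but contributes nothing, which is the intended effect of the minimum region size.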

37

Page 35: Frameworks and Algorithms for  Regional Knowledge Discovery

AIC Based Fitness Function (AICFitness)

We prefer Akaike's Information Criterion (AIC) because it takes model complexity (number of observations etc.) into consideration more effectively.

AIC provides a balance between bias and variance, and is estimated using the following formula:

$$AIC = 2k + n\left[\ln\left(\frac{2\pi \cdot SSE}{n}\right) + 1\right]$$

Variations of AIC are available, including AICu [McQuarrie], which is used for small datasets and is therefore a good fit for our small-size regions:

$$AIC_u = \ln\left(\frac{SSE}{n-k}\right) + \frac{n+k}{n-k-2}$$

38

Page 36: Frameworks and Algorithms for  Regional Knowledge Discovery

AIC Based Fitness Function (AICFitness)

AIC-based interestingness i_AIC(r):

$$i_{AIC}(r) = \begin{cases} \dfrac{1}{2k + n\left[\ln\left(\frac{2\pi \cdot SSE}{n}\right) + 1\right]} & \text{if } n \geq th \\[2ex] \dfrac{1}{\ln\left(\frac{SSE}{n-k}\right) + \frac{n+k}{n-k-2}} & \text{if } n < th \end{cases}$$

The AICFitness function then becomes:

$$q_{AIC}(R) = \sum_{j=1}^{k} i_{AIC}(r_j) \cdot \mathrm{size}(r_j)^{\beta}$$

The AICFitness function repeatedly applies regression analysis during the search for the optimal set of regions, which overall provides the best (minimum) AIC values.
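The AIC-based interestingness can be sketched in Python as follows. The sketch assumes that k is the number of model parameters, n the number of objects in the region, and `th` the size threshold below which the small-sample variant AICu is used; whether a region of exactly `th` objects uses AIC or AICu is an assumption made here, and all function names are hypothetical.

```python
import math


def aic(sse, n, k):
    """AIC = 2k + n * [ln(2 * pi * SSE / n) + 1]."""
    return 2 * k + n * (math.log(2 * math.pi * sse / n) + 1)


def aic_u(sse, n, k):
    """Small-sample variant: AICu = ln(SSE / (n - k)) + (n + k) / (n - k - 2)."""
    return math.log(sse / (n - k)) + (n + k) / (n - k - 2)


def aic_interestingness(sse, n, k, th):
    """Reciprocal of AIC (regions with n >= th) or AICu (smaller regions)."""
    value = aic(sse, n, k) if n >= th else aic_u(sse, n, k)
    return 1.0 / value


print(aic(10.0, 10, 2), aic_u(10.0, 10, 2))
print(aic_interestingness(10.0, 10, 2, th=30))
```

Taking the reciprocal means that maximizing the fitness function favors regions with low AIC values; the sketch does not guard against criterion values that are zero or negative, which a full implementation would have to handle.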

39

Page 37: Frameworks and Algorithms for  Regional Knowledge Discovery

Controlling Regional Granularity

β is used to control the number of regions to be discovered, and thus the overall model complexity. Finding a good value for β means striking the right balance between underfitting and overfitting for a given dataset. Small values yield a small number of regions; large values yield a large number of regions.

Reminder: Region Discovery Framework Fitness Function:

$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} i(c) \cdot \mathrm{size}(c)^{\beta}$$

40

Page 38: Frameworks and Algorithms for  Regional Knowledge Discovery

Experiments & Results

Generalization Error Improvement (SSE_TE)

• Discovered regions and their regional regression coefficients yield better predictions than the global model.
• Some regions with very high error reduce the overall accuracy, but there is still a 27% improvement (future work item).
• The relationship between variables varies spatially.

Generalization Error Results: Boston Housing Data

β      SSE_TE (GL)   SSE_TE (REG2)   SSE Improvement   % of objects with better prediction
1.1    17,182        12,566          27%               72%
1.7    17,182        14,799          26%               65%

41

Page 39: Frameworks and Algorithms for  Regional Knowledge Discovery

Experiments & Results

• Regional regression coefficients yield only slightly better predictions.
• Some of this is due to external factors, e.g. toxic waste, power plants (analyzed previously using the PCAFitness approach, MLDM09).
• Some regions with very high error reduce the overall accuracy.
• Still, around 60% of objects are better predicted.
• Open for improvement; new fitness functions.

Generalization Error Results: Arsenic Data

β      SSE_TE (GL)   SSE_TE (REG2)   SSE Improvement   % of objects with better prediction
1.1    102,578       98,879          3.6%              57%
1.25   102,578       92,200          8.01%             61%

42

Page 40: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science

4.2 A Framework for Regional Association Rule Mining and Scoping [GeoInformatica10]

Step 1: Region Discovery (example: arsenic hot spots)

Step 2: Regional Association Rule Mining (an association rule a is discovered)

Step 3: Regional Association Rule Scoping (the scope of the rule a is determined)

43

Page 41: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Arsenic Hot Spots and Cool Spots

Step 1: Region Discovery

Step 2: Regional Association Rule Mining

Step 3: Regional Association Rule Scoping

44

Page 42: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Example Regional Association Rules

Step 1: Region Discovery

Step 2: Regional Association Rule Mining

[Figure: map annotated with rule 1, rule 2, rule 3, and rule 4]

Step 3: Regional Association Rule Scoping

45

Page 43: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Region vs. Scope

• The scope of an association rule indicates how regional or global a local pattern is.
• The region where an association rule originated is a subset of the scope where the association rule holds.

46

Page 44: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Association Rule Scope Discovery Framework

Let a be an association rule and r be a region; conf(a,r) denotes the confidence of a in region r, and sup(a,r) denotes the support of a in r.

Goal: Find all regions for which an association rule a satisfies its minimum support and confidence thresholds; regions in which a's confidence and support are significantly higher than the min-support and min-conf thresholds receive higher rewards.

Association Rule Scope Discovery Methodology: For each rule a that was discovered for region r', we run our region discovery algorithm, which defines the interestingness of a region r_i with respect to an association rule a as follows:

Remarks: Typically 1=2=0.9; =2 (confidence increase is more important than support

increase) Obviously the region r’ from which rule a originated or some variation of it should

be “rediscovered” when determining the scope of a.

47

Page 45: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Regional Association Rule Scoping

Ogallala Aquifer

Gulf Coast Aquifer

48

Page 46: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Fine Tuning Confidence and Support

We can fine tune the measure of interestingness for association rule scoping by changing the minimum confidence and support thresholds.

49

Page 47: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

4.3 (Regional) Models for Internet User Behaviour

Problem: We are interested in finding spatial patterns with respect to a performance variable based on some context that is described using a set of variables.

Main Theme: We try to find factors that influence whether a user clicks on a given ad (e.g. CTR changes based on the keywords that occur in the ad / socio-economic factors / proximity to spatial objects of a particular type / ...).

Complication: Datasets are very large; most data are only available at zip-code level.

Our subtopic: As usual, we are interested in extracting knowledge concerning the "regional variation of clicking behavior".

Contributors: Ruth Miller, Chun-sheng Chen; Yahoo! collaborator: Abraham Bagherjeiran

50

Page 48: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Data Set: Yahoo! Contextual Ads

1. Data sources:
   • Keystone (contextual ads) dataset: January-March 2009
   • WOEID database, used
     – to identify the location of the user who sees the ad
     – to find the neighboring zip codes of a given zip code

2. Experiments are based on a subset of the Keystone dataset:
   1. Ads without geo-targeting tags
   2. Only the rank 1 ads
   3. Shown on the top 5 Yahoo! domains (Y!.finance | Y!.news | Y!.sports | Y!.groups | Y!.maps)
   4. Compute the CTR and conversion rate for each zip code
      • Regional CTR threshold: a zip code must have at least 1000 impressions and 100 clicks

3. Final dataset: 13,869 zip codes with their CTR and conversion rate

4. Goal: Find interesting associations of this dataset with co-location and census datasets
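The per-zip-code CTR computation with the regional threshold described above could look roughly like this in Python. The input layout (a mapping from zip code to impression and click counts), the function name, and the sample numbers are assumptions made for the illustration; they do not describe the actual Keystone data format.

```python
def regional_ctr(stats, min_impressions=1000, min_clicks=100):
    """Map zip code -> CTR, keeping only zip codes that meet both thresholds.

    stats maps a zip code to an (impressions, clicks) pair.
    """
    return {
        zip_code: clicks / impressions
        for zip_code, (impressions, clicks) in stats.items()
        if impressions >= min_impressions and clicks >= min_clicks
    }


stats = {
    "77004": (200000, 150),  # kept
    "77005": (900, 120),     # too few impressions
    "77006": (50000, 40),    # too few clicks
}
print(regional_ctr(stats))
```

Filtering on both counts keeps CTR estimates from being dominated by zip codes with very little traffic.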

50a

Page 49: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science

Data Set: Census data US Census 2000

5 Digit Zip Code

Total Population

Total Population who are White

Total Population who are African American

Total Population who are American Indian

Total Population who are Asian

Total Population who are Hawaiian or Pacific Islander

Total Population who are Some other Race

Total Population who are 2 or more Races

Percent of Total Population who are White

Percent of Total Population who are African American

Percent of Total Population who are American Indian

Percent of Total Population who are Asian

Percent of Total Population who are Hawaiian or Pacific Islander

Percent of Total Population who are Some other Race

Percent of Total Population who are 2 or more Races

Per Capita Income

Percent of Total Population with Education up to 12th grade

Percent of Total Population with Education up to Bachelors Degree

Percent of Total Population with Education up to Masters Degree

Percent of Total Population with Education up to Ph.D. or Profession Degree

Percent of Total Population with Education higher than Masters Degree

50b

Page 50: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Global Interestingness Analysis

Comparison of US zip codes to zip codes with Whole Foods Market stores:

Zip codes with Whole Foods Market stores have a lower overall CTR but higher per-person impression and click counts.

                            Global US Zip Codes   Zip Codes with Whole Foods Market
Avg. CTR                    0.000926              0.000404
Avg. Click / Person         0.218                 0.653
Avg. Impression / Person    522                   2062
Avg. Population             17844                 23398

50v

Page 51: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science

ZIPS Hotspot Discovery Algorithm

Input: an interestingness function F, a list of n initial zip regions zlist, an interestingness threshold t

Set HotspotList := empty
Set NeighborList := empty
For each region z in zlist {
    If (F(z) > t) {
        Add (neighbor zip codes of z – HotspotList) to NeighborList;
        While (size of NeighborList > 0) {
            Remove one zip code M from NeighborList;
            If (F(M+z) > t) {
                Merge M into z;
            }
            Mark M as processed and add unprocessed neighbor zip codes of M to NeighborList;
        }
        Add z to HotspotList;
    }
}
Return HotspotList;

An agglomerative growing algorithm: it starts with a seed zip code and merges neighboring zip codes if the resulting region is above an interestingness threshold.

Neighboring zip codes are obtained from a lookup table created from the WOEID database.
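The growing procedure can be turned into a short, runnable Python sketch. It deviates from the pseudocode in two labeled ways: the frontier is only extended from zip codes that were actually merged, so that a hotspot stays contiguous, and zip codes already absorbed into a hotspot are not reused as seeds. The neighbor table, the interestingness function, and all names are hypothetical.

```python
def zips_hotspots(zlist, neighbors, F, t):
    """Grow hotspots from seed zip codes.

    zlist: seed zip codes; neighbors: zip code -> iterable of adjacent zip codes;
    F: interestingness of a set of zip codes; t: interestingness threshold.
    """
    hotspots = []
    used = set()  # zip codes that already belong to a hotspot
    for seed in zlist:
        if seed in used or F({seed}) <= t:
            continue
        region = {seed}
        processed = {seed}
        frontier = [z for z in neighbors.get(seed, ()) if z not in used]
        while frontier:
            candidate = frontier.pop()
            if candidate in processed:
                continue
            processed.add(candidate)
            if F(region | {candidate}) > t:
                region.add(candidate)
                # Assumption: expand only from merged zip codes (contiguity).
                frontier.extend(
                    z for z in neighbors.get(candidate, ())
                    if z not in processed and z not in used
                )
        used |= region
        hotspots.append(region)
    return hotspots


# Toy chain of zip regions A - B - C - D with one value per region.
values = {"A": 5, "B": 4, "C": 0, "D": 6}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}


def mean_value(region):
    """Toy interestingness: mean value of the zip regions in the set."""
    return sum(values[z] for z in region) / len(region)


print(zips_hotspots(["A", "B", "C", "D"], neighbors, mean_value, t=3))
```

In the toy chain, A and B merge into one hotspot, C is rejected because it would pull the mean down to the threshold, and D forms a hotspot of its own.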

50d

Page 52: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

ZIPS Output Sample

Regions for which the Correlation between percentage of Bachelors degrees and CTR is below 0.7.

50e

Example: Negative Correlation Interestingness Hotspots between Bachelor’s degree & CTR

Page 53: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

LA Area: Negative Correlation, Income vs. CTR
Interestingness Threshold -0.8

Zip codes of interest are outlined in yellow.

50f

Page 54: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Scatter Plot of LA CTR/Income Z-Scores

[Figure: scatter plot of Income (z-score) vs. CTR (z-score); Interestingness Threshold -0.8]

50g

Page 55: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

LA Income vs. CTR

[Figure: zip codes grouped by Income and CTR]

Group                                                    Share   Zip codes
Average Income (0.006 ±0.5), Average CTR (-0.254 ±0.5)   40%     74/139
High Income, High CTR                                    4%      7/185
High Income, Low CTR                                     21%     39/185
Low Income, High CTR                                     30%     56/185
Low Income, Low CTR                                      5%      9/185

50h

Page 56: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

North East DC Area
Interestingness Threshold 0.8

Zip codes of interest are outlined in yellow.

50i

Page 57: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Scatter Plot of NE-DC Income/CTR Z-score

[Figure: scatter plot of Income (z-score) vs. CTR (z-score); Interestingness Threshold 0.8]

50j

Page 58: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

NE-DC Income vs. CTR

[Figure: zip codes grouped by Income and CTR]

Group                                                   Share   Zip codes
Average Income (-0.2 ±0.5), Average CTR (-0.53 ±0.5)    36%     9/25
High Income, High CTR                                   24%     6/25
High Income, Low CTR                                    4%      1/25
Low Income, High CTR                                    0%      0/25
Low Income, Low CTR                                     36%     9/25

50k

Page 59: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Accomplishments Yahoo! Project

"Completed" Tasks:

a. Frameworks to Analyze Spatial Associations of a Continuous Variable with Other Factors

b. Spatial Hotspot Discovery and Regional Scoping Techniques

c. Finding (Spatial) Correlation-based Associations of CTR with other Factors (mostly based on Contextual Ad Datasets)

d. Dataset Creation (mostly for task c):
   • Census-based datasets (each dataset is created for 5-digit zip codes and summarized into 3-digit zip code regions, by combining all zip codes that share the same first 3 digits, and into 2-digit zip code regions)
   • Co-location datasets
   • US zip code boundary polygons (for visualization purposes)

 

51

Page 60: Frameworks and Algorithms for  Regional Knowledge Discovery

Department of Computer Science Christoph F. Eick

Accomplishments Yahoo! Project (2)

Partially Completed Tasks:

a. Visualization Tools that Display Interestingness Hotspots

b. Analyzing Relationships between CTR and Conversions

c. Finding Co-location based Associations of CTR

d. Finding Regional and Global Patterns based on Sets of Binary Variables

Proposed and Just Started Tasks:

a. Geo-feature Creation and Evaluation

b. Mining for Promising Binary Contexts for Contiguous Variables

c. Mining the Look-a-like Modeling Datasets

d. Generalizing CLEVER for Interestingness Hotspot Discovery


5. Methodologies and Tools to Analyze Related Datasets

Subtopics:

• Disparity Analysis/Emergent Pattern Discovery (“how do two groups differ with respect to their patterns?”) [SDE10]

• Change Analysis (“what is new/different?”) [CVET09]

• Correspondence Clustering (“mining interesting relationships between two or more datasets”) [RE10]

• Meta Clustering (“cluster cluster models of multiple datasets”)

• Analyzing Relationships between Polygonal Cluster Models

Example: Analyze changes with respect to regions of high variance of earthquake depth.

Novelty change predicate: $\mathrm{Novelty}(r') = r' - (r_1 \cup \dots \cup r_k)$, where $r'$ is a region found at Time 2 and $r_1, \dots, r_k$ are the regions found at Time 1.

Figure: Emerging regions based on the novelty change predicate (panels: Time 1, Time 2).
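The novelty change predicate from the earthquake example can be illustrated with a small sketch. It assumes that regions are represented as sets of grid cells, which is a simplification chosen here for illustration (the slides refer to polygonal cluster models), and the function name `novelty` and the sample cells are hypothetical.

```python
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]  # assumed representation: a region is a set of grid cells


def novelty(new_region: Set[Cell], old_regions: Iterable[Set[Cell]]) -> Set[Cell]:
    """Return the part of `new_region` not covered by any of the old regions."""
    covered: Set[Cell] = set().union(*old_regions)
    return set(new_region) - covered


# Hypothetical regions: two found at Time 1, one found at Time 2.
time1_regions = [{(0, 0)}, {(1, 1), (2, 2)}]
time2_region = {(0, 0), (0, 1), (1, 1)}
emerging = novelty(time2_region, time1_regions)
```

With these inputs only the cell (0, 1) is new at Time 2, so it forms the emerging part of the region.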


6. Summary

1. A framework for region discovery that relies on additive, reward-based fitness functions and views region discovery as a clustering problem has been introduced.

2. Families of clustering algorithms and families of measures of interestingness are provided that form the core of the framework.

3. Evidence concerning the usefulness of the framework for regional association rule mining, correlation analysis, regional regression, and co-location mining has been presented.

4. The special challenges in designing clustering algorithms for region discovery have been identified. Current work centers on the parallel implementation of some of those algorithms.

5. The ultimate vision of this research is the development of region discovery engines that assist data analysts and scientists in finding interesting regions in spatial datasets.


Other Contributors to the Work Presented Today

Graduated PhD Students:
• Wei Ding (Regional Association Rule Mining, Grid-based Clustering)
• Rachsuda Jiamthapthaksin (Agglomerative Clustering, Multi-Run Clustering)
• Oner Ulvi Celepcikay (Regional Regression)
• Vadeerat Rinsurongkawong (Analyzing Multiple Datasets, Change Analysis)

Current PhD Students:
• Chun-sheng Chen (Density-based Clustering, Regional Knowledge Extraction)
• Ruth Miller (Dataset Creation, Models for Internet Behavior)

Graduated Master Students:
• Rachana Parmar (CLEVER, Co-location Mining)
• Seungchan Lee (Grid-based Clustering, Agglomerative Clustering)
• Dan Jiang (Density-based Clustering, Co-location Mining)
• Jing Wang (Grid-based and Representative-based Clustering)

Software Platform and Software Design:
• Abraham Bagherjeiran (PhD student UH, now at Yahoo!)

Domain Experts:
• Tomasz Stepinski (Lunar and Planetary Institute, Houston, Texas)
• J.-P. Nicot (Bureau of Economic Geology, UT, Austin)
• Michael Twa (College of Optometry, University of Houston)


CLEVER Pseudo Code

Inputs: Dataset O, k’, neighborhood-size, p, p’

Outputs: Clustering X, fitness q

Algorithm:

1. Create a current solution by randomly selecting k’ representatives from O.

2. Create p neighbors of the current solution randomly using the given neighborhood definition.

3. If the best neighbor improves the fitness q, it becomes the current solution. Go back to step 2.

4. If the fitness does not improve, the solution neighborhood is re-sampled by generating p’ more neighbors. If re-sampling does not lead to a better solution, terminate, returning the current solution; otherwise, go back to step 2, replacing the current solution by the best solution found by re-sampling.
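The CLEVER steps can be turned into a small runnable sketch. Several parts are assumptions made for illustration and are not specified on the slide: points are 2D tuples, a neighbor is produced by applying `neighborhood_size` random insert/delete/replace operations to the representative set, and `example_fitness` is a made-up additive reward that favors compact clusters. It should be read as one possible rendering of the pseudo code, not as the authors' implementation.

```python
import random
from typing import Callable, FrozenSet, List, Sequence, Tuple

Point = Tuple[float, float]
Clustering = List[List[Point]]


def assign(points: Sequence[Point], reps: FrozenSet[int]) -> Clustering:
    """Form clusters by assigning every point to its closest representative."""
    clusters = {r: [] for r in reps}
    for x, y in points:
        nearest = min(reps, key=lambda r: (points[r][0] - x) ** 2 + (points[r][1] - y) ** 2)
        clusters[nearest].append((x, y))
    return [c for c in clusters.values() if c]


def random_neighbor(reps: FrozenSet[int], n_points: int, size: int,
                    rng: random.Random) -> FrozenSet[int]:
    """Assumed neighborhood definition: `size` random insert/delete/replace operations."""
    new = set(reps)
    for _ in range(size):
        op = rng.choice(["insert", "delete", "replace"])
        non_reps = [i for i in range(n_points) if i not in new]
        if op in ("insert", "replace") and non_reps:
            if op == "replace":
                new.remove(rng.choice(sorted(new)))
            new.add(rng.choice(non_reps))
        elif op == "delete" and len(new) > 1:
            new.remove(rng.choice(sorted(new)))
    return frozenset(new)


def example_fitness(clusters: Clustering, beta: float = 1.1) -> float:
    """Hypothetical additive fitness: sum over clusters of interestingness * size**beta."""
    total = 0.0
    for c in clusters:
        cx = sum(x for x, _ in c) / len(c)
        cy = sum(y for _, y in c) / len(c)
        spread = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in c) / len(c)
        total += (1.0 / (1.0 + spread)) * len(c) ** beta
    return total


def clever(points: Sequence[Point], k_prime: int, neighborhood_size: int, p: int,
           p_prime: int, fitness: Callable[[Clustering], float],
           seed: int = 0) -> Tuple[Clustering, float]:
    """Randomized hill climbing with re-sampling; requires p >= 1 and p_prime >= 1."""
    rng = random.Random(seed)
    # Step 1: random initial set of k' representatives (stored as indices into points).
    current = frozenset(rng.sample(range(len(points)), k_prime))
    current_q = fitness(assign(points, current))
    while True:
        # Step 2 samples p neighbors; step 4 re-samples p' more if none improves.
        for sample_size in (p, p_prime):
            candidates = [random_neighbor(current, len(points), neighborhood_size, rng)
                          for _ in range(sample_size)]
            scored = [(fitness(assign(points, c)), c) for c in candidates]
            best_q, best = max(scored, key=lambda t: t[0])
            if best_q > current_q:  # Step 3: accept the best improving neighbor.
                current, current_q = best, best_q
                break
        else:
            # Neither sampling round improved the fitness: terminate.
            return assign(points, current), current_q
```

A call such as `clever(points, k_prime=2, neighborhood_size=1, p=5, p_prime=10, fitness=example_fitness)` returns the final clustering together with its fitness. Because the number of clusters can change through insert and delete operations, k’ only fixes the starting point of the search.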


An example of the WOEID neighbor lookup table

<Woeid> <zip code> <neighbor woeids>

12763511 14880 {(12763450),(12763524),(12763447),(12763456),(12763408)}

12763512 14881 {(12763460),(12762641),(12763485)}

12763513 14882 {(12762641),(12762645),(12762659),(12763485),(12762643),(12762651),(12763484)}

12763514 14883 {(12763049),(12762969),(12763519),(12763494),(12763485),(12763500),(12763023),(12762963)}

12763515 14884 {(12763465),(12763473),(12763381),(12763231)}

12763516 14885 {(12763476),(12763526),(12763508),(12764545),(12764526),(12764527),(12763490)}

12763517 14886 {(12763461),(12763449),(12763484),(12763478),(12763500),(12763501),(12763485)}

12763560 15034 {(12763620),(12763627),(12763568)}

12763561 15035 {(12763632),(12763641)}

12763563 15037 {(12763600),(12763630),(12763552),(12763584),(12763542),(12763546),(12763606),(12763568),(12763628)}

12763695 15260 {(12763696),(12763654),(12763658)}

12763696 15261 {(12763695),(12763654)}

The size of the table: 29,692 lines
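A lookup table in this layout can be loaded into a dictionary for neighbor queries. The sketch below assumes each data line has exactly the form shown in the example rows (WOEID, 5-digit zip code, brace-enclosed list of parenthesized neighbor WOEIDs); the function name `parse_neighbor_table` is hypothetical, and lines that do not match, such as the header row, are skipped.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed line layout: "<woeid> <zip code> {(<woeid>),(<woeid>),...}"
LINE_RE = re.compile(r"^\s*(\d+)\s+(\d{5})\s+\{(.*)\}\s*$")


def parse_neighbor_table(lines: Iterable[str]) -> Dict[int, Tuple[str, List[int]]]:
    """Map each WOEID to its zip code (kept as a string) and its list of neighbor WOEIDs."""
    table: Dict[int, Tuple[str, List[int]]] = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip headers, blank lines, and anything malformed
        woeid, zip_code, body = match.groups()
        neighbors = [int(n) for n in re.findall(r"\((\d+)\)", body)]
        table[int(woeid)] = (zip_code, neighbors)
    return table


rows = [
    "<Woeid> <zip code> <neighbor woeids>",
    "12763695 15260 {(12763696),(12763654),(12763658)}",
    "12763696 15261 {(12763695),(12763654)}",
]
neighbor_table = parse_neighbor_table(rows)
```

After parsing, `neighbor_table[12763695]` holds the zip code "15260" and its three neighbor WOEIDs, which is the kind of lookup a neighborhood-based region growing step would need.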
