
Relational Data Mining and Web Mining

Prof. Dr. Daning Hu Department of Informatics University of Zurich Nov 20th, 2012


Outline

• Introduction: Big Data
• Relational Data Mining
• Web Mining
• Ref Book: Web Intelligence, Zhong et al.
  – http://www.amazon.com/Web-Intelligence-Ning-Zhong/dp/3540443843

Introduction: Big Data

• 90% of the data in the world today has been created in the last two years alone (IBM). Big data comes from everywhere:
  – sensors used to gather climate information,
  – posts to social media sites,
  – digital pictures and videos,
  – purchase transaction records, etc.
• In response, everyone from marketers to policymakers has begun embracing a loosely defined term for today's massive data sets and the challenges they present: Big Data.
  – Lack of efficient and effective methods
  – Big Brother

A Brief History of Big Data

1887-90
• Herman Hollerith census data (electric hole punching)

1935-37
• FDR's Social Security Act
• 26 million working Americans and 3 million employers
• IBM, field investigators

1943 – 1960s
• WWII and Cold War
• "Colossus" Project: Deciphering Nazi Codes
• 742M U.S. tax returns and 175M fingerprints -> Privacy Act

1990s – 2000s
• Internet Age and 9/11
• NSA: 1.7 billion emails, phone calls, daily
• Retailers amassing information on shopping habits
• Wal-Mart: 460 TB cache in 2004
• Social networks proliferate
• U.S. Open Government Initiative: data.gov

2012
• ?

Introduction: Data Mining

• Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD) is the process that attempts to discover patterns in large data sets.
  – a field at the intersection of computer science and statistics
  – AI, machine learning, statistics, and database systems
• The goal is to extract information from a large data set and transform it into an understandable structure for further use.
  – Data -> Information -> Knowledge
  – Involving analysis, data preprocessing & management, model and inference considerations, complexity considerations, post-processing of discovered structures, visualization, and online updating (real-time).


Data Mining and Web Mining

• The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:
  – Collection and Selection
  – Pre-processing
  – Transformation
  – Data Mining (Analysis)
  – Interpretation/Evaluation



Data Mining Tasks

• Major Data Mining Tasks:
  – Association rule learning (dependency modeling): searches for relationships between variables. E.g., a supermarket might gather data on customer purchasing habits.
  – Clustering: discovering groups and structures in the data that are in some way or another "similar", without using known structures.
  – Classification: generalizing known structure to apply to new data. E.g., software classifying an e-mail as "spam". (Training dataset)
  – Regression: attempts to find a function which models the data with the least error.
  – Summarization: providing a more compact representation of the data set, including visualization and report generation.


Relational Data Mining and Web Mining

• Relational Data Mining differs from regular DM in several ways:
  – Network-based representation: often involves large-scale relational data and can be modeled with network measures/metrics.
  – Network-based models and algorithms (HITS)
  – The tasks are often similar: classification, regression, etc. But the application goal often requires analytical insights about the relations among entities in the data set.
• Web Mining
  – Collecting large-scale Web-based data or data from the Internet
  – DM analysis of the Web itself (e.g., Google's PageRank)
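To make the idea of a network-based algorithm concrete, here is a minimal sketch of HITS hub and authority scoring on a toy link graph. The function name `hits`, the fixed iteration count, and the four-node graph are illustrative assumptions, not taken from the slides.

```python
def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed graph.

    edges: iterable of (source, target) pairs, e.g. hyperlinks between pages.
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all nodes linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub update: sum the authority scores of all nodes linked to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: "a" and "b" link to "c"; "a" and "c" link to "d".
toy_edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
hub, auth = hits(toy_edges)
print(sorted(auth.items()))
print(sorted(hub.items()))
```

In this toy graph, node "a" receives no links and so has zero authority, while it is the strongest hub because it points at both well-linked nodes.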

Applications of MI for U.S. Border Safety

• Border-crossing records can be considered as a stream of text (license plates) ordered by the time of crossing.
  – Mutual information (MI) can be used to identify frequent co-occurrence between a pair of vehicle crossings.
• If one vehicle in the pair has a criminal record, some inferences may be made about the second vehicle if they cross together frequently.
• We use conditional probability to include domain heuristics in the MI formulation.
• The heuristics are derived from information recorded in multiple law-enforcement databases.
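The co-occurrence counting that feeds such a measure can be sketched as follows. This is only an illustration: the fixed 10-minute window, the function name `count_co_crossings`, and the license plates are assumptions, and the domain heuristics mentioned above are not modeled.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_co_crossings(crossings, window=timedelta(minutes=10)):
    """Count how often each pair of plates crosses within `window` of each other.

    crossings: list of (timestamp, plate) tuples in any order.
    """
    ordered = sorted(crossings)
    pair_counts = Counter()
    start = 0
    for i, (t_i, plate_i) in enumerate(ordered):
        # Slide the window start forward until it is within `window` of t_i.
        while t_i - ordered[start][0] > window:
            start += 1
        for _, plate_j in ordered[start:i]:
            if plate_j != plate_i:
                pair_counts[tuple(sorted((plate_i, plate_j)))] += 1
    return pair_counts


# Hypothetical crossing stream (plates and times are made up).
stream = [
    (datetime(2006, 1, 15, 21, 0), "AAA111"),
    (datetime(2006, 1, 15, 21, 4), "BBB222"),
    (datetime(2006, 1, 25, 9, 0), "CCC333"),
    (datetime(2006, 1, 25, 22, 30), "AAA111"),
    (datetime(2006, 1, 25, 22, 33), "BBB222"),
]
pairs = count_co_crossings(stream)
print(pairs)
```

Sorting the stream once and sliding a window over it avoids comparing every crossing with every other crossing, which matters for a stream with millions of records.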

Case Study: Association Rule Mining in COPLINK

• The COPLINK dataset contains data from multiple law enforcement agencies from 1990 - 2006
  – 3 million incident reports
  – Personal and sociological information on the individuals involved (age, ethnicity, etc.)
  – Time information: when two individuals co-offend
  – TPD, PCSD, CBP (six ports between AZ and Mexico)
• An integrated criminal dataset
  – 1.44 million criminals
  – 662,000 vehicles


                   TPD           PCSD          CBP
Number of People   662,527       640,733       17.6 M records (2.6 M vehicles)
Time Span          1990 - 2006   1990 - 2006   2004 - 2006

Table 1. Summary of the COPLINK vehicle dataset

Association Rule Mining

• Inferring associations between items in the database was motivated by decision support problems faced by retail organizations (Stonebraker 1993).
• An association rule (AR) is a relationship of the form A → B
  – A is the antecedent item-set and B is the consequent item-set.
  – The antecedent and consequent item-sets can contain multiple items.
• A → B holds in a transaction set D with
  – confidence 'c' if c% of transactions in D that contain A also contain B,
  – support 's' if s% of transactions in D contain both A and B.
• Association mining identifies all the rules that have support and confidence greater than user-specified thresholds.
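The support and confidence definitions above can be checked on a small example. The sketch below is illustrative only: the function name `support_confidence` and the grocery baskets are made up, and the values are returned as fractions rather than percentages.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    transactions: list of item collections; the rule sides are item collections too.
    """
    a, b = set(antecedent), set(consequent)
    n_a = sum(1 for t in transactions if a <= set(t))          # contain A
    n_ab = sum(1 for t in transactions if (a | b) <= set(t))   # contain A and B
    support = n_ab / len(transactions)
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


# Hypothetical transaction set D with four baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence = support_confidence(baskets, {"bread"}, {"milk"})
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Here two of the four baskets contain both bread and milk (support 0.5), and two of the three baskets with bread also contain milk (confidence 2/3). A rule miner would keep the rule only if both values exceed the user-specified thresholds.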

Mutual Information

• Mutual information is an information-theoretic measure that can be used to identify interesting co-occurrences of objects.
• It can be considered a subset of AR mining with 1-item antecedent and consequent item-sets.
• The earliest definitions of MI were given by Shannon et al. (1949) and Fano (1961) as the amount of information provided by the occurrence of an event (y) about the occurrence of another event (x):

$$I(x; y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

• Intuitively, this concept measures whether the co-occurrence of x and y, P(x, y), is more likely than their independent occurrences, P(x) · P(y).
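As a worked illustration of this definition, the sketch below estimates the probabilities from raw counts and evaluates the base-2 logarithm of P(x, y) / (P(x) P(y)). The function name `mutual_information` and the counts are hypothetical.

```python
import math


def mutual_information(n_xy, n_x, n_y, n_total):
    """I(x; y) = log2(P(x, y) / (P(x) * P(y))), with probabilities from counts."""
    if n_xy == 0:
        return float("-inf")  # x and y never co-occur
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: x occurs 10 times and y 20 times in 1000 observations,
# and they co-occur 5 times, far more often than independence would predict.
mi_related = mutual_information(n_xy=5, n_x=10, n_y=20, n_total=1000)

# Counts consistent with independence: P(x, y) equals P(x) * P(y).
mi_independent = mutual_information(n_xy=10, n_x=100, n_y=100, n_total=1000)

print(mi_related, mi_independent)
```

The first pair scores log2(25), roughly 4.64, because the joint probability is 25 times what independence would give; the second pair scores approximately zero.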


Research Design (cont.)

[Figure: research design flowchart. Box and arrow labels: Border Crossing Data (six ports); Law Enforcement Data* (TPD, PCSD); Overlap; Criminal Vehicles with Crossings; Splitting (2/3, 1/3); Training Data; Testing Data; Web-Spider; Internet Archive; Border Wait Times; Heuristic Calculation; MIW/MIC; Scores; Potential Target Vehicles; Subset; Narcotics Vehicles; Set A; Set B; Evaluation.]

Research design and process explained in the following slides

Estimating Border Wait Times

• An aerial photograph of a typical U.S. port of entry (southern border).
• Vehicle lanes are backed up with dozens of vehicles during peak times.
• Criminal vehicles operate in groups.
  – If one is caught, others turn back into Mexico.
• They may join the lines one at a time or use turn-out points.

[Figure: aerial photograph of a port of entry, with labels for the vehicle lanes, the turn-out points, and the port of entry (check points). © 2006 Google – Imagery © 2006 DigitalGlobe, Map data © 2006 NAVTEQ]

• Thus, the time interval between two related vehicles is likely to be less than or equal to the waiting time if the second vehicle doesn't join the line until the first vehicle goes through.
• This needs to be taken into consideration in the calculation of MI.

Estimating Border Wait Times

• CBP publishes hourly wait times on its website (BWT).
  – The information is posted only for the current day
  – No publicly available archive is maintained
  – A web-spider was used to systematically download the web-page for every hour over several days in April 2006
  – However, the average waiting times thus obtained cannot be generalized to the entire year
• The Internet Archive (IA) contained snapshots of the BWT web-page from April 10, 2004 to March 31, 2005.
  – Obtain waiting time statistics for various days over many months in 2004 and 2005
  – The statistics from the spidering process and IA were then used to calculate average waiting times for each port on an hourly basis and used in MIW.
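The per-port hourly averaging step might look like the sketch below. The record layout `(port, hour_of_day, wait_minutes)`, the port names, and all wait values are assumptions for illustration; the slides do not specify how the scraped pages were parsed or stored.

```python
from collections import defaultdict
from statistics import mean


def hourly_average_wait(observations):
    """Average wait per (port, hour of day).

    observations: iterable of (port, hour_of_day, wait_minutes) tuples.
    """
    grouped = defaultdict(list)
    for port, hour, wait in observations:
        grouped[(port, hour)].append(wait)
    return {key: mean(waits) for key, waits in grouped.items()}


# Hypothetical observations from the two sources (values are made up).
spider_obs = [("port_1", 8, 20), ("port_1", 20, 45), ("port_2", 8, 10)]
archive_obs = [("port_1", 8, 40), ("port_1", 20, 55), ("port_2", 8, 30)]

avg_wait = hourly_average_wait(spider_obs + archive_obs)
print(avg_wait[("port_1", 8)], avg_wait[("port_1", 20)])
```

Pooling both sources before averaging is one simple choice; weighting the sources differently would be an equally plausible design.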


Temporal Patterns of Border Crossings

• The figure suggests that a large number (≈50%) of crossings with police contacts happen after dark.
• MIW uses this information to assign more weight to time periods with more criminal crossings.
• Figure (a) shows the percentage of all crossings over six time periods of the day.
  – 23% of all crossings take place between 8pm-Midnight.
• Figure (b) shows the percentage of all crossings by vehicles with police contacts over the six time periods.
  – 27% of crossings by vehicles with police contacts happen between 8pm-Midnight.

Time period     (a) All crossings   (b) Crossings by vehicles with police contacts
Midnight-5am    12%                 15%
5am-10am        10%                 10%
10am-2pm        20%                 14%
2pm-4pm         13%                 10%
4pm-8pm         22%                 24%
8pm-Midnight    23%                 27%

[Figure: pie charts (a) and (b), each grouped into Day and Night segments.]
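The slides do not give the weighting formula used in MIW. One hypothetical way to turn the two distributions from the figure into per-period weights is to divide each period's share of police-contact crossings by its share of all crossings, so that periods over-represented among police-contact crossings get a weight above 1. The sketch below uses the percentages from panels (a) and (b); the ratio itself is an assumption, not the authors' method.

```python
# Percentages read from the figure: (a) all crossings, (b) crossings by
# vehicles with police contacts, over the same six time periods.
all_crossings = {
    "Midnight-5am": 12, "5am-10am": 10, "10am-2pm": 20,
    "2pm-4pm": 13, "4pm-8pm": 22, "8pm-Midnight": 23,
}
police_contact_crossings = {
    "Midnight-5am": 15, "5am-10am": 10, "10am-2pm": 14,
    "2pm-4pm": 10, "4pm-8pm": 24, "8pm-Midnight": 27,
}

# Hypothetical weight: relative over-representation of police-contact crossings.
weights = {
    period: police_contact_crossings[period] / all_crossings[period]
    for period in all_crossings
}

for period, weight in weights.items():
    print(f"{period:>13}: {weight:.2f}")
```

Under this assumed ratio, the night-time periods receive weights above 1 and the midday periods weights below 1, which matches the direction described in the bullets above.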

Comparative Evaluation (cont.)

• For hypothesis testing, thirty data points (ranging from top 5 to 3500 pairs) were taken for each of the measures and a t-test was done for the differences in the mean number of potentially criminal vehicles identified.
• It was found that MIW performed significantly better (at the 99% level) than MIC in all but one dataset in identifying potentially criminal vehicles.
  – The hypothesis on MIW performing better was partially supported.

Dataset               MIW - MIC
TPD dataset           0.2194
PCSD dataset          0.0001*
Tucson met. dataset   0.0009*

(* significant at the 99% level)
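The slides do not say which t-test variant was used. Assuming a paired test (both measures evaluated at the same cut-offs), the test statistic could be computed as in the sketch below. The function name `paired_t_statistic` and the five data points are made up for illustration and are not the study's data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for the paired differences a - b."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical counts of vehicles identified by each measure at five cut-offs.
miw_counts = [3, 5, 8, 12, 15]
mic_counts = [2, 4, 6, 9, 11]

t_stat, dof = paired_t_statistic(miw_counts, mic_counts)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The resulting statistic would then be compared against the t distribution with the given degrees of freedom to obtain a p-value like those in the table above.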

Case 1: Vehicle Pair Identified by MIW

• This figure shows the crossing patterns of a pair of vehicles with a high MIW score.
• Vehicle C from Arizona and its occupant were arrested in Tucson for the sale of narcotics.
• Vehicle C crossed 7 times in a one-month period and crossed within a few minutes of Vehicle D.
• The crossings may be considered suspicious since they are almost always after dark and do not fit a standard work schedule.

[Figure: time of day of each crossing (axis 0 to 2000) for Vehicle C and Vehicle D on Jan 15, Jan 25, Jan 26, Jan 29, Feb 6, Feb 7 and Feb 14. Annotation: After dark/No fixed work schedule.]

Criminal Activity of Vehicle C & D

[Figure: link chart connecting the border crossings of Vehicles C and D with their criminal records. Labels: Tucson met. area – Narcotics Network; Customs and Border Protection; Frequent Crossers at Night; MIW; Tucson met. area Criminal Network.]

• Vehicle C was found to have strong connections to a narcotics network in the Tucson metropolitan area. It had links to other people and vehicles that had been arrested for or suspected of narcotics sales and possession in the region.
• Vehicle D was also involved in criminal activity in the Tucson region.
• MIW identified many other such strong cases.

A Suspect Vehicle Triple Identified

• MIW scores were calculated between Vehicle F and other crossing vehicles, and a promising transitive association with Vehicle G was found. Vehicle G had crossed 3 times within minutes of Vehicle F over a 12-day period.

[Figure: time of day of each crossing (axis 0 to 2000) for Vehicle E, Vehicle F and Vehicle G on Sep 6, Sep 11, Sep 17, Sep 18, Sep 25, Oct 4 and Oct 5 (dates in 2005). Annotation: After dark.]

•  This figure shows the crossing patterns of a vehicle triple that was identified by the transitive use of MIW with support constraints.
•  Vehicle F crossed 7 times in a one-month period, out of which it crossed 5 times within a few minutes of Vehicle E.
•  It was also found that Vehicle E had recently been involved in multiple narcotics crimes in the Tucson region.

Crime Involvement of Vehicles E and G

• Vehicle E was involved in narcotics crimes and Vehicle G was found to be involved in suspicious activity and forgery.
• Since the procedure used MIW, it indicates that the vehicles may have been simultaneously waiting in line at the same port of entry.
• This example shows that the transitive use of MIW is promising for identifying potentially criminal vehicles.
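The transitive idea can be sketched as a two-hop search over strongly associated pairs: start from a vehicle with a criminal record, follow a high-scoring association to a second vehicle, then follow another from there to a third. The function name `transitive_candidates`, the threshold, and the scores below are hypothetical, and the support constraints mentioned earlier are not modeled.

```python
def transitive_candidates(scores, seed, threshold):
    """Find (seed, middle, far) triples linked by two strong associations.

    scores: dict mapping frozenset({u, v}) to an association score.
    """
    # Keep only pairs whose score meets the threshold, as an adjacency map.
    neighbors = {}
    for pair, score in scores.items():
        if score >= threshold:
            u, v = tuple(pair)
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    triples = []
    for middle in sorted(neighbors.get(seed, ())):
        for far in sorted(neighbors.get(middle, ())):
            if far != seed:
                triples.append((seed, middle, far))
    return triples


# Hypothetical pairwise scores (values are made up for illustration).
pair_scores = {
    frozenset({"E", "F"}): 6.1,
    frozenset({"F", "G"}): 5.4,
    frozenset({"E", "H"}): 1.2,
}
triples = transitive_candidates(pair_scores, seed="E", threshold=5.0)
print(triples)
```

With these made-up scores, the weak E-H pair is filtered out and the only candidate triple is E, F, G.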

[Figure: link chart connecting Vehicles E, F and G through MIW associations, with small crossing charts (time of day, axis 0 to 2000, against crossing dates) on the links. Labels: Tucson met. area Narcotics Crimes; Customs and Border Protection; Tucson met. area Crimes.]
