Relational Data Mining and Web Mining
Prof. Dr. Daning Hu Department of Informatics University of Zurich Nov 20th, 2012
Outline
• Introduction: Big Data
• Relational Data Mining
• Web Mining
• Ref Book: Web Intelligence, Zhong et al.
  – http://www.amazon.com/Web-Intelligence-Ning-Zhong/dp/3540443843
Introduction: Big Data
• 90% of the data in the world today has been created in the last two years alone (IBM). Big data comes from everywhere:
  – sensors used to gather climate information,
  – posts to social media sites,
  – digital pictures and videos,
  – purchase transaction records, etc.
• In response, everyone from marketers to policymakers has begun embracing a loosely defined term for today's massive data sets and the challenges they present: Big Data.
  – Lack of efficient and effective methods
  – Big Brother
A Brief History of Big Data
• 1887-90
  – Herman Hollerith: census data (electric hole punching)
• 1935-37
  – FDR's Social Security Act
  – 26 million working Americans and 3 million employers
  – IBM, field investigators
• 1943 – 1960s
  – WWII and Cold War
  – "Colossus" Project: deciphering Nazi codes
  – 742M U.S. tax returns and 175M fingerprints -> Privacy Act
• 1990s – 2000s
  – Internet age and 9/11
  – NSA: 1.7 billion emails and phone calls daily
  – Retailers amassing information on shopping habits
  – Wal-Mart: 460 TB cache in 2004
  – Social networks proliferate
  – U.S. Open Government Initiative: data.gov
• 2012
  – ?
Introduction: Data Mining
• Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD) is the process that attempts to discover patterns in large data sets.
  – a field at the intersection of computer science and statistics
  – drawing on AI, machine learning, statistics, and database systems
• The goal is to extract information from a large data set and transform it into an understandable structure for further use.
  – Data -> Information -> Knowledge
  – Involving analysis, data preprocessing & management, model and inference considerations, complexity considerations, post-processing of discovered structures, visualization, and online updating (real-time).
Data Mining and Web Mining
• The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:
  – Collection and Selection
  – Pre-processing
  – Transformation
  – Data Mining (Analysis)
  – Interpretation/Evaluation
Data Mining Tasks
• Major Data Mining Tasks:
  – Association rule learning (dependency modeling) – searches for relationships between variables. E.g., a supermarket might gather data on customer purchasing habits.
  – Clustering – discovering groups and structures in the data that are in some way or another "similar", without using known structures.
  – Classification – generalizing known structure to apply to new data. E.g., software classifying an e-mail as "spam". (Training dataset)
  – Regression – attempts to find a function which models the data with the least error.
  – Summarization – providing a more compact representation of the data set, including visualization and report generation.
Relational Data Mining and Web Mining
• Relational Data Mining differs from regular DM in several ways:
  – Network-based Representation – it often involves large-scale relational data that can be modeled with network measures/metrics.
  – Network-based Models and Algorithms (e.g., HITS)
  – The tasks are often similar (classification, regression, etc.), but the application goal often requires analytical insights about the relations among entities in the data set.
• Web Mining
  – Collecting large-scale Web-based data or data from the Internet
  – DM analysis on the Web itself (e.g., Google's PageRank)
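To make the idea of a network-based algorithm concrete, here is a minimal sketch of PageRank by power iteration. It is a textbook simplification, not Google's production method; the toy link graph, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: D links in, but nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
top_page = max(ranks, key=ranks.get)
```

In this toy graph, page C collects links from A, B, and D, so it ends up with the highest score, while D, which nobody links to, keeps only the baseline share.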
Applications of MI for U.S. Border Safety
• Border-crossing records can be considered a stream of text (license plates) ordered by the time of crossing.
  – Mutual information (MI) can be used to identify frequent co-occurrence between a pair of vehicle crossings.
• If one vehicle in the pair has a criminal record, some inferences may be made about the second vehicle if the two cross together frequently.
• We use conditional probability to include domain heuristics in the MI formulation.
• The heuristics are derived from information recorded in multiple law-enforcement databases.
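As a rough illustration of treating crossings as a time-ordered stream, the sketch below counts how often two plates cross within a short window of each other. The record layout (minutes, plate), the 10-minute window, and the plate strings are all hypothetical; the slides do not specify how co-occurrence was actually counted.

```python
from collections import Counter


def count_cooccurrences(crossings, window_minutes=10):
    """Count single crossings and pairs that cross within `window_minutes`.

    `crossings` is a list of (minutes_since_start, plate) sorted by time.
    """
    single = Counter()
    pairs = Counter()
    for i, (t_i, plate_i) in enumerate(crossings):
        single[plate_i] += 1
        for j in range(i + 1, len(crossings)):
            t_j, plate_j = crossings[j]
            if t_j - t_i > window_minutes:
                break  # stream is sorted, so later records are even further away
            if plate_j != plate_i:
                # Sort the pair so (x, y) and (y, x) share one key.
                pairs[tuple(sorted((plate_i, plate_j)))] += 1
    return single, pairs


# Made-up stream: AAA111 and BBB222 cross close together twice.
stream = [
    (0, "AAA111"), (3, "BBB222"), (50, "CCC333"),
    (120, "AAA111"), (125, "BBB222"), (400, "CCC333"),
]
single_counts, pair_counts = count_cooccurrences(stream)
```

The resulting counts are the raw material for a co-occurrence measure such as MI; vehicles that merely cross often, but never together, produce no pair counts.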
Case Study: Association Rule Mining in COPLINK
• The COPLINK dataset contains data from multiple law enforcement agencies from 1990 to 2006.
• 3 million incident reports
• Personal and sociological information on the individuals involved (age, ethnicity, etc.)
• Time information: when two individuals co-offend
• Sources: Tucson Police Department (TPD), Pima County Sheriff's Department (PCSD), and Customs and Border Protection (CBP; six ports between AZ and Mexico)
• An integrated criminal dataset
• 1.44 million criminals
• 662,000 vehicles

                   TPD           PCSD          CBP
Number of People   662,527       640,733       17.6 M records (2.6 M vehicles)
Time Span          1990 - 2006   1990 - 2006   2004 - 2006
Table 1. Summary of the COPLINK vehicle dataset
Association Rule Mining
• Inferring associations between items in a database was motivated by decision support problems faced by retail organizations (Stonebraker 1993).
• An association rule (AR) is a relationship of the form A ⇒ B.
  – A is the antecedent item-set and B is the consequent item-set.
  – The antecedent and consequent item-sets can contain multiple items.
• A ⇒ B holds in a transaction set D with
  – confidence c if c% of the transactions in D that contain A also contain B,
  – support s if s% of the transactions in D contain both A and B.
• Association mining identifies all the rules whose support and confidence exceed user-specified thresholds.
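The support and confidence definitions above can be checked on a small example. This is a minimal sketch that returns fractions rather than percentages; the shopping baskets are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / n_a if n_a else 0.0


# Hypothetical transaction set D with four baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(baskets, {"bread", "milk"})        # 2 of 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})   # 2 of the 3 bread baskets
```

For the rule bread ⇒ milk, two of the four baskets contain both items (support 0.5), and two of the three baskets with bread also contain milk (confidence about 0.67). A miner would keep the rule only if both values clear the user's thresholds.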
Mutual Information
• Mutual information is an information-theoretic measure that can be used to identify interesting co-occurrences of objects.
• It can be considered a special case of AR mining with 1-item antecedent and consequent item-sets.
• The earliest definitions of MI were given by Claude Shannon et al. (1949) and Fano (1961) as the amount of information provided by the occurrence of an event (y) about the occurrence of another event (x):

$$I(x; y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

• Intuitively, this concept measures whether the co-occurrence of x and y, P(x, y), is more likely than would be expected from their separate occurrences, P(x)·P(y).
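A short sketch of this measure, computed from raw counts, may help. Estimating each probability as a count divided by the total number of observations is an assumption made here for illustration; the slides do not state which estimator was used.

```python
import math


def pointwise_mi(n_xy, n_x, n_y, n_total):
    """log2( P(x, y) / (P(x) * P(y)) ), with probabilities estimated as
    simple relative frequencies (an assumption of this sketch)."""
    if n_xy == 0:
        return float("-inf")  # x and y never co-occur
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))


# x and y each occur twice in 8 observations and always together.
strong = pointwise_mi(n_xy=2, n_x=2, n_y=2, n_total=8)

# Co-occurrence exactly as frequent as independence would predict.
neutral = pointwise_mi(n_xy=1, n_x=2, n_y=4, n_total=8)
```

A positive score means the pair co-occurs more often than independence would predict, zero means no evidence of association, and a negative score means the pair co-occurs less often than chance.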
Research Design (cont.)
[Figure: research design flowchart. Inputs: border crossing data (six ports), law enforcement data (TPD, PCSD), and border wait times (web-spider, Internet Archive). Steps: overlap (criminal vehicles with crossings), splitting into training data (2/3) and testing data (1/3), heuristic calculation, MIW/MIC scores, potential target vehicles, and evaluation against a subset of narcotics vehicles (Set A, Set B).]
Research design and process explained in the following slides
Estimating Border Wait Times
[Figure: an aerial photograph of a typical U.S. port of entry (southern border), labeled with vehicle lanes, turn-out points, and the port of entry (check points). © 2006 Google – Imagery © 2006 DigitalGlobe, Map data © 2006 NAVTEQ]
• Vehicle lanes are backed up with dozens of vehicles during peak times.
• Criminal vehicles operate in groups.
  – If one is caught, the others turn back into Mexico.
• They may join the lines one at a time or use turn-out points.
• Thus, the time interval between two related vehicles is likely to be less than or equal to the waiting time if the second vehicle does not join the line until the first vehicle goes through.
• This needs to be taken into consideration in the calculation of MI.
Estimating Border Wait Times
• CBP publishes hourly border wait times (BWT) on its website.
  – The information is posted only for the current day.
  – No publicly available archive is maintained.
  – A web-spider was used to systematically download the web-page every hour over several days in April 2006.
  – However, the average waiting times thus obtained cannot be generalized to the entire year.
• The Internet Archive (IA) contained snapshots of the BWT web-page from April 10, 2004 to March 31, 2005.
  – These snapshots provide waiting time statistics for various days over many months in 2004 and 2005.
  – The statistics from the spidering process and IA were then used to calculate average waiting times for each port on an hourly basis, which were used in MIW.
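The hourly averaging step can be sketched as a simple group-and-average over scraped observations. The record layout (port, hour of day, wait in minutes), the port names, and the sample values are hypothetical; the actual page format and parsing used in the study are not shown in the slides.

```python
from collections import defaultdict


def hourly_average_wait(observations):
    """Average wait (minutes) per (port, hour_of_day).

    `observations` is an iterable of (port, hour_of_day, wait_minutes),
    e.g. merged from the spidered pages and the archived snapshots.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of waits, count]
    for port, hour, wait in observations:
        bucket = totals[(port, hour)]
        bucket[0] += wait
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up samples for two hypothetical ports.
samples = [
    ("Port A", 8, 30), ("Port A", 8, 50),
    ("Port A", 20, 10), ("Port B", 8, 15),
]
avg_wait = hourly_average_wait(samples)
```

A lookup such as `avg_wait[("Port A", 8)]` then gives the typical wait for that port and hour, which could serve as the time window when deciding whether two crossings count as being in line together.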
Temporal Patterns of Border Crossings
• The figure suggests that a large number (≈50%) of crossings with police contacts happen after dark.
• MIW uses this information to assign more weight to time periods with more criminal crossings.
• Figure (a) shows the percentage of all crossings over six time periods of the day.
  – 23% of all crossings take place between 8pm-Midnight.
• Figure (b) shows the percentage of all crossings by vehicles with police contacts over the six time periods.
  – 27% of crossings by vehicles with police contacts happen between 8pm-Midnight.

Time period     (a) All crossings   (b) Crossings by vehicles with police contacts
Midnight-5am    12%                 15%
5am-10am        10%                 10%
10am-2pm        20%                 14%
2pm-4pm         13%                 10%
4pm-8pm         22%                 24%
8pm-Midnight    23%                 27%
[Figure: pie charts (a) and (b) of the values above, with the night and day periods marked.]
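One way to turn these percentages into time-period weights is sketched below. The weighting rule used here, the share of police-contact crossings divided by the share of all crossings, is purely an assumption for illustration; the slides do not give the formula MIW actually uses. The percentages themselves are the ones reported for figures (a) and (b).

```python
# Percentages of crossings per time period, as reported for the two figures.
ALL_CROSSINGS = {
    "Midnight-5am": 12, "5am-10am": 10, "10am-2pm": 20,
    "2pm-4pm": 13, "4pm-8pm": 22, "8pm-Midnight": 23,
}
POLICE_CONTACT_CROSSINGS = {
    "Midnight-5am": 15, "5am-10am": 10, "10am-2pm": 14,
    "2pm-4pm": 10, "4pm-8pm": 24, "8pm-Midnight": 27,
}


def period_weights(all_share, contact_share):
    """Hypothetical weight: how over-represented police-contact crossings
    are in each period relative to all crossings (1.0 means neutral)."""
    return {period: contact_share[period] / all_share[period]
            for period in all_share}


weights = period_weights(ALL_CROSSINGS, POLICE_CONTACT_CROSSINGS)
heaviest = max(weights, key=weights.get)
```

Under this assumed rule, night-time periods receive weights above 1 and midday periods fall below 1, which matches the stated intent of giving more weight to periods with more criminal crossings.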
Comparative Evaluation (cont.)
• For hypothesis testing, thirty data points (ranging from the top 5 to the top 3500 pairs) were taken for each of the measures, and a t-test was done for the differences in the mean number of potentially criminal vehicles identified.
• It was found that MIW performed significantly better (at the 99% level) than MIC in all but one dataset in identifying potentially criminal vehicles.
  – The hypothesis on MIW performing better was partially supported.

Dataset               MIW - MIC
TPD dataset           0.2194
PCSD dataset          0.0001*
Tucson met. dataset   0.0009*
* significant at the 99% level
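The comparison can be sketched as a t statistic on per-cutoff differences. This assumes a paired design (both measures evaluated at the same top-k cutoffs), which the slides do not confirm, and the counts below are invented purely to show the arithmetic.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for the mean of paired
    differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical numbers of potentially criminal vehicles found by each
# measure at four top-k cutoffs (the study used thirty data points).
miw_hits = [3, 6, 9, 13]
mic_hits = [2, 4, 6, 9]
t_value, dof = paired_t_statistic(miw_hits, mic_hits)
```

The t value would then be compared against the t distribution with `dof` degrees of freedom to obtain a significance level.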
Case 1: Vehicle Pair Identified by MIW
• This figure shows the crossing patterns of a pair of vehicles with a high MIW score.
• Vehicle C from Arizona and its occupant were arrested in Tucson for the sale of narcotics.
• Vehicle C crossed 7 times in a one-month period and crossed within a few minutes of Vehicle D.
• The crossings may be considered suspicious since they are almost always after dark and do not fit a standard work schedule.
[Figure: time of day of crossings by Vehicle C and Vehicle D on Jan 15, Jan 25, Jan 26, Jan 29, Feb 6, Feb 7, and Feb 14; annotated "After dark/No fixed work schedule".]
Criminal Activity of Vehicle C & D
[Figure: link diagram connecting Customs and Border Protection data (frequent crossers at night; crossing charts for Vehicles A and B and for Vehicles C and D, linked by MIW) to the Tucson met. area narcotics network and criminal network.]
• Vehicle C was found to have strong connections to a narcotics network in the Tucson metropolitan area. It had links to other people and vehicles tied to arrests or suspicion for narcotics sales and possession in the region.
• Vehicle D was also involved in criminal activity in the Tucson region.
• MIW identified many other such strong cases.
A Suspect Vehicle Triple Identified
• MIW scores were calculated between Vehicle F and other crossing vehicles, and a promising transitive association with Vehicle G was found. Vehicle G had crossed 3 times within minutes of Vehicle F over a 12-day period.
[Figure: time of day of crossings by Vehicles E, F, and G on Sep 6, Sep 11, Sep 17, Sep 18, Sep 25, Oct 4, and Oct 5 (2005); annotated "After dark".]
• This figure shows the crossing patterns of a vehicle triple that was identified by the transitive use of MIW with support constraints.
• Vehicle F crossed 7 times in a one-month period, 5 of them within a few minutes of Vehicle E.
• Vehicle E was also found to have been involved in multiple narcotics crimes in the Tucson region in recent times.
Crime Involvement of Vehicles E and G
• Vehicle E was involved in narcotics crimes, and Vehicle G was found to be involved in suspicious activity and forgery.
• Because the procedure used MIW, the result indicates that the vehicles may have been waiting in line simultaneously at the same port of entry.
• This example illustrates that the transitive use of MIW shows promise in identifying potentially criminal vehicles.
[Figure: link diagram connecting Customs and Border Protection crossing charts, linked by MIW, for Vehicles E, F, and G to Tucson met. area narcotics crimes and Tucson met. area crimes.]
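The transitive step (E to F, then F to G) can be sketched as a two-hop search over scored pairs with a minimum support. The co-crossing counts for E–F (5) and F–G (3) come from the slides; the scores, the thresholds, and the extra vehicle H are hypothetical, and the study's actual procedure may differ.

```python
def transitive_triples(pair_stats, seed, min_score, min_support):
    """Find (seed, middle, far) triples where both hops pass the thresholds.

    `pair_stats` maps an unordered pair (u, v) to (score, co_crossings).
    """
    neighbors = {}
    for (u, v), (score, count) in pair_stats.items():
        if score >= min_score and count >= min_support:
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    triples = []
    for middle in sorted(neighbors.get(seed, ())):
        for far in sorted(neighbors.get(middle, ())):
            if far != seed:  # do not walk straight back to the seed
                triples.append((seed, middle, far))
    return triples


# Scores are invented; counts for E-F and F-G follow the slides.
pair_stats = {
    ("E", "F"): (6.1, 5),
    ("F", "G"): (5.4, 3),
    ("E", "H"): (1.2, 1),  # hypothetical weak pair, filtered out
}
triples = transitive_triples(pair_stats, seed="E", min_score=4.0, min_support=3)
```

Starting from a vehicle with a known record (E), the search surfaces G even though E and G were never directly paired, which is the kind of indirect association the case study describes.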