Data Mining - Massey University

Introduction to Data Mining

Instructor: Dr. Chris Volinsky


Class Structure

• Class structure
  – 9AM - 11AM Lecture
  – 11AM - 12PM Computer Lab
  – 12PM Lunch
  – 1PM - 3PM Lecture
  – 3PM - 4PM Computer Lab
  – 4PM - 5PM Recap and discussion

• Exams / Grades based on
  – 40% data analysis project
  – 30% technical paper presentation
  – 30% pop quizzes and exams
    • given at the beginning of each class
    • questions about broad concepts


Class Schedule

• 24-26 July: Block 1
• 4-6 September: Block 2
• 6 September: Presentation of technical paper and data analysis proposal to class
• 19 September: Data analysis project due (via email)
• I will be available intermittently from 1-24 August


Course Objectives

• Direct Objectives:
  – To learn data mining techniques
  – To see their use in real-world/research applications
  – To understand limitations of standard statistical techniques in data mining applications
  – To get an understanding of the methodological principles behind data mining
  – To be able to read and understand methodological work in scholarly journal papers
  – To implement & use data mining models using statistical software (specifically R)


Project #1: Data Analysis Project

• The goal of data mining is to find interesting patterns in data. You will be required to:
  – Define a scientific question of interest
  – Collect a data set (probably online)
  – Prepare the data set properly
  – Analyze the data using appropriate models
  – Write a 5-10 page report on your analysis (graphics included)

• Project proposals (1/2 to 1 page) will be due at the beginning of the second block.

• Present proposal (5 minutes) to class on 6 September.

• Finished reports will be due 19 September.


Project #2: Scientific Paper Presentation

• Select a technical paper about a data mining technique (list provided on web site).
  – Read and understand the paper
  – Write a one-page summary of the paper
  – Present the basic ideas of the paper to the class (10-15 minutes)
• Emphasis to be placed on the motivation for a particular statistical methodology within the application context:
  – What is the general objective of the paper?
  – What data are they using?
  – What statistical approach/method is proposed? Why?
  – What has been done in the past?
  – How does the paper accomplish new domain insight using that method?

• Paper presented to class on last day of lecture: 6 September


Class Web Site

• http://www.research.att.com/~volinsky/DataMining

• Lists of papers for presentation
• Announcements
• Links to other data mining course notes, R tutorials, resources

• Email:– [email protected]


Resources

• Data mining is a new field and, as such, does not have authoritative texts (yet).
• This class draws from many sources; the best are:
  – “Principles of Data Mining”, Hand, Mannila and Smyth
  – “The Elements of Statistical Learning”, Hastie, Tibshirani, and Friedman
  – “Interactive and Dynamic Graphics for Data Analysis”, Cook and Swayne
  – Also good class notes available from other classes:
    • David Madigan, Rutgers
    • Di Cook, Iowa State
    • Padhraic Smyth, UC Irvine
    • Jiawei Han, Simon Fraser
    (see class web site for pointers to these notes, or just Google them!)
• Also many good tutorials and books on R (or S/Splus), both online and in the library.


Course Outline

• 6 days = 12 “units”; each unit is a lecture and a lab
• Units:
  – Intro to Data Mining
  – Data exploration and visualization
  – Data Mining Concepts
  – Regression Topics
  – Classification and Supervised Learning
  – Clustering and Unsupervised Learning
  – Text Mining and Information Retrieval
  – Web Mining and Social Networks 1
  – Web Mining and Social Networks 2
  – Assorted Topics
    • Advanced Classification - Neural networks, ensemble methods
    • Association Rules
    • Telecommunications Fraud
    • Proximity models for social networks
    • Support Vector Machines


What is Data Mining?

• Not well defined…
• Hand, Mannila, Smyth:
  – “data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner”

• Isn’t that the same as statistics?


Data Mining Enablers

• Explosion of data
• Fast and cheap computation and storage
  – Moore’s Law: processing doubles every 18 months
  – Disk storage doubles every 9 months
  – Database technology
• Competitive pressure in business
• New, successful models
  – SVM, boosting
• Commercial products
  – SAS, SPSS, Insightful, IBM, Oracle
• Open Source products
  – Weka
  – R
  – Both from NZ!

[Figure: Disk TB shipped per year, 1988-2000, log scale from 1E+3 to 1E+7 with an “ExaByte” level marked; disk TB growth: 112%/y; Moore's Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]
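
The annual growth rates quoted in the figure can be related to doubling times with a little arithmetic. The sketch below assumes steady exponential growth; the function name `doubling_months` is made up for this illustration.

```r
# Convert an annual growth rate into a doubling time, assuming steady
# exponential growth: (1 + rate)^t = 2  =>  t = log(2) / log(1 + rate).
doubling_months <- function(annual_rate) {
  12 * log(2) / log(1 + annual_rate)
}

# Annual growth rates quoted in the figure
moore_rate <- 0.587  # Moore's Law: 58.7% per year
disk_rate  <- 1.12   # disk TB shipped: 112% per year

doubling <- c(moore = doubling_months(moore_rate),
              disk  = doubling_months(disk_rate))
print(round(doubling, 1))
```

Under this assumption, 58.7% per year corresponds to a doubling time of about 18 months, and 112% per year to roughly 11 months.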


Data Mining vs. Statistics

• Statistics is known for:
  – well defined hypotheses used to learn about a
  – specifically chosen population studied using
  – carefully collected data providing inferences with
  – well known properties.

• Data mining isn’t that careful. It is:
  – data driven discovery of
  – models and patterns from
  – massive and
  – observational data sets


Two Types of Data

• Experimental Data
  – Hypothesis H
  – design an experiment to test H
  – collect data, infer how likely it is that H is true
  – e.g., clinical trials in medicine

• Observational or Retrospective or Secondary Data
  – massive non-experimental data sets
    • e.g., Web logs, human genome, atmospheric simulations, etc.
  – assumptions of experimental design no longer valid
  – how can we use such data to do science?
    • use the data to support model exploration, hypothesis testing


Data-Driven Discovery

• Observational data
  – cheap relative to experimental data

• Examples:
  – Transaction data archives for retail stores, airlines, etc.
  – Web logs for Amazon, Google, etc.
  – The human/mouse/rat genome
  – Etc., etc.

• Makes sense to leverage available data
• Useful (?) information may be hidden in vast archives of data


Data Mining v. Statistics

• Traditional statistics
  – first hypothesize, then collect data, then analyze
  – often model-oriented (strong parametric models)

• Data mining:
  – few if any a priori hypotheses
  – data is usually already collected a priori
  – analysis is typically data-driven, not hypothesis-driven
  – often algorithm-oriented rather than model-oriented

• Different?
  – Yes, in terms of culture, motivation; however…
  – statistical ideas are very useful in data mining, e.g., in validating whether discovered knowledge is useful
  – Increasing overlap at the boundary of statistics and DM, e.g., exploratory data analysis (based on pioneering work of John Tukey in the 1960s)


Data Mining: Confluence of Multiple Disciplines

[Diagram: Data Mining as the confluence of Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]

Different fields have different views of what data mining is


Data Data Data

• It’s all about the data - where does it come from?
  – www
  – NASA
  – Business processes/transactions
  – Telecommunications and networking
  – Medical imagery
  – Government, census, demographics
  – Sensor networks, RFID tags
  – Sports


Flat File or Vector Data

• Rows = objects
• Columns = measurements on objects
  – Represent each row as a p-dimensional vector, where p is the dimensionality
• In effect, embed our objects in a p-dimensional vector space
• Often useful, but not always appropriate
• Both n and p can be very large in data mining
• Matrix can be quite sparse

[Figure: data matrix with n rows and p columns]
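
A small illustration of the n-by-p view in R, using the built-in iris data as a stand-in (the choice of data set and the variable names are assumptions made for this sketch):

```r
# Built-in iris data: n = 150 flowers (objects), p = 4 numeric measurements.
X <- as.matrix(iris[, 1:4])

n <- nrow(X)  # number of objects (rows)
p <- ncol(X)  # dimensionality (columns)

# One object is one row: a point in p-dimensional space
first_object <- X[1, ]

# Euclidean distance between the first two objects in that space
d12 <- sqrt(sum((X[1, ] - X[2, ])^2))

cat("n =", n, " p =", p, "\n")
```

Once objects are embedded as vectors, notions such as distance between objects become available, which many later methods rely on.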


Sparse Matrix (Text) Data

[Figure: sparse document-by-word matrix; x-axis: Word IDs (20 to 200), y-axis: Text Documents (50 to 500)]
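
A minimal sketch of how such a document-by-word matrix can arise, using base R only. The three toy documents are invented for illustration, and splitting on single spaces is a simplifying assumption.

```r
# Toy documents (invented for this sketch)
docs <- c("data mining finds patterns",
          "statistics tests hypotheses",
          "mining text data")

# Split each document into words and build the vocabulary (the "word IDs")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each document
counts <- vapply(tokens,
                 function(words) as.integer(table(factor(words, levels = vocab))),
                 integer(length(vocab)))

# Rows = documents, columns = words
dtm <- t(counts)
rownames(dtm) <- paste0("doc", seq_along(docs))
colnames(dtm) <- vocab
print(dtm)

# Fraction of entries that are zero
sparsity <- mean(dtm == 0)
```

Even in this tiny example more than half of the entries are zero; with a realistic vocabulary the matrix is far sparser, so only the nonzero entries would normally be stored.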


Sequence (Web) Data

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
…

[Figure: the slide pairs these log records with page sequences labeled User 1 to User 5; the sequence strings as extracted are 5115, 11111151511151, 77777777, 111333, 3333131113332232, …]
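
One way to turn raw log records like the ones on this slide into per-user page sequences is sketched below in base R. It assumes, from inspection of the records, that the first comma-separated field is the client address and the fourteenth is the requested page; the slide does not document the log format, so treat these field positions as assumptions.

```r
# First five records from the slide, one per element
log_lines <- c(
  "128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,",
  "128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,",
  "128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,",
  "128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,",
  "128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,"
)

# Split each record on commas (plus any following whitespace)
fields <- strsplit(log_lines, ",\\s*")

# Assumed field positions: 1 = client address, 14 = requested page
client <- vapply(fields, function(f) f[1], character(1))
page   <- vapply(fields, function(f) f[14], character(1))

# Group requested pages by client, preserving the order of the log
page_sequences <- split(page, client)
print(page_sequences)
```

Mapping each page to a category code would then give short symbolic sequences per user, similar in spirit to the ones shown in the figure.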


Time Series Data

[Figure: trajectories of centroids of moving hand in video streams; x-axis: Time (0 to 30), y-axis: X-Position (40 to 160)]


Image Data



Spatio Temporal Data

• http://senseable.mit.edu/nyte/nyte-globe-encounters.mov




Relational Data

Algorithms for estimating relative importance in networks S. White and P. Smyth, ACM SIGKDD, 2003.





Examples of Data Mining Successes

• Market Basket (WalMart)
• Recommender Systems (Amazon.com)
• Fraud Detection in Telecommunications (AT&T)
• Target Marketing / CRM
• Financial Markets
• DNA Microarray analysis
• Biometrics (fingerprinting, handwriting)
• Web Traffic / Blog analysis


Examples of Data Mining Successes

• Google is a company built on data mining
• PageRank mined the web to build better search
• Google as spell checker
• Google as ad placer
• Google as news aggregator
• Google as face recognizer





The Data Mining Process

• Often called KDD - Knowledge Discovery in Databases

• Analysis is just one part of the process
  – Data collection and storage
  – Data cleaning
  – Data sampling
  – Analysis
  – Decision making


Different Data Mining Tasks

• Exploratory Data Analysis

• Descriptive Modeling

• Predictive Modeling

• Discovering Patterns and Rules

• + others….


Exploratory Data Analysis

• Getting an overall sense of the data set
  – Computing summary statistics:
    • Number of distinct values, max, min, mean, median, variance, skewness, ...

• Visualization is widely used
  – 1d histograms
  – 2d scatter plots
  – Higher-dimensional methods

• Useful for data checking
  – E.g., finding that a variable is always integer valued or positive
  – Finding that some variables are highly skewed

• Simple exploratory analysis can be extremely valuable
  – You should always “look” at your data before applying any data mining algorithms
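
The checks listed above might look like this in R. This is a sketch on the built-in iris data (an assumed stand-in for your own data set); base R has no skewness function, so a simple moment-based version is defined here.

```r
x <- iris$Sepal.Length

# Simple moment-based skewness (base R does not provide one)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Summary statistics
stats <- c(distinct = length(unique(x)),
           min      = min(x),
           max      = max(x),
           mean     = mean(x),
           median   = median(x),
           variance = var(x),
           skewness = skewness(x))
print(round(stats, 3))

# Data checking: is the variable always positive? always integer valued?
all_positive <- all(x > 0)
all_integer  <- all(x == round(x))

# Visualization: 1d histogram, 2d scatter plot, scatter plot matrix
hist(x, main = "Sepal.Length", xlab = "cm")
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal.Length", ylab = "Petal.Width")
pairs(iris[, 1:4], col = iris$Species)
```

The `pairs()` call produces a scatter plot matrix, the higher-dimensional display used on the next slide.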


Example of Exploratory Data Analysis (Pima Indians data, scatter plot matrix)


Descriptive Modeling

• Goal is to build a “descriptive” model
  – e.g., a model that could simulate the data if needed
  – models the underlying process

• Examples:
  – Density estimation:
    • estimate the joint distribution P(x1, …, xp)
  – Cluster analysis:
    • Find natural groups in the data
  – Dependency models among the p variables
    • Learning a Bayesian network for the data
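
Two of these descriptive tasks can be tried in a few lines of base R. The sketch below uses simulated one-dimensional data (two invented groups), a kernel density estimate via `density()`, and k-means via `kmeans()`; these are convenient stand-ins, not necessarily the methods used later in the course.

```r
set.seed(1)

# Simulated data: two groups with different means (invented for illustration)
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))

# Density estimation: kernel estimate of the distribution of x
dens <- density(x)
# Trapezoid-rule check that the estimated density integrates to about 1
area <- sum(diff(dens$x) * (head(dens$y, -1) + tail(dens$y, -1)) / 2)
plot(dens, main = "Kernel density estimate")

# Cluster analysis: look for 2 natural groups
km <- kmeans(matrix(x, ncol = 1), centers = 2, nstart = 10)
centers <- sort(as.numeric(km$centers))
print(centers)
print(km$size)
```

The density plot shows two modes, and the cluster centers land near the two group means used in the simulation.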


Example of Descriptive Modeling

[Figure: scatter plot “Anemia Patients and Controls”; x-axis: Red Blood Cell Volume (3.3 to 4), y-axis: Red Blood Cell Hemoglobin Concentration (3.7 to 4.4); groups: Anemia Group, Control Group]


Example of Descriptive Modeling

[Figure: two panels on the same axes; first panel “Anemia Patients and Controls” (Anemia Group, Control Group), second panel “EM Iteration 25”]
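
The figure's “EM Iteration 25” label refers to the EM algorithm. As a rough illustration of what EM iterations do, here is a sketch for a two-component Gaussian mixture in one dimension, on simulated data. The slide's example is two-dimensional and its model is not specified here, so the data, the starting values, and the one-dimensional simplification are all assumptions.

```r
set.seed(1)

# Simulated data from two groups (invented for illustration)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))

# Starting values: mixing weight, component means, component sds
pi1   <- 0.5
mu    <- c(min(x), max(x))
sigma <- c(sd(x), sd(x))

for (iter in 1:25) {
  # E-step: probability that each point belongs to component 1
  d1 <- pi1 * dnorm(x, mu[1], sigma[1])
  d2 <- (1 - pi1) * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2)
  r2 <- 1 - r1

  # M-step: re-estimate the parameters using those probabilities as weights
  pi1   <- mean(r1)
  mu    <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
  sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
             sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
}

print(round(c(pi1 = pi1, mu1 = mu[1], mu2 = mu[2],
              sigma1 = sigma[1], sigma2 = sigma[2]), 2))
```

After 25 iterations the estimates sit close to the values used to simulate the data, without ever being told which point came from which group.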


WebCanvas algorithm and software - currently in new SQLServer









Predictive Modeling

• Predict one variable Y given a set of other variables X
  – Here X could be a p-dimensional vector
  – Classification: Y is categorical
  – Regression: Y is real-valued

• In effect this is function approximation, learning the relationship between Y and X

• Many, many algorithms for predictive modeling in statistics and machine learning

• Often the emphasis is on predictive accuracy, less emphasis on understanding the model
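
A minimal sketch of the two cases in R, using the built-in mtcars data (an assumed example data set): linear regression for a real-valued Y and logistic regression for a categorical Y. These are only two of the many possible algorithms.

```r
# Regression: Y (mpg) is real-valued, X is car weight
reg <- lm(mpg ~ wt, data = mtcars)
pred_mpg <- predict(reg, newdata = data.frame(wt = 3))

# Classification: Y (am: 0 = automatic, 1 = manual) is categorical
clf <- glm(am ~ wt, data = mtcars, family = binomial)
prob_manual <- predict(clf, type = "response")  # estimated P(am = 1 | wt)
pred_class  <- as.integer(prob_manual > 0.5)

# Predictive accuracy on the training data (optimistic; use held-out
# data for an honest estimate)
accuracy <- mean(pred_class == mtcars$am)
print(pred_mpg)
print(accuracy)
```

In both cases the fitted object is a learned function from X to Y; whether its coefficients are interpretable is a separate question from how well it predicts.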


Predictive Modeling: Fraud Detection

• Telecommunications fraud detection
  – Fraud costs telecommunication companies US$ Billions per year
  – very few transactions are fraudulent, but they are costly

• Approach
  – For each transaction estimate “fraudiness”.
  – Based on known fraud AND known user behavior
  – High probability cases investigated by fraud police

• Example models:
  – anomaly detection
  – guilt by association

• Issues
  – Significant feature engineering/preprocessing
  – false alarm rate vs missed detection: what is the tradeoff?
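
The tradeoff between false alarms and missed detections can be explored by sweeping the threshold applied to a “fraudiness” score. The sketch below uses entirely synthetic scores (the 1% fraud rate and the beta distributions are invented for illustration, not taken from any real fraud system).

```r
set.seed(42)

# Synthetic transactions: about 1% fraudulent (invented numbers)
n <- 10000
is_fraud <- rbinom(n, 1, 0.01) == 1

# Synthetic "fraudiness" scores in [0, 1]: fraud tends to score higher
score <- ifelse(is_fraud, rbeta(n, 5, 2), rbeta(n, 2, 5))

# Error rates when every transaction with score >= threshold is flagged
rates <- function(threshold) {
  flagged <- score >= threshold
  c(threshold             = threshold,
    false_alarm_rate      = mean(flagged[!is_fraud]),   # legitimate but flagged
    missed_detection_rate = mean(!flagged[is_fraud]))   # fraud but not flagged
}

tradeoff <- t(sapply(c(0.3, 0.5, 0.7, 0.9), rates))
print(round(tradeoff, 3))
```

Raising the threshold lowers the false alarm rate but raises the missed detection rate; because legitimate transactions vastly outnumber fraudulent ones, even a small false alarm rate can swamp investigators.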


Predictive Modeling: Other Examples

• Risk Management:
  – Example: Credit card company wants to do risk management.
  – How would you do this?

• Netflix Prize
  – US$1M prize to make better movie recommendations.
  – How would you do this?









Pattern Discovery

• Goal is to discover interesting “local” patterns in the data rather than to characterize the data globally

• Given market basket data we might discover that
  – If customers buy wine and bread then they buy cheese with probability 0.9
  – These are known as “association rules”

• Given multivariate data on astronomical objects
  – We might find a small group of previously undiscovered objects that are very self-similar in our feature space, but are very far away in feature space from all other objects
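
The probability attached to an association rule is a conditional frequency, often called the rule's confidence. A minimal sketch on six invented baskets (so the numbers differ from the 0.9 quoted above):

```r
# Toy market baskets (invented for illustration)
baskets <- list(
  c("wine", "bread", "cheese"),
  c("wine", "bread", "cheese", "olives"),
  c("wine", "bread"),
  c("bread", "milk"),
  c("wine", "cheese"),
  c("wine", "bread", "cheese")
)

# Does a basket contain every item in `items`?
has_all <- function(basket, items) all(items %in% basket)

# Rule: {wine, bread} => {cheese}
lhs <- c("wine", "bread")
rhs <- "cheese"

n_lhs  <- sum(sapply(baskets, has_all, items = lhs))          # baskets with wine and bread
n_both <- sum(sapply(baskets, has_all, items = c(lhs, rhs)))  # ... that also contain cheese

support    <- n_both / length(baskets)  # fraction of all baskets matching the whole rule
confidence <- n_both / n_lhs            # estimate of P(cheese | wine, bread)
cat("support =", support, " confidence =", confidence, "\n")
```

Association rule algorithms search over many candidate rules and keep those whose support and confidence exceed chosen thresholds.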


Example of Pattern Discovery

• IBM “Advanced Scout” System
  – Bhandari et al. (1997)
  – Every NBA basketball game is annotated,
    • e.g., time = 6 mins, 32 seconds; event = 3 point basket; player = Michael Jordan
    • This creates a huge untapped database of information
  – IBM algorithms search for rules of the form “If player A is in the game, player B’s scoring rate increases from 3.2 points per quarter to 8.7 points per quarter”


Data Mining Pitfalls

• Is data mining always necessary?
  – Just because you have a terabyte doesn’t mean you need to use it.

• Privacy concerns
  – Differ by country, industry, application, generation

• Meaningfulness of patterns unclear
  – Rhine paradox
  – Terrorism
  – DM has a lot to learn from statistics!


Rhine Paradox

• Joseph (J. B.) Rhine: parapsychologist who studied ESP (he was a believer!)

• He devised an experiment where subjects were asked to guess 10 hidden cards, each red or blue.

• He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

• He told these people they had ESP and called them in for another test of the same type.

• Alas, he discovered that almost all of them had lost their ESP.

• What is the conclusion?
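
The arithmetic behind the paradox, as a short R sketch. It assumes each guess is an independent 50/50 coin flip; the number of subjects (1000) is a hypothetical figure chosen for illustration.

```r
# Chance of guessing all 10 red/blue cards correctly by luck alone
p_all_right <- 0.5^10          # = 1/1024, i.e. almost 1 in 1000

# Hypothetical number of subjects tested
n_subjects <- 1000

# Expected number of "ESP" subjects if everyone is just guessing
expected_lucky <- n_subjects * p_all_right

# Probability that at least one subject gets all 10 right by luck
p_at_least_one <- 1 - (1 - p_all_right)^n_subjects

# Chance that a lucky subject repeats the feat on the retest
p_repeat <- p_all_right

print(c(p_all_right = p_all_right,
        expected_lucky = expected_lucky,
        p_at_least_one = p_at_least_one))
```

Pure guessing already predicts roughly one perfect score per thousand subjects, and almost no repeat performances; test enough people and some pattern will appear by chance.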


Data Mining Pitfalls

• PR Problems: data mining as a four letter word?

– ...increasingly people’s data is at risk. The old ways ...are still at use like dumpster diving, stealing from mailboxes, physical theft, and credit card receipt copying. New tactics include disparate techniques of phishing, email fraud, data mining, spam, key-logging and an array of other technological processes. - Steven D. Domenikos, IdentityTruth, 2008

– One place oversight is sorely lacking is in the whole matter of data mining. ...What have they contributed? Not a single case comes to mind in which security services apprehended a terrorist following identification by data mining. ...that huge database will be out there, win or lose, for some government agency to divert to its purposes or some hacker to turn to private gain or crime. - John Prados, TomPaine.com


Fighting Terrorism in the US

• US Government is widely known to be collecting lots of data on Americans and using data mining to look for patterns consistent with terrorist activity.

• Bruce Schneier, Wired Magazine, “Why Data Mining Won’t Stop Terror”:

• Assume:
  – 1 in 100 false positive rate (99% accurate)
  – 1 in 1000 false negative rate
  – 1 trillion events (phone calls, credit card transactions, emails) per year
  – 10 are really terrorist plots

• Then:
  – 1 billion false alarms for every true plot uncovered
  – 27 million leads daily
  – Even at 99.9999% accuracy: 2,750 false alarms per day
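
The conclusions follow from the assumptions by simple arithmetic, sketched here in R. The figures are mutually consistent when the trillion events are counted per year, which is the reading used below; the variable names are made up for this sketch.

```r
# Assumptions
events_per_year     <- 1e12      # phone calls, transactions, emails, ...
true_plots          <- 10
false_positive_rate <- 1 / 100
false_negative_rate <- 1 / 1000

# Innocent events wrongly flagged
false_alarms_per_year <- (events_per_year - true_plots) * false_positive_rate

# True plots actually flagged
plots_uncovered <- true_plots * (1 - false_negative_rate)

false_alarms_per_plot <- false_alarms_per_year / plots_uncovered  # about 1 billion
false_alarms_per_day  <- false_alarms_per_year / 365              # about 27 million

# With a far better false positive rate of 1 in a million (99.9999% accurate)
better_per_day <- (events_per_year - true_plots) * 1e-6 / 365

print(c(per_plot = false_alarms_per_plot,
        per_day = false_alarms_per_day,
        better_per_day = better_per_day))
```

The base rate dominates: when true plots are this rare, nearly every alarm is false, no matter how accurate the detector looks on paper.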


Data Mining Software: Introduction to R


Data Mining Software

• What is R?
  – Open source statistical software
  – Grew out of S, S+
  – www.r-project.org
  – http://cran.stat.auckland.ac.nz/

• R Tutorials available online (see website and CRAN)

• Great graphics


R examples

> x = 5
> y = rnorm(1000, -1, 3.5)      # 1000 draws from a normal with mean -1, sd 3.5
> hist(y)
> ?hist                         # help page for hist
> hist(y, nclass = 20, col = "orange")
> summary(y)
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 -12.7400  -3.3080  -0.8247  -0.8101   1.5820  10.5500
> t.test(y, mu = 0)

data:  y
t = -7.1286, df = 999, p-value = 1.942e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -1.0330474 -0.5870667
sample estimates:
mean of x
-0.810057

> mydata = read.table("iris.dat", sep = " ")
> summary(mydata)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500






R Examples

• Data stored in “data frames”
  – read.table or read.csv reads data into a data frame

• Check the contents of ‘mydata’
  – summary(mydata)
  – names(mydata) – lists all the variable names
  – mydata[1,] – shows only the first row of data
  – mydata[,1] – shows only the first column
  – mydata[,2:5] – selects columns 2 through 5
  – mydata$Sepal.Length – shows all values of the variable “Sepal.Length”
  – attach(mydata) – allows you to access Sepal.Length and other columns directly
  – mydata[Sepal.Length > 3,] – selects the rows that satisfy a condition (after attach)
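
The same operations can be tried without an external file by using R's built-in copy of the iris data (an assumption made so that the sketch runs as-is; the threshold of 7 is arbitrary). The conditional selection is written with `mydata$` so that it also works without `attach()`.

```r
# Built-in iris data standing in for the file read with read.table
mydata <- iris

names(mydata)                       # variable names
first_row   <- mydata[1, ]          # first row (still a data frame)
first_col   <- mydata[, 1]          # first column (a numeric vector)
cols_2_to_5 <- mydata[, 2:5]        # columns 2 through 5
sepal_len   <- mydata$Sepal.Length  # one variable by name

# Conditional row selection without attach()
long_sepals <- mydata[mydata$Sepal.Length > 7, ]

# with() gives attach-like access to columns inside one expression
with(mydata, mean(Sepal.Length[Species == "setosa"]))
```

Referring to columns through `mydata$` or `with()` avoids the name clashes that `attach()` can cause when several data frames share column names.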


R Examples

• Modelling requires formula notation
  – lm.out = lm(y~x)
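
A few formula variants, sketched on the built-in iris data (the data set and the object names other than `lm.out` are assumptions made for illustration). The left side of `~` is the response; the right side lists the predictors.

```r
# One predictor: Sepal.Length modelled as a linear function of Petal.Length
lm.out <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Several predictors are joined with "+"; factors such as Species are
# expanded into indicator variables automatically
lm.multi <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# "." means "all other columns in the data frame"
lm.all <- lm(Sepal.Length ~ ., data = iris)

summary(lm.out)   # coefficients, standard errors, R-squared
coef(lm.multi)    # just the fitted coefficients
```

The same formula notation is shared by most R modelling functions, so it carries over to the other models used in this class.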


Lab #1

• R Tutorial
  – Courtesy of Di Cook
  – Work your way through the R tutorial (intro-R.pdf).
  – Code is available in the file introductory-code.txt
  – Input spam data (collected at Iowa State University)
  – Find summaries and simple manipulations
  – Write R functions using function() and apply()
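
As a warm-up for the last item, here is a small sketch of `function()` and `apply()` on the built-in iris data. The lab itself uses the spam data; the helper name `col_range` is made up for this example.

```r
# A user-defined function: range (max minus min) of a numeric vector
col_range <- function(v) {
  max(v) - min(v)
}

# apply() works on matrices: MARGIN = 2 applies the function to each column,
# MARGIN = 1 to each row
X <- as.matrix(iris[, 1:4])
ranges   <- apply(X, 2, col_range)
row_sums <- apply(X, 1, sum)

print(ranges)
```

The same pattern (write a small function, then apply it over rows or columns) covers many of the summaries and manipulations asked for in the lab.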