CES 514 – Data Mining Spring 2010 Sonoma State University

CES 514 – Data Mining

Spring 2010Sonoma State University

Course Details: Instructor: Bala Ravikumar (Ravi)

Email: [email protected], [email protected] Tel: (707) 664 3335 Office: Darwin Hall 116 I

Course Web Page http://ravi.cs.sonoma.edu/~ravi/ces514sp10

Lecture time: 6 to 8:45 PM, Wednesdays

Room: Salazar Hall 2003 Office hours: M 9 – 10, T 11 – 12, W 5 – 6

Prerequisites

basic probability and statistics (probability distribution, random variable, conditional probability etc.)

algorithms and data structures (sorting, hashing, binary trees, algorithm design techniques)

Programming in high-level language (Java, Python, Matlab, c#, …)

Linear algebra (vectors, linear independence, matrix rank, Gaussian elimination etc.)

These topics will be reviewed. However, it will be helpful to spend some time on your own to familiarize yourself.

Text book

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

Web site for the text:

http://nlp.stanford.edu/IR-book/information-retrieval-book.html

This book’s focus is on WEB DATA MINING

Additional references

Mining the Web, S.Chakrabarti, MKP. Data Mining, Witten and Frank, MKP. The elements of statistical learning, Hastie, Tibshirani,

and Friedman, Springer-Verlag. Web Data Mining: Exploring Hyperlinks, Contents and

Usage data, Bing Liu, Springer-Verlag. Introduction to Data Mining, Pang-Ning Tan, Michael

Steinbach, and Vipin Kumar, Pearson/Addison Wesley.

Overlapping fields

Statistics Artificial intelligence (machine learning) Data base and Information retrieval Natural language processing Algorithm design and analysis

Grading

Quiz: 10% Home Work: 25 % Midterm: 15%

One mid-term, in-class, open book/notes? Final Exam: 25%

In-class or take-home? Project: 25%

Individual, design and implementation

Example Projects from Fall 2005 and 2007 Strategy for predicting the winner in a game similar to Jai Alai. Hand-written character recognition classify the type of disease based on some test results classification of e-mail (junk vs. useful, personal vs. business vs. family etc.) classification of questions in a multiple choice test based on the responses of students identifying the author from a sample text implement an association rule mining algorithm implement a visualization algorithm that provides various options for viewing the data classifying mushroom into edible and poisonous based on a number of attributes – such as color, length of the stem, width etc. classifying web site based on content

Project is done individually, and is semester long - implement, test, write a paper, present in class.

Today’s lecture

Overview of the course Chapter 1 of the text

Overview of Topics

Web data organization Web search Classification (supervised learning) Clustering (unsupervised learning) Association rule mining Language models for information retrieval Vector space models SVM and other tools LSI and tools from linear algebra Link analysis Other applications – e.g. bioinformatics

What is data mining?

Data mining is also called knowledge discovery Data mining is

extraction of useful patterns from data sources, e.g., databases, texts, web, images, etc.

Patterns must be: valid, novel, potentially useful, understandable

Our focus will be on text data (in particular web)

Some sample problems in Data Mining Extract useful knowledge from the vast data and information available on the web. (e.g. tagging of web sites, labeling images, predict the needs of a web surfer from pattern of clicks.) Using the financial record of a person, determine the risk involved in giving a loan. (decision could be yes or no. more generally, it could be the type of loan – interest rate, duration etc.) movie (book etc.) recommendation based on prior choices. prediction of weather, traffic pattern, outcome of an event etc. From the items recorded in the check-out counter of a super market, determine any correlation between items being sold. (used to decide which ones to put on sale.) study and understand of social networks. rank web page according to significance.

Classic data mining tasks

Classificationmining patterns that can classify future (new) data into known

classes. Association rule mining

mining any rule of the form X Y, where X and Y are sets of data items.

Clusteringidentifying similar groups in the data

Regression analysis

Classic data mining tasks (contd)

Sequential pattern mining:A sequential rule: A B, says that event A will be immediately

followed by event B with a certain confidence Deviation detection:

discovering the most significant changes in data Data visualization: using graphical methods to show

patterns in data.

Why is data mining important?

Computerization of businesses produce huge amount of data How to make best use of data? Knowledge discovered from data can be used for

competitive advantage. Online businesses generate even larger data sets

Online retailers (e.g., amazon.com) are largely driven by data mining.

Web search engines are information retrieval and data mining companies

Why is data mining necessary?

Make use of your data assets There is a big gap from stored data to knowledge;

and the transition won’t occur automatically. Many interesting things you want to find cannot be

found using database queries“find me people likely to buy my products”“Who are likely to respond to my promotion?”“Which movies should be recommended to each customer?”

Why data mining now?

The data is abundant. The computing power is not an issue. Data mining tools are available The competitive pressure is very strong.

Almost every company is doing (or has to do) it Socio-political exigencies

Detecting terrorism activities New technologies

Streaming data, mobile computing, wireless networks

Related fields

Data mining is an multi-disciplinary field: Machine learning/artificial intelligence Statistics Databases Information retrieval Visualization Natural language processing Game theory

etc.

Data mining applications Marketing: customer profiling and retention, identifying

potential customers, market segmentation. Engineering: identify causes of problems in products. Scientific data analysis: weather prediction, financial data

analysis, image analysis etc. Fraud detection: identifying credit card fraud, intrusion

detection. Text and web: a huge number of applications … Bioinformatics: structure prediction, classification,

microarray analysis etc. Any application that involves a large amount of data …

Structural descriptions Example: if-then rules

Age Spectacle prescription

Astigmatism Tear production

rate

Recommended lenses

Young Myope No Reduced None

Young Hypermetrope No Normal Soft

Pre-presbyopic

Hypermetrope No Reduced None

Presbyopic Myope Yes Normal Hard

… … … … …

If tear production rate = reducedthen recommendation = none

Otherwise, if age = young and astigmatic = no then recommendation = soft

Classification vs. association rules Classification rule:

predicts value of a given attribute (the classification of an example)

Association rule:predicts value of arbitrary attribute (or combination)

If outlook = sunny and humidity = highthen play = no

If temperature = cool then humidity = normalIf humidity = normal and windy = false

then play = yesIf outlook = sunny and play = no

then humidity = highIf windy = false and play = no

then outlook = sunny and humidity = high

A decision tree for this problem

Example: 209 different computer configurations

Linear regression function

Predicting CPU performance

Cycle time (ns)

Main memory (Kb)

Cache (Kb)

Channels Performance

MYCT MMIN MMAX CACH CHMIN CHMAX PRP

1 125 256 6000 256 16 128 198

2 29 8000 32000 32 8 32 269

…

208 480 512 8000 32 0 0 67

209 480 1000 4000 0 0 0 45

PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX+ 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX

Spam filter software

Given below are the % of occurrences of a few select words in spam and genuine e-mail messages:

A decision list may be used to identify spam.

Text mining

Data mining on text Due to a huge amount of online texts on the Web and other

sources Text contains a huge amount of information of any

imaginable type! A major direction and tremendous opportunity!

Main topics Text classification and clustering Information retrieval Information extraction Opinion mining and summarization

Example: Opinion Mining

The Web has dramatically changed the way that people express their opinions.

One can post their opinions on almost anything at review sites, Internet forums, discussion groups, blogs, etc.

Product reviews Benefits of Review Analysis

Potential Customer: No need to read many reviews Product manufacturer: market intelligence, product

benchmarking

Feature Based Analysis & Summarization

Extracting product features (called Opinion Features) that have been commented on by customers.

Identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative.

Summarizing and comparing results.

An example GREAT Camera., Jun 3, 2004 Reviewer: jprice174 from Atlanta, Ga.

I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital. The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out. …

….

Summary:

Feature1: picturePositive: 12 The pictures coming out of this camera

are amazing. Overall this is a good camera with a

really good picture clarity. . . . .Negative: 2 The pictures come out hazy if your

hands shake even for a moment during the entire process of taking a picture.

Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.

Feature2: battery life…

Visual Comparison

Summary of reviews of Digital camera 1

Picture Battery Size Weight Zoom

Comparison of reviews of

Digital camera 1

Digital camera 2

+

_

_

+

Information retrieval – Ch 1 Boolean query

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Unstructured data in 1680

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?

Why is that not the answer? Slow (for large corpora) NOT Calpurnia is non-trivial Other operations (e.g., find the word Romans near

countrymen) not feasible Ranked retrieval (best documents to return)

Later lectures

31

Sec. 1.1

Term-document incidence

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0

Caesar 1 1 0 1 1 1

Calpurnia 0 1 0 0 0 0

Cleopatra 1 0 0 0 0 0

mercy 1 0 1 1 1 1

worser 1 0 1 1 1 0

1 if play contains word, 0 otherwise

Brutus AND Caesar BUT NOT Calpurnia

Sec. 1.1

Incidence vectors

So we have a 0/1 vector for each term. To answer query: take the vectors for Brutus, Caesar

and Calpurnia (complemented) bitwise AND. 110100 AND 110111 AND 101111 = 100100.

33

Sec. 1.1

Documents

CES 514 – Data Mining Spring 2010 Sonoma State University