Instructor: Jinze Liu Spring 2014 CS 685 Special Topics in Data mining


Welcome!
Instructor: Jinze Liu
Homepage: http://www.cs.uky.edu/~liuj
Office: 235 Hardymon Building
Email: [email protected]

Overview
Time: TR 2pm-3:15pm
Office hour: by appointment
Credit: 3
Preferred prerequisites: Data Structures, Algorithms, Databases, AI, Machine Learning, Statistics.

Overview
Textbook:
  Mining of Massive Datasets. Can be accessed for free at http://infolab.stanford.edu/~ullman/mmds/book.pdf
  A collection of papers in recent conferences and journals
Other References
  Data Mining --- Concepts and Techniques, by Han and Kamber, Morgan Kaufmann. (ISBN:1-55860-901-6)
  Introduction to Data Mining, by Tan, Steinbach, and Kumar, Addison Wesley. (ISBN:0-321-32136-7)
  Principles of Data Mining, by Hand, Mannila, and Smyth, MIT Press. (ISBN:0-262-08290-X)
  The Elements of Statistical Learning --- Data Mining, Inference, and Prediction, by Hastie, Tibshirani, and Friedman. (ISBN:0-387-95284-5)

Overview
Grading scheme
  4-6 Homeworks    30%
  1 Exam           20%
  1 Presentation   20%
  1 Project        30%

Overview
Project
  Individual project or team project (no more than 2 students)
Options
  Development of original algorithms
  Application of existing algorithms to solve a real-world problem.

Examples
MIT big data challenge
  http://bigdatachallenge.csail.mit.edu/datasets
Data visualization
  SAP Lumira
  http://scn.sap.com/community/lumira/blog/2014/01
  Dataset examples: http://scn.sap.com/docs/DOC-31433
Digging into reviews?
  http://archive.ics.uci.edu/ml/datasets/OpinRank+Review+Dataset

Overview
Paper presentation
  One per student
  A talk on one of the latest research developments in data mining. It should also be related to your project.
Research paper(s)
  Your own pick (upon approval)
Three parts
  Motivation for the research
  Review of data mining methods
  Discussion
Questions and comments from the audience
Class participation: one question/comment per student
Order of presentation: will be arranged according to the topics.

Introduction to Data Mining

Why Mine Data?
Lots of data is being collected and warehoused
  Public web data
  Social networks
  Reviews
  Purchases at department/grocery stores
  Bank/credit card transactions
  ……
Storage and computing have become cheaper and more powerful

The Growth of Data

The Big Data Economy
The basis for competitive advantage
  Customer profiling and targeting as well as predictive analytics
  Optimize service and maintenance (e.g. GE)
  New products, features and value-adding services (e.g. LinkedIn, Facebook and others)
http://www.ijento.com/blog/what-is-big-data-a-guide-for-cmos/

Data Scientist
Data scientist – the sexiest job of the 21st century (Harvard Business Review)
To find insight in data
Skills needed?
  writing code
  being curious about discovery
  solid foundation in math, statistics, probability and computer science
  communication
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

What is Data Mining?
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Cultures
Databases: concentrate on large-scale (non-main-memory) data.
AI (machine-learning): concentrate on complex methods, small data.
Statistics: concentrate on models.

Models vs. Analytic Processing
To a database person, data-mining is an extreme form of analytic processing – queries that examine large amounts of data.
  Result is the query answer.
To a statistician, data-mining is the inference of models.
  Result is the parameters of the model.

(Way too Simple) Example
Given a billion numbers, a DB person would compute their average and standard deviation.
A statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation of that distribution.
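The contrast between the two views can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set is synthetic (10,000 draws rather than a billion, with a made-up mean of 50 and spread of 5), and the "best Gaussian" is taken to be the one returned by the standard library's `statistics.NormalDist.from_samples`.

```python
import random
import statistics

# Synthetic stand-in for the "billion numbers" (assumed values, much smaller).
random.seed(0)
data = [random.gauss(50.0, 5.0) for _ in range(10_000)]

# Database view: the result is the answer to an aggregate query.
db_mean = statistics.fmean(data)
db_std = statistics.pstdev(data)

# Statistical view: the result is the parameters of a fitted model.
model = statistics.NormalDist.from_samples(data)

print(f"query answer: mean={db_mean:.3f}, std={db_std:.3f}")
print(f"fitted model: mu={model.mean:.3f}, sigma={model.stdev:.3f}")

# Unlike the query answer, the model can be asked about unseen values.
print(f"P(x <= 60) under the model: {model.cdf(60.0):.4f}")
```

The two pairs of numbers are nearly identical, which is why the example is "way too simple"; the difference is in what the result is taken to mean.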

Examples
Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.
(b) Dividing the customers of a company according to their profitability.
(c) Predicting the future stock price of a company using historical records.

Examples

(a) Dividing the customers of a company according to their gender.

No. This is a simple database query.

(b) Dividing the customers of a company according to their profitability.

No. This is an accounting calculation, followed by the application of a threshold. However, predicting the profitability of a new customer would be data mining.

(c) Predicting the future stock price of a company using historical records.

Yes. We would attempt to create a model that can predict the continuous value of the stock price. This is an example of the area of data mining known as predictive modelling. We could use regression for this modelling, although researchers in many fields have developed a wide variety of techniques for predicting time series.
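As a rough illustration of the regression approach mentioned in (c), the sketch below fits a straight line to a short price history by ordinary least squares and extrapolates one day ahead. The prices are made up for the example and `fit_line` is a hypothetical helper; real stock prediction would call for proper time-series methods.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up closing prices for six consecutive trading days.
prices = [10.0, 10.4, 11.1, 11.5, 12.1, 12.4]
days = list(range(len(prices)))

slope, intercept = fit_line(days, prices)

# Predict the continuous value for the next (unseen) day.
next_day = len(prices)
forecast = slope * next_day + intercept
print(f"trend: {slope:.2f} per day, forecast for day {next_day}: {forecast:.2f}")
```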

Meaningfulness of Answers
A big data-mining risk is that you will “discover” patterns that are meaningless.
Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

Examples of Bonferroni’s Principle
A big objection to TIA (Total Information Awareness) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents’ privacy.
The Rhine Paradox: a great example of how not to conduct scientific research.

The “TIA” Story
Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.

The Details
$10^9$ people being tracked.
1000 days.
Each person stays in a hotel 1% of the time (10 days out of 1000).
Hotels hold 100 people (so $10^5$ hotels).
If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?

Calculations – (1)
Probability that given persons p and q will be at the same hotel on given day d:
  $1/100 \times 1/100 \times 10^{-5} = 10^{-9}$
  (first $1/100$: p at some hotel; second $1/100$: q at some hotel; $10^{-5}$: same hotel)
Probability that p and q will be at the same hotel on given days $d_1$ and $d_2$:
  $10^{-9} \times 10^{-9} = 10^{-18}$
Pairs of days: $5 \times 10^5$.

Calculations – (2)
Probability that p and q will be at the same hotel on some two days:
  $5 \times 10^5 \times 10^{-18} = 5 \times 10^{-13}$
Pairs of people: $5 \times 10^{17}$.
Expected number of “suspicious” pairs of people:
  $5 \times 10^{17} \times 5 \times 10^{-13}$ = 250,000.
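The arithmetic above can be checked with a short script. This is a sketch that uses exact pair counts from `math.comb` instead of the rounded figures on the slides, so the result comes out slightly under 250,000; the variable names are chosen for the example.

```python
from math import comb

people = 10**9        # people being tracked
days = 1000           # days observed
p_in_hotel = 0.01     # chance a person is in some hotel on a given day
hotels = 10**5        # number of hotels

# p and q are both in a hotel, and it is the same one, on a given day.
p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels

# The same thing happens on two given days.
p_same_hotel_two_days = p_same_hotel_one_day ** 2

pairs_of_days = comb(days, 2)        # about 5 * 10^5
pairs_of_people = comb(people, 2)    # about 5 * 10^17

# Expected number of pairs that look "suspicious" purely by chance.
expected_suspicious = pairs_of_people * pairs_of_days * p_same_hotel_two_days
print(f"expected suspicious pairs: {expected_suspicious:,.0f}")
```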

Conclusion
Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
Analysts have to sift through 250,010 candidates to find the 10 real cases.
  Not gonna happen.
  But how can we improve the scheme?

Moral
When looking for a property (e.g., “two people stayed at the same hotel twice”), make sure that the property does not allow so many possibilities that random data will surely produce facts “of interest.”

Rhine Paradox – (1)
Joseph Rhine was a parapsychologist in the 1950’s who hypothesized that some people had Extra-Sensory Perception.
He devised (something like) an experiment where subjects were asked to guess 10 hidden cards – red or blue.
He discovered that almost 1 in 1000 had ESP – they were able to get all 10 right!

Rhine Paradox – (2)
He told these people they had ESP and called them in for another test of the same type.
Alas, he discovered that almost all of them had lost their ESP.
What did he conclude?
  Answer on next slide.

Rhine Paradox – (3)
He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
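A quick check of the numbers in the Rhine story: guessing 10 red/blue cards at random succeeds with probability $(1/2)^{10} = 1/1024$, which is the "almost 1 in 1000" observed. The simulation below is a sketch under assumed values (100,000 subjects, a fixed random seed, and a hypothetical helper `count_perfect_guessers`); it shows how many pure guessers pass the first test and how few of them pass it again.

```python
import random

def count_perfect_guessers(num_subjects, rng, num_cards=10):
    """Count subjects who guess every red/blue card right by pure chance."""
    perfect = 0
    for _ in range(num_subjects):
        # Each guess is an independent coin flip with a 50% chance of success.
        if all(rng.random() < 0.5 for _ in range(num_cards)):
            perfect += 1
    return perfect

p_all_correct = 0.5 ** 10            # 1/1024, "almost 1 in 1000"

rng = random.Random(42)              # fixed seed, chosen arbitrarily
num_subjects = 100_000               # assumed size of the subject pool

first_round = count_perfect_guessers(num_subjects, rng)
second_round = count_perfect_guessers(first_round, rng)   # retest the "ESP" group

print(f"chance of 10 correct guesses: {p_all_correct:.6f}")
print(f"subjects with 'ESP' after round 1: {first_round}")
print(f"of those, still perfect in round 2: {second_round}")
```

No ESP is needed to explain either result: with enough subjects some perfect scores are expected, and retesting those subjects is expected to produce almost none.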

Moral
Understanding Bonferroni’s Principle will help you look a little less stupid than a parapsychologist.

Data Mining Tasks
Prediction Methods
  Use some variables to predict unknown or future values of other variables.
Description Methods
  Find human-interpretable patterns that describe the data.

From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Regression [Predictive]
Semi-supervised Learning
  Semi-supervised Clustering
  Semi-supervised Classification

Data Mining Tasks Covered in This Course
Classification [Predictive]
Association Rule Discovery [Descriptive]
Clustering [Descriptive]
Deviation Detection [Predictive]
Semi-supervised Learning
  Semi-supervised Clustering
  Semi-supervised Classification

Survey
Why are you taking this course?
What would you like to gain from this course?
What topics are you most interested in learning about from this course?
Any other suggestions?

KDD References
Data mining and KDD (SIGKDD: CDROM)
  Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  Journals: Data Mining and Knowledge Discovery, KDD Explorations
Database systems (SIGMOD: CD ROM)
  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
AI & Machine Learning
  Conferences: Machine learning (ICML), AAAI, IJCAI, COLT (Learning Theory), etc.
  Journals: Machine Learning, Artificial Intelligence, etc.

KDD References
Statistics
  Conferences: Joint Stat. Meeting, etc.
  Journals: Annals of Statistics, etc.
Bioinformatics
  Conferences: ISMB, RECOMB, PSB, CSB, BIBE, etc.
  Journals: J. of Computational Biology, Bioinformatics, etc.
Visualization
  Conference proceedings: InfoVis, CHI, ACM-SIGGRAPH, etc.
  Journals: IEEE Trans. Visualization and Computer Graphics, etc.