49
CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong

CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

  • Upload
    others

  • View
    19

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

CS570 Introduction to Data Mining

Department of Mathematics and Computer Science

Li Xiong

Page 2: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Today

� Meeting everybody in class

� Course topics

� Course logistics

1/18/2011 Data Mining: Concepts and Techniques 2

Page 3: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Meet your Instructor

� http://www.mathcs.emory.edu/~lxiong� Have taught the class twice� Research areas: data privacy and security, information managementinformation management

� Committed to help you learn� If you have any questions, concerns, suggestions

� Come to my office hours (Tu and Th 3-4pm, WSC E412)

� Email me ([email protected])

1/18/2011 Data Mining: Concepts and Techniques 3

Page 4: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Meet your co-instructor and TA

� Co-instructor: Sharath R Cholleti, PhD� Co-teach some lectures

� TA: Liyue Fan, PhD student� Grading assignments and projectsOffice hours: TBA � Office hours: TBA

1/18/2011 Data Mining: Concepts and Techniques 4

Page 5: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Meet everyone in class

� Your name� Your research area� Why are you here and what you hope to get out of the class

� Something interesting to share with the classSomething interesting to share with the class

1/18/2011 Data Mining: Concepts and Techniques 5

Page 6: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Today

� Meeting everybody in class

� Course topics

� Course logistics

1/18/2011 Data Mining: Concepts and Techniques 6

Page 7: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

What the class is about

� Classical data mining algorithms and techniques� Use and implementation

� Data mining on non-traditional data� Stream data mining, graph data mining

� Data mining in new domainsPrivacy preserving data mining, distributed data mining� Privacy preserving data mining, distributed data mining

� Data warehousing� Multi-dimensional view of a database

1/18/2011 Data Mining: Concepts and Techniques 7

Page 8: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Evolution of Sciences

� Before 1600, empirical science

� Knowledge must be based on observable phenomena

� Natural science vs. social sciences

� 1600-1950s, theoretical science

� Motivate experiments and generalize our understanding (e.g. theoretical physics)

� 1950s-now, computational science

� Traditionally meant simulation (e.g. computational physics)

1/18/2011 Data Mining: Concepts and Techniques 8

� Evolving to include information management

� 1960-now, data science

� Flood of data from new scientific instruments and simulations

� Ability to economically store and manage petabytes of data online

� Accessibility of the data through the Internet and computing Grid

� Scientific information management poses Computer Science challenges: acquisition,

organization, query, analysis and visualization of the data

Jim Gray and Alex Szalay, The World Wide Telescope, Comm. ACM, 45(11): 50-54, Nov.

2002

Page 9: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Evolution of Data and Information Science

� 1960s:

� Data collection, database creation, network DBMS

� 1970s:

� Relational data model, relational DBMS implementation

� 1980s:

� RDBMS, advanced data models (extended-relational, OO, deductive, etc.)

1/18/2011 Data Mining: Concepts and Techniques 9

RDBMS, advanced data models (extended-relational, OO, deductive, etc.)

� Application-oriented DBMS (spatial, scientific, engineering, etc.)

� 1990s:

� Data mining, data warehousing, multimedia databases, and Web

databases

� 2000s

� Stream data management and mining

� Data mining and its applications

� Web technology (XML, data integration) and global information systems

� Social networks

Page 10: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

A Brief History of Data Mining Society

� 1989 IJCAI Workshop on Knowledge Discovery in Databases

� Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,

1991)

� 1991-1994 Workshops on Knowledge Discovery in Databases

� Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.

Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

January 24, 2011 Data Mining: Concepts and Techniques 10

Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

� 1995-1998 International Conferences on Knowledge Discovery in Databases

and Data Mining (KDD’95-98)

� Journal of Data Mining and Knowledge Discovery (1997)

� ACM SIGKDD conferences since 1998 and SIGKDD Explorations

� More conferences on data mining

� PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM

(2001), etc.

� ACM Transactions on KDD since 2007

1/18/2011 10Data Mining: Concepts and Techniques

Page 11: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

What Is Data Mining?

� Data mining (knowledge discovery from data)

� Extraction of interesting (non-trivial, implicit, previously unknown and

potentially useful) patterns or knowledge from huge amount of data

� Data mining really means knowledge mining

� We are drowning in data, but starving for knowledge!

� Alternative names

1/18/2011 Data Mining: Concepts and Techniques 11

� Alternative names

� Knowledge discovery (mining) in databases (KDD), knowledge

extraction, data/pattern analysis, data archeology, information

harvesting, business intelligence, etc.

� Watch out: Is everything “data mining”?

� Simple search and query processing

� Small scale data analysis

� Data dredging, data fishing, data snooping

Page 12: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Knowledge Discovery (KDD) Process

Task-relevant Data

Data Mining

Pattern Evaluation

1/18/2011 Data Mining: Concepts and Techniques 12

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Selection and

transformation

Page 13: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Data Mining: Confluence of Multiple Disciplines

Statistics

MachineLearning

Artificial Intelligence

1/18/2011 Data Mining: Concepts and Techniques 13

Data Mining

Database Technology

OtherDisciplines

Visualization

Page 14: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Motivating challenges

� Scalability: massive datasets� Search strategies, novel data structures, disk-based algorithms, parallel algorithms

� High dimensionality: biological data, temporal or spatial data

� Heterogeneous data: semi-structured text, DNA � Heterogeneous data: semi-structured text, DNA data with sequential and 3D structure, networks

� Data distribution� Distributed data mining algorithms

� Data privacy� Privacy preserving data mining algorithms

1/18/2011 Data Mining: Concepts and Techniques 14

Page 15: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Multi-Dimensional View of Data Mining

� Data View: Data to be mined

� Relational, data warehouse, transactional, stream, object-

oriented/relational, spatial, time-series, text, multi-media,

heterogeneous, legacy, WWW

� Knowledge View: Knowledge to be mined

� Association, classification, clustering, trend/deviation, outlier analysis,

etc.

1/18/2011 Data Mining: Concepts and Techniques 15

etc.

� Multiple/integrated functions and mining at multiple levels

� Method View: Techniques utilized

� Database-oriented, data warehouse (OLAP), machine learning, statistics,

visualization, etc.

� Application View: Applications adapted

� Retail, telecommunication, banking, fraud analysis, bio-data mining, stock

market analysis, text mining, Web mining, etc.

Page 16: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Data Mining: On What Kinds of Data?

� Database-oriented data sets and applications

� Relational database

� Data warehouse

� Transactional database

� Advanced data sets and applications

� Data streams and sensor data

1/18/2011 Data Mining: Concepts and Techniques 16

� Data streams and sensor data

� Time-series data, temporal data, sequence data (incl. bio-sequences)

� Graphs, social networks and multi-linked data

� Heterogeneous databases and legacy databases

� Spatial data and spatiotemporal data

� Multimedia database

� Text databases

� The World-Wide Web

Page 17: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Data Mining Functionalities

� Predictive: predict the value of a particular attribute based on the values of other attributes� Classification� Regression

Descriptive: derive patterns that

1/18/2011 Data Mining: Concepts and Techniques 17

� Descriptive: derive patterns that summarize the underlying relationships in data� Cluster analysis� Association analysis

Page 18: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Classification and prediction

� Classification: construct models (functions) that describe

and distinguish classes for future prediction

� Prediction/regression: predict unknown or missing

numerical values

� Derived models can be represented as rules, mathematical

1/18/2011 Data Mining: Concepts and Techniques 18

formulas, etc.

� Topics

� Classification: Decision tree, Bayesian classification,

Neural networks, Support vector machines, kNN

� Regression: linear and non-linear regression

� Ensemble methods

Page 19: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Classification example

1/18/2011 19Data Mining: Concepts and Techniques

Page 20: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Frequent pattern mining and association analysis

� Frequent pattern: a pattern (a set of items, subsequences,

substructures, etc.) that occurs frequently in a data set

� Frequent sequential pattern

� Frequent structured pattern

� Applications

� Basket data analysis — Beer and diapers

Web log (click stream) analysis

Data Mining: Concepts and Techniques 20

� Web log (click stream) analysis

� DNA sequence analysis

� Challenge: efficient algorithms to handle exponential size of

the search space

� Topics

� Algorithms: Apriori, Frequent pattern growth, Vertical format

� Closed and maximal patterns

� Association rules mining

1/18/2011 20Data Mining: Concepts and Techniques

Page 21: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Cluster and outlier analysis

� Cluster analysis

� Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

� Unsupervised learning (vs. supervised learning)

� Maximizing intra-class similarity & minimizing interclass similarity

� Outlier analysis

� Outlier: Data object that does not comply with the general behavior

1/18/2011 Data Mining: Concepts and Techniques 21

� Outlier: Data object that does not comply with the general behavior of the data

� Noise or exception? Useful in fraud detection, rare events analysis

� E.g. Extreme large purchase

Page 22: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Clustering Analysis

� Topics� Partitioning based clustering: k-means� Hierarchical clustering: classical, BIRCH� Density based clustering: DBSCAN� Model-based clustering: EMCluster evaluation� Cluster evaluation

� Outlier analysis

1/18/2011 22Data Mining: Concepts and Techniques

Page 23: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Course topics

� Data preprocessing

� Classical data mining (machine learning) algorithms

� Frequent itemsets mining

� Classification

1/18/2011 Data Mining: Concepts and Techniques 23

� Cluster analysis

� Emerging applications

� Graph mining and social network analysis

� Privacy preserving data mining

� biomedical data mining

� Data warehousing and data cube computation

Page 24: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Graph Pattern Mining

� Frequent subgraph mining

� Finding frequent subgraphs within a single graph

� Finding frequent (sub)graphs in a set of graphs

� Applications of graph pattern mining

Biochemical structures, XML structures or Web

January 24, 2011 Mining and Searching Graphs in Graph Databases 24

� Biochemical structures, XML structures or Web

communities

� Building blocks for graph classification, clustering,

compression, comparison, and correlation analysis

1/18/2011 24Data Mining: Concepts and Techniques

Page 25: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Example: Frequent Subgraph Mining in Chemical Compounds

GRAPH DATASET

S

OH

O

O

N

O

N

HO

ONN

January 24, 2011 Mining and Searching Graphs in Graph Databases 25

FREQUENT PATTERNS

(MIN SUPPORT IS 2)

O

NHO

O

(A) (B) (C)

ONN

O

N

(1) (2)

1/18/2011 25Data Mining: Concepts and Techniques

Page 26: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Social Network Analysis

� Actor level: centrality, prestige, etc.

� Dyadic level: distance and reachability, etc.

� Triadic level: balance and transitivityand transitivity

� Subset level: cliques, cohesive subgroups, components

� Network level: connectedness, diameter, centralization, density, etc.

January 24, 2011 Social network analysis: methods and applications 261/18/2011 26Data Mining: Concepts and Techniques

Page 27: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Actor Centrality Example

� Degree Centrality� Betweenness

Centrality� Closeness

Centrality� Eigenvector

centrality

January 24, 2011 OrgNet.com 27

� Eigenvector centrality

1/18/2011 27Data Mining: Concepts and Techniques

Page 28: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Link Analysis on WWW

� Ranking algorithms� PageRank� HITS

1/18/2011 28Data Mining: Concepts and Techniques

Page 29: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Influence and Diffusion

1/18/2011 29CDC: Spread of Airborne DiseaseData Mining: Concepts and Techniques

Page 30: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Data mining is all good, but what about privacy?

� Privacy is not only a concern but a phenomenon� AOL data release� Netflix challenge

� Topics: algorithms that allow data mining while preserving individual information

Perturbation� Perturbation� Generalization

� Challenge: tradeoff between privacy, accuracy, and efficiency

1/18/2011 Data Mining: Concepts and Techniques 30

Page 31: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

A Face is exposed for AOL searcher No. 4417749

� 20 million Web search queries by AOL (650k~ users)

� User 4417749� “numb fingers”, � “60 single men”� “dog that urinates on everything”� “landscapers in Lilburn, Ga”� Several people names with last name Arnold� Several people names with last name Arnold� “homes sold in shadow lake subdivision gwinnett

county georgia”

Thelma Arnold, a 62-year-old

widow who lives in Lilburn,

Ga., frequently researches

her friends’ medical ailments

and loves her dogs

1/18/2011 31Data Mining: Concepts and Techniques

Page 32: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Course topics

� Data preprocessing

� Classical data mining (machine learning) algorithms

� Frequent itemsets mining

� Classification

1/18/2011 Data Mining: Concepts and Techniques 32

� Cluster analysis

� Emerging applications

� Graph mining and social network analysis

� Privacy preserving data mining

� biomedical data mining

� Data warehousing and data cube computation

Page 33: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

What is a Data Warehouse?

� “A data warehouse is a subject-oriented, integrated, time-variant,

and nonvolatile collection of data in support of management’s

decision-making process.”—W. H. Inmon

� Key aspects

A decision support database that is maintained separately from

1/18/2011 Data Mining: Concepts and Techniques 33

� A decision support database that is maintained separately from

the organization’s operational database

� Support OLAP (vs. OLTP)

� Data warehousing:

� The process of constructing and using data warehouses

Page 34: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Multi Dimensional View: From Tables to Data Cubes

� A data warehouse is based on a multidimensional data model which

views data in the form of a data cube

� A data cube, such as sales, allows data to be modeled and viewed in

multiple dimensions

1/18/2011 Data Mining: Concepts and Techniques 34

Item

Time

Page 35: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Cube: A Lattice of Cuboids

all

time item location supplier

0-D(apex) cuboid

1-D cuboids

� An n-D base cube is called a base cuboid. The top most 0-D cuboid, which

holds the highest-level of summarization, is called the apex cuboid. The

lattice of cuboids forms a data cube.

1/18/2011 Data Mining: Concepts and Techniques 35

time,item

time,item,location

time, item, location, supplier

time item location supplier

time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,supplier

time,location,supplier

item,location,supplier

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

Page 36: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Course topics

� Data preprocessing

� Classical data mining (machine learning) algorithms

� Frequent itemsets mining

� Classification

January 24, 2011 Data Mining: Concepts and Techniques 36

� Cluster analysis

� Emerging applications

� Graph mining and social network analysis

� Privacy preserving data mining

� biomedical data mining

� Data warehousing and data cube computation

1/18/2011 36Data Mining: Concepts and Techniques

Page 37: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Top-10 Most Popular DM Algorithms:18 Identified Candidates (I)

� Classification

� #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann., 1993.

� #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.

� #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)

#4. Naive Bayes Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid

1/18/2011 Data Mining: Concepts and Techniques 37

� #4. Naive Bayes Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.

� Statistical Learning

� #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.

� #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York. Association Analysis

� #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.

� #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.

Page 38: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

The 18 Identified Candidates (II)

� Link Mining

� #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.

� #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.

� Clustering

#11. K-Means: MacQueen, J. B., Some methods for classification

1/18/2011 Data Mining: Concepts and Techniques 38

� #11. K-Means: MacQueen, J. B., Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.

� #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.

� Bagging and Boosting

� #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.

Page 39: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

The 18 Identified Candidates (III)

� Sequential Patterns

� #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.

� #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.

1/18/2011 Data Mining: Concepts and Techniques 39

Prefix-Projected Pattern Growth. In ICDE '01.

� Integrated Mining

� #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.

� Rough Sets

� #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992

� Graph Mining

� #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

Page 40: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Top-10 Algorithm Selected at ICDM’06

� #1: C4.5 Decision Tree - Classification (61 votes)

� #2: K-Means - Clustering (60 votes)

� #3: SVM – Classification (58 votes)

� #4: Apriori - Frequent Itemsets (52 votes)

#5: EM – Clustering (48 votes)

1/18/2011 Data Mining: Concepts and Techniques 40

� #5: EM – Clustering (48 votes)

� #6: PageRank – Link mining (46 votes)

� #7: AdaBoost – Boosting (45 votes)

� #7: kNN – Classification (45 votes)

� #7: Naive Bayes – Classification (45 votes)

� #10: CART – Classification (34 votes)

Page 41: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Today

� Meeting everybody in class

� Course topics

� Course logistics

1/18/2011 Data Mining: Concepts and Techniques 41

Page 42: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Textbook

� Data mining: concepts and techniques. J. Han and M. Kamber. Second edition

1/18/2011 Data Mining: Concepts and Techniques 42

Page 43: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Other Reference Books

� S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan

Kaufmann, 2002

� R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000

� T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003

� U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and

Data Mining. AAAI/MIT Press, 1996

� U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge

Discovery, Morgan Kaufmann, 2001

1/18/2011 Data Mining: Concepts and Techniques 43

Discovery, Morgan Kaufmann, 2001

� D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001

� T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,

and Prediction, Springer-Verlag, 2001

� B. Liu, Web Data Mining, Springer 2006.

� T. M. Mitchell, Machine Learning, McGraw Hill, 1997

� G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991

� P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005

� S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998

� I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java

Implementations, Morgan Kaufmann, 2nd ed. 2005

Page 44: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Conferences and Journals on Data Mining

� KDD Conferences

� ACM SIGKDD Conf. on Knowledge Discovery in Databases and Data Mining (KDD)

� SIAM Data Mining Conf. (SDM)

� (IEEE) Conf. on Data Mining (ICDM)

� Other related conferences

� ACM SIGMOD

� VLDB

� (IEEE) ICDE

� WWW, SIGIR

ICML, CVPR, NIPS

January 24, 2011 Data Mining: Concepts and Techniques 44

(ICDM)

� Conf. on Principles and practices of Knowledge Discovery and Data Mining (PKDD)

� Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)

� ACM Conference on Applied Computing(SAC), Data Mining track

� ICML, CVPR, NIPS

� Journals

� Data Mining and Knowledge

Discovery (DAMI or DMKD)

� IEEE Trans. On Knowledge

and Data Eng. (TKDE)

� KDD Explorations

� ACM Trans. on KDD

1/18/2011 44Data Mining: Concepts and Techniques

Page 45: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Resources (Google, DBLP, conferences)

� Data mining and KDD (SIGKDD: CDROM)� Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.

� Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD

� Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)� Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA

� Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.

� AI & Machine Learning� Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.

1/18/2011 Data Mining: Concepts and Techniques 45

� Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.

� Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.

� Web and IR� Conferences: SIGIR, WWW, CIKM, etc.

� Journals: WWW: Internet and Web Information Systems,

� Statistics� Conferences: Joint Stat. Meeting, etc.

� Journals: Annals of statistics, etc.

� Visualization� Conference proceedings: CHI, ACM-SIGGraph, etc.

� Journals: IEEE Trans. visualization and computer graphics, etc.

Page 46: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Workload

� 2-3 programming assignments (individual)� Implementation of classical algorithms and competition!

� 2-3 written/reading assignments� 1 paper presentation� 1 open-ended course project (team of up to 2 students) with project presentation

� 1 open-ended course project (team of up to 2 students) with project presentation� Application and evaluation of existing algorithms to interesting data

� Design of new algorithms to solve new problems� Survey of a class of algorithms

� 1 midterm� No final exam

1/18/2011 46Data Mining: Concepts and Techniques

Page 47: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Grading

� Assignments/presentations 40� Final project 30� Midterm 30

1/18/2011 47Data Mining: Concepts and Techniques

Page 48: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Late Policy

� Late assignment will be accepted within 3 days of the due date and penalized 10% per day

� 1 late assignment allowance, can be used to turn in a single late assignment within 3 days of the due date without penalty. days of the due date without penalty.

1/18/2011 48Data Mining: Concepts and Techniques

Page 49: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/01_intro.pdf · Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky -Shapiro,

Today

� Meeting everybody in class

� Course topics

� Course logistics

1/18/2011 Data Mining: Concepts and Techniques 49

� Next lecture: data preprocessing