Page 1:

Introduction

Page 2:

Motivation: Business Intelligence


Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)

Product information (Product-id, category, manufacturer, made-in, stock-price, …)

Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)

Business queries:

Page 3:

Techniques: Business Intelligence

•  Multidimensional data analysis
•  Online query answering
•  Interactive data exploration

Page 4:

Motivation: Store Layout Design


http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

Page 5:

Techniques: Store Layout Design

•  Customer purchase patterns
•  Business strategies

Page 6:

Motivation: Community Detection


http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811

Page 7:

Techniques: Community Detection

•  Similarity between objects
•  Partitioning objects into groups
   – No guidance about what a group is

Page 8:

Motivation: Disease Prediction


Symptoms: overweight, high blood pressure, back pain, short of breath, chest pain, cold sweat …

What medical problems does this patient have?

Page 9:

Techniques: Disease Prediction

•  Features
•  Model

Page 10:

Motivation: Fraud Detection


http://i.imgur.com/ckkoAOp.gif

Page 11:

Techniques: Fraud Detection

•  Features
•  Dissimilarity
•  Groups and noise


http://i.stack.imgur.com/tRDGU.png

Page 12:

What Is Data Science About?

•  Data
•  Extraction of knowledge from data
•  Continuation of data mining and knowledge discovery from data (KDD)

Page 13:

What Is Data?

•  Values of qualitative or quantitative variables belonging to a set of items
•  Represented in a structure, e.g., tabular, tree or graph structure
•  Typically the results of measurements
•  As an abstract concept, can be viewed as the lowest level of abstraction from which information and then knowledge are derived

Page 14:

What Is Information?

•  “Knowledge communicated or received concerning a particular fact or circumstance”
•  Conceptually, information is the message (utterance or expression) being conveyed
•  Cannot be predicted
•  Can resolve uncertainty

Page 15:

What Is Knowledge?

•  Familiarity with someone or something, which can include facts, information, descriptions, or skills acquired through experience or education
•  Implicit knowledge: practical skill or expertise
•  Explicit knowledge: theoretical understanding of a subject

Page 16:

Data Systems

•  A data system answers queries based on data acquired in the past

•  Base data – the rawest data not derived from anywhere else

•  Knowledge – information derived from the base data

Page 17:

Dealing with Data – Querying

•  Given a set of student records about name, age, courses taken and grades
•  Simple queries
   – What is John Doe’s age?
•  Aggregate queries
   – What is the average GPA of all students at this school?
•  Queries can be arbitrarily complicated
   – Find the students X and Y whose grades are less than 3% apart in as many courses as possible

Page 18:

Queries

•  A precise request for information
•  Subjects in databases and information retrieval
   – Databases: structured queries on structured (e.g., relational) data
   – Information retrieval: unstructured queries on unstructured (e.g., text, image) data
•  Important assumptions
   – Information needs
   – Query languages

Page 19:

Data-driven Exploration

•  What should be the next strategy of a company?
   – A lot of data: sales, human resource, production, tax, service cost, …
•  The question cannot be translated into a precise request for information (i.e., a query)
•  Developing familiarity (knowledge) and actionable items (decisions) by interactively analyzing data

Page 20:

Data-driven Thinking

•  Starting with some simple queries
•  New queries are raised by consuming the results of previous queries
•  No ultimate query in design!
   – But many queries can be answered using DB/IR techniques

Page 21:

The Art of Data-driven Thinking

•  The way of generating queries remains an art!
   – Different people may derive different results using the same data

“If you torture the data long enough, it will confess” – Ronald H. Coase

•  More often than not, more data may be needed – datafication

Page 22:

Queries for Data-driven Thinking

•  Probe queries – finding information about specific individuals
•  Aggregation – finding information about groups
•  Pattern finding – finding commonality in population
•  Association and correlation – finding connections among individuals and groups
•  Causality analysis – finding causes and consequences

Page 23:

What Is Data Mining?

•  Broader sense: the art of data-driven thinking
•  Technical sense: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
   – Methods and tools for answering various types of queries in the data mining process in the broader sense

Page 24:

Machine Learning

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” – Tom M. Mitchell

•  Essentially, learn the distribution of data

Page 25:

Data mining vs. Machine Learning

•  Machine learning focuses on prediction, based on known properties learned from the training data
•  Data mining focuses on the discovery of (previously) unknown properties in the data

Page 26:

The KDD Process

Data → (Selection) → Target data → (Preprocessing) → Preprocessed data → (Transformation) → Transformed data → (Data mining) → Patterns → (Interpretation/Evaluation) → Knowledge

Page 27:

Data Mining R&D

•  New problem identification
•  Data collection and transformation
•  Algorithm design and implementation
•  Evaluation
   – Effectiveness evaluation
   – Efficiency & scalability evaluation

•  Deployment and business solution

Page 28:

Data Mining on Big Data

“Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it”

– Hal Varian, Google’s Chief Economist

Page 29:

What Is Big Data?

•  No quantitative definition!
•  “Big data is like teenage sex:
   – everyone talks about it,
   – nobody really knows how to do it,
   – everyone thinks everyone else is doing it,
   – so everyone claims they are doing it...”

– Dan Ariely

Page 30:

Data Volume vs. Storage Cost

•  The unit cost of disk storage decreases dramatically


Year   Unit cost
1956   $10,000/MB
1980   $193/MB
1990   $9/MB
2000   $6.9/GB
2010   $0.08/GB
2013   $0.06/GB

http://ns1758.ca/winch/winchest.html

Page 31:

Big Data – Volume

“Data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time”

— Wikipedia

Page 32:

Big Data: Volume

•  Every day, about 7 billion shares change hands on US equity markets
   – About 2/3 is traded by computer algorithms based on huge amounts of data to predict gains and risk
•  In Q2 2015
   – Facebook has 1.49 billion active users
   – Wechat has 600 million active users, 100 million outside China
   – LinkedIn has 380 million active users
   – Twitter has 304 million active users

Page 33:

Velocity

•  Google processes 24+ petabytes of data per day

•  Facebook gets 10+ million new photos uploaded every hour

•  Facebook members like or leave a comment 3+ billion times per day

•  YouTube users upload 1+ hour of video every second

•  400+ million tweets per day

Page 34:

What Has Been Changed?

•  The 1880 census in the US took 8 years to complete
   – The 1890 census would have needed 13 years – using punch cards, it was reduced to less than 1 year
•  It is essential to get not only accurate but also timely data
   – Statisticians use sampling to estimate
•  Recently, with new technologies, the ways of data collection and transmission have been fundamentally changed

Page 35:

Sampling for Volume/Velocity?

•  Sampling idea: the marginal new information brought by a larger amount of data shrinks quickly
   – The sample should be truly random
•  On a data set of hundreds or thousands of attributes, can sampling help in
   – Finding subcategories of attribute combinations
   – Finding outliers and exceptions
•  Big data contains signals of different strengths
   – Not noise, but weaker and weaker signals that may still be interesting and important

Page 36:

Big Data – Lytro Pictures

•  Lytro pictures record the whole light field
   – Photographers can decide later which parts to focus on
•  Big data tries to record as much information as possible
   – Analysts can decide later what to extract from big data
   – Both advantages and challenges

Page 37:

Veracity

•  “1 in 3 business leaders don't trust the information they use to make decisions”

•  Assuming a slowly growing total cost budget, tradeoff between data volume and data quality

•  Loss of veracity in combining different types of information from different sources

•  Loss of veracity in data extraction, transformation, and processing

Page 38:

Variety

•  Integrating data capturing different aspects of a data object
   – Vancouver Canucks: game video, technical statistics, social media, …
   – Different pieces are in different formats
•  Different views of the same data object from different sources
   – Did the soccer ball pass the goal line?
   – The views may not be consistent

Page 39:

Four V-challenges

•  Volume: massive scale and growth, 40% per year in global data generated

•  Velocity: real time data generation and consumption

•  Variety: heterogeneous data, mainly unstructured or semi-structured, from many sources

•  Veracity

Page 40:

Is Big Data Really New?

•  People were aware of the existence of big data long ago, but no one could access it until very recently
   – (Genesis 28:15) “I am with you and will watch over you wherever you go”
   – “Private whispers in a closed room are heard by Heaven like thunder; a deceitful heart in a dark room is seen by the gods like lightning; the retribution of good and evil follows like a shadow” (a Chinese proverb)
   – Similar statements in the Quran and the Sutras
•  What has been changed?
   – How data is connected with people

Page 41:

Diversity in Data Usage

•  In the past, only very few projects could afford to be data-intensive

•  Nowadays, a great many applications are (naturally) data-intensive

Page 42:

Datafication

•  Extract data about an object or event in a quantified way so that it can be analyzed
   – Different from digitalization
•  An important feature of big data
•  Key: new data, new applications, new opportunities

Page 43:

New Values of Datafication

•  Example: Captcha and ReCaptcha (Luis von Ahn)
•  How to create new values of data and datafication?
   – Connecting data with new users
   – Connecting different pieces of data to present a bigger picture
•  Important techniques
   – Data aggregation
   – Extended datafication

Page 44:

Big Data Players

•  Data holders
•  Data specialists
•  Big-data mindset leaders
•  A capable company may play 2 or 3 roles at the same time
•  What is most important: big-data mindset, skills, or data itself?

Page 45:

Privacy

•  “… big data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace”

— Executive Office of the (US) President

Page 46:

Keep in Mind

“Our industry does not respect tradition – it only respects innovation.”

– Satya Nadella

Page 47:

Goals of This Course

•  Data-driven thinking – towards being a (big) data scientist
•  Principles and hands-on skills of data mining, particularly in the context of big data
   – Identifying new data mining problems
   – Data mining algorithm design
   – Data mining applications
•  Novel problems for upcoming research

Page 48:

Format

•  Due to the fast progress in data mining, we will go beyond the textbook substantially
•  Active classroom discussion
•  Open questions and brainstorming
•  Textbook: Data Mining – Concepts and Techniques (3rd ed)

Page 49:

Read – Try – Think

•  Reading
   – (required) Textbook and a small number of research papers
   – You have to have the 3rd ed of the textbook!
   – (open-ended, not covered by the exam) Technical and non-technical materials
•  Trying
   – Assignments and a project
•  Thinking
   – Examine everything from a data scientist’s angle, starting today

Page 50:

Data Mining: History

•  1989 IJCAI Workshop on Knowledge Discovery in Databases
   – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
•  1991–1994 Workshops on Knowledge Discovery in Databases
   – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

Page 51:

Data Mining: History (cont’d)

•  1995–1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95–98)
   – Journal of Data Mining and Knowledge Discovery (1997)
•  ACM SIGKDD conferences since 1998 and SIGKDD Explorations
•  More conferences on data mining
   – PAKDD (1997), PKDD (1997), SIAM Data Mining (2001), (IEEE) ICDM (2001), etc.
•  ACM Transactions on KDD starting in 2007

Page 52:

Frequent Pattern Mining

Page 53:

How Many Words Is a Picture Worth?


E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013

Page 54:

Burnt or Burned?


E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013

Page 55:

Store Layout Design


http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

Page 56:

Transaction Data

•  Alphabet: a set of items
   – Example: all products sold in a store
•  A transaction: a set of items involved in an activity
   – Example: the items purchased by a customer in a visit
•  Other information is often associated
   – Timestamp, price, salesperson, customer-id, store-id, …

Page 57:

Examples of Transaction Data


Page 58:

How to Store Transaction Data?

•  Transactions with transaction-ids: (t123: a, b, c), (t236: b, d)
•  Relational storage
     Tid    Item
     t123   a
     t123   b
     t123   c
     …      …
     t236   b
     t236   d
•  Transaction-based storage
•  Item-based (vertical) storage
   – Item a: …, t123, …
   – Item b: …, t123, …, t236, …
   – …

Page 59:

Transaction Data Analysis

•  Transactions: customers’ purchases of commodities
   – {bread, milk, cheese} if they are bought together
•  Frequent patterns: product combinations that are frequently purchased together by customers
•  Frequent patterns: patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]

Page 60:

Why Frequent Patterns?

•  What products were often purchased together?

•  What are the frequent subsequent purchases after buying an iPod?

•  What kinds of genes are sensitive to this new drug?

•  What key-word combinations are frequently associated with web pages about game-evaluation?

Page 61:

Why Frequent Pattern Mining?

•  Foundation for many data mining tasks
   – Association rules, correlation, causality, sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …
•  Broad applications
   – Basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click stream) analysis, …

Page 62:

Frequent Itemsets

•  Itemset: a set of items
   – E.g., acm = {a, c, m}
•  Support of an itemset: the number of transactions containing it
   – Sup(acm) = 3
•  Given min_sup = 3, acm is a frequent pattern
•  Frequent pattern mining: finding all frequent patterns in a database

Transaction database TDB
     TID   Items bought
     100   f, a, c, d, g, i, m, p
     200   a, b, c, f, l, m, o
     300   b, f, h, j, o
     400   b, c, k, s, p
     500   a, f, c, e, l, p, m, n

Page 63:

A Naïve Attempt

•  Generate all possible itemsets, then test their supports against the database
•  How to hold a large number of itemsets in main memory?
   – 100 items → 2^100 − 1 possible itemsets
•  How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions?
   – A transaction of length 20 needs to update the support of 2^20 − 1 = 1,048,575 itemsets

Page 64:

Transactions in Real Applications

•  A large department store often carries more than 100 thousand different kinds of items
   – Amazon.com carries more than 17,000 books relevant to data mining
•  Walmart has more than 20 million transactions per day; AT&T produces more than 275 million calls per day
•  Mining large transaction databases of many items is a real demand

Page 65:

How to Get an Efficient Method?

•  Reducing the number of itemsets that need to be checked

•  Checking the supports of selected itemsets efficiently

Page 66:

Candidate Generation & Test

•  Any subset of a frequent itemset must also be frequent – an anti-monotonic property
   – A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
   – {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
•  In other words, any superset of an infrequent itemset must also be infrequent
   – No superset of any infrequent itemset should be generated or tested
   – Many item combinations can be pruned!

Page 67:

Apriori-Based Mining

•  Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
•  Test the candidates against the DB

Page 68:

The Apriori Algorithm [AgSr94]

Database D (min_sup = 2):
     TID   Items
     10    a, c, d
     20    b, c, e
     30    a, b, c, e
     40    b, e

Scan D → 1-candidates and their supports: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3

2-candidates: ab, ac, ae, bc, be, ce
Scan D → counting: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2

3-candidates: bce
Scan D → counting: bce:2
Frequent 3-itemsets: bce:2

Page 69:

The Apriori Algorithm

Level-wise, candidate generation and test
•  Ck: candidate itemsets of size k
•  Lk: frequent itemsets of size k

•  L1 = {frequent items};
•  for (k = 1; Lk ≠ ∅; k++) do
   – Ck+1 = candidates generated from Lk   // candidate generation
   – for each transaction t in the database, increment the count of all candidates in Ck+1 that are contained in t   // test
   – Lk+1 = candidates in Ck+1 with min_support
•  return ∪k Lk;

Page 70:

Important Steps in Apriori

•  How to find frequent 1- and 2-itemsets?
•  How to generate candidates?
   – Step 1: self-joining Lk
   – Step 2: pruning
•  How to count supports of candidates?

Page 71:

Finding Frequent 1- & 2-itemsets

•  Finding frequent 1-itemsets (i.e., frequent items) using a one-dimensional array
   – Initialize c[item] = 0 for each item
   – For each transaction T, for each item in T, c[item]++
   – If c[item] >= min_sup, item is frequent
•  Finding frequent 2-itemsets using a 2-dimensional triangle matrix
   – For items i, j (i < j), c[i, j] is the count of itemset ij

Page 72:

Counting Array

•  A 2-dimensional triangle matrix can be implemented using a 1-dimensional array
•  With n items, for items i, j (i < j): c[i, j] = c[(i−1)(2n−i)/2 + j − i]
   – Example (n = 5): c[3, 5] = c[(3−1)(2·5−3)/2 + 5−3] = c[9]

Layout of the upper triangle for n = 5 (array slots 1–10, row i, column j):
          j=2  j=3  j=4  j=5
     i=1   1    2    3    4
     i=2        5    6    7
     i=3             8    9
     i=4                 10
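A sketch of this index arithmetic in Python (names are mine); the assertion reproduces the c[3, 5] example above:

```python
def tri_index(i, j, n):
    """Slot of pair (i, j), 1 <= i < j <= n, in the 1-dimensional
    array that stores the upper triangle row by row (1-based slots)."""
    return (i - 1) * (2 * n - i) // 2 + (j - i)

# Matches the slide's example: with n = 5 items, c[3,5] lives in slot 9
assert tri_index(3, 5, 5) == 9

def count_pairs(transactions, frequent_items):
    """One database scan counts every 2-itemset over the frequent items
    (items are renumbered 1..n in sorted order; a sketch)."""
    items = sorted(frequent_items)
    n = len(items)
    rank = {x: r + 1 for r, x in enumerate(items)}
    c = [0] * (n * (n - 1) // 2 + 1)          # slot 0 unused
    for t in transactions:
        ids = sorted(rank[x] for x in t if x in rank)
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                c[tri_index(ids[a], ids[b], n)] += 1
    return c
```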

Page 73:

Example of Candidate-generation

•  L3 = {abc, abd, acd, ace, bcd}
•  Self-joining: L3 * L3
   – abcd ← abc * abd
   – acde ← acd * ace
•  Pruning:
   – acde is removed because ade is not in L3
•  C4 = {abcd}

Page 74:

How to Generate Candidates?

•  Suppose the items in Lk−1 are listed in an order
•  Step 1: self-join Lk−1

   INSERT INTO Ck
   SELECT p.item1, p.item2, …, p.itemk−1, q.itemk−1
   FROM Lk−1 p, Lk−1 q
   WHERE p.item1 = q.item1, …, p.itemk−2 = q.itemk−2, p.itemk−1 < q.itemk−1

•  Step 2: pruning
   – For each itemset c in Ck do
      •  For each (k−1)-subset s of c: if s is not in Lk−1, then delete c from Ck
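The same prefix join and prune, sketched in Python (itemsets as sorted tuples; names are mine); the example reproduces the L3 case from the previous slide:

```python
from itertools import combinations

def gen_candidates(Lk_minus_1):
    """Prefix self-join + prune, mirroring the SQL above: two
    (k-1)-itemsets join iff they share the first k-2 items."""
    Lprev = set(Lk_minus_1)
    Ck = set()
    for p in Lprev:
        for q in Lprev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:     # self-join step
                cand = p + (q[-1],)
                # pruning step: every (k-1)-subset must be frequent
                if all(s in Lprev for s in combinations(cand, len(cand) - 1)):
                    Ck.add(cand)
    return Ck

L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(gen_candidates(L3))   # {('a','b','c','d')}; acde is pruned via ade
```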

Page 75:

How to Count Supports?

•  Why is counting supports of candidates a problem?
   – The total number of candidates can be very huge
   – One transaction may contain many candidates
•  Method
   – Candidate itemsets are stored in a hash-tree
   – A leaf node of the hash-tree contains a list of itemsets and counts
   – An interior node contains a hash table
   – Subset function: finds all the candidates contained in a transaction
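The hash-tree itself is more involved; as a simplified stand-in, a sketch that counts candidates with a plain hash table, enumerating the k-subsets of each transaction in place of the subset function (names are mine):

```python
from itertools import combinations

def count_supports(transactions, Ck, k):
    """Count candidate k-itemsets (sorted tuples) in one database scan."""
    counts = {c: 0 for c in Ck}
    for t in transactions:
        if len(t) < k:
            continue
        # enumerate the k-subsets of the transaction; each one that is
        # a candidate gets its count incremented
        for s in combinations(sorted(t), k):
            if s in counts:
                counts[s] += 1
    return counts
```

For a transaction of length m this enumerates C(m, k) subsets, which is exactly why the hash-tree matters on long transactions: it only walks the branches a transaction can actually reach.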

Page 76:

Example: Counting Supports

[Figure: a hash-tree over candidate 3-itemsets. The hash function sends items 1,4,7 / 2,5,8 / 3,6,9 to three branches; leaves hold candidates such as {1,4,5}, {1,3,6}, {1,2,4}, {3,4,5}, {3,6,7}. The subset function enumerates transaction 1 2 3 5 6 as 1 + {2,3,5,6}, 1 2 + {3,5,6}, 1 3 + {5,6}, …, following only the branches the transaction can reach.]

Page 77:

Association Rules

•  Rule: c → am
•  Support: 3 (i.e., the support of acm)
•  Confidence: 75% (i.e., sup(acm) / sup(c))
•  Given a minimum support threshold and a minimum confidence threshold, find all association rules whose support and confidence pass the thresholds

Transaction database TDB
     TID   Items bought
     100   f, a, c, d, g, i, m, p
     200   a, b, c, f, l, m, o
     300   b, f, h, j, o
     400   b, c, k, s, p
     500   a, f, c, e, l, p, m, n
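The confidence arithmetic, as a small sketch (names are mine; supports are taken from the TDB above, where sup(c) = 4 and sup(acm) = 3):

```python
def rule_confidence(sup, antecedent, consequent):
    """Confidence of antecedent -> consequent given an itemset-support
    table, e.g. conf(c -> am) = sup(acm) / sup(c)."""
    whole = frozenset(antecedent) | frozenset(consequent)
    return sup[whole] / sup[frozenset(antecedent)]

sup = {frozenset('c'): 4, frozenset('acm'): 3}
print(rule_confidence(sup, 'c', 'am'))   # 0.75, i.e., 75% as on the slide
```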

Page 78:

Challenges of Freq Pat Mining

•  Multiple scans of the transaction database
•  A huge number of candidates
•  Tedious workload of support counting for candidates

Page 79:

Improving Apriori: Ideas

•  Reducing the number of transaction database scans
•  Shrinking the number of candidates
•  Facilitating support counting of candidates

Page 80:

Bottleneck of Freq Pattern Mining

•  Multiple database scans are costly
•  Mining long patterns needs many scans and generates many candidates
   – To find the frequent itemset i1i2…i100
      •  # of scans: 100
      •  # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
   – Bottleneck: candidate generation and test
•  Can we avoid candidate generation?

Page 81:

Search Space of Freq. Pat. Mining

•  Itemsets form a lattice

Itemset lattice:
     {}
     A    B    C    D
     AB   AC   AD   BC   BD   CD
     ABC  ABD  ACD  BCD
     ABCD

Page 82:

Set Enumeration Tree

•  Use an order on items, enumerate itemsets in lexicographic order
   – a, ab, abc, abcd, abd, ac, acd, ad, b, bc, bcd, bd, c, cd, d
•  Reduce a lattice to a tree

Set enumeration tree:
     ∅
     a    b    c    d
     ab   ac   ad   bc   bd   cd
     abc  abd  acd  bcd
     abcd

Page 83:

Borders of Frequent Itemsets

•  Frequent itemsets are connected
   – ∅ is trivially frequent
   – X on the border → every subset of X is frequent

     a    b    c    d
     ab   ac   ad   bc   bd   cd
     abc  abd  acd  bcd
     abcd

Page 84:

Projected Databases

•  To test whether Xy is frequent, we can use the X-projected database
   – The sub-database of the transactions containing X
   – Check whether item y is frequent in the X-projected database
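A direct sketch of this test in Python (function names are mine; real implementations keep only the needed suffixes rather than whole transactions):

```python
def projected_db(transactions, X):
    """The X-projected database: all transactions containing itemset X."""
    X = frozenset(X)
    return [t for t in transactions if X <= frozenset(t)]

def is_frequent_extension(transactions, X, y, min_sup):
    """Is Xy frequent?  Count item y inside TDB|X."""
    return sum(1 for t in projected_db(transactions, X) if y in t) >= min_sup

# On the TDB used in these slides: {a,m} occurs in 3 transactions,
# and all of them also contain c, so acm is frequent at min_sup = 3
tdb = ['facdgimp', 'abcflmo', 'bfhjo', 'bcksp', 'afcelpmn']
print(is_frequent_extension(tdb, 'am', 'c', 3))   # True
```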

Page 85:

Compress Database by FP-tree

•  The 1st scan: find frequent items
   – Only record frequent items in the FP-tree
   – F-list: f-c-a-b-m-p
•  The 2nd scan: construct the tree
   – Order the frequent items in each transaction w.r.t. the f-list
   – Explore sharing among transactions

     TID   Items bought              (ordered) frequent items
     100   f, a, c, d, g, i, m, p    f, c, a, m, p
     200   a, b, c, f, l, m, o       f, c, a, b, m
     300   b, f, h, j, o             f, b
     400   b, c, k, s, p             c, b, p
     500   a, f, c, e, l, p, m, n    f, c, a, m, p

The resulting FP-tree (header table: f, c, a, b, m, p):
     root
     ├── f:4
     │   ├── c:3
     │   │   └── a:3
     │   │       ├── m:2
     │   │       │   └── p:2
     │   │       └── b:1
     │   │           └── m:1
     │   └── b:1
     └── c:1
         └── b:1
             └── p:1
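A compact sketch of the two scans (the header table and node-links a full implementation needs for mining are omitted; names are illustrative):

```python
from collections import Counter

class FPNode:
    """One FP-tree node: an item, a count, and child links."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_sup):
    # 1st scan: frequent items in frequency-descending order (the f-list)
    freq = Counter(i for t in transactions for i in t)
    flist = [i for i, c in freq.most_common() if c >= min_sup]
    order = {i: r for r, i in enumerate(flist)}
    # 2nd scan: insert each transaction with its items ordered by the
    # f-list, so common prefixes are shared among transactions
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, flist

# The slide's TDB with min_sup = 3; ties among count-3 items may order
# differently from the slide's f-c-a-b-m-p, which is equally valid
tdb = ['facdgimp', 'abcflmo', 'bfhjo', 'bcksp', 'afcelpmn']
root, flist = build_fp_tree(tdb, 3)
print(flist)
```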

Page 86:

Benefits of FP-tree

•  Completeness
   – Never breaks a long pattern in any transaction
   – Preserves complete information for frequent pattern mining
      •  No need to scan the database anymore
•  Compactness
   – Reduces irrelevant info — infrequent items are removed
   – Items in frequency-descending order (f-list): the more frequently occurring, the more likely to be shared
   – Never larger than the original database (not counting node-links and the count fields)

Page 87:

Partitioning Frequent Patterns

•  Frequent patterns can be partitioned into subsets according to the f-list f-c-a-b-m-p
   – Patterns containing p
   – Patterns having m but no p
   – …
   – Patterns having c but none of a, b, m, or p
   – Pattern f
•  Depth-first search of a set enumeration tree
   – The partitioning is complete and does not have any overlap

Page 88:

Find Patterns Having Item “p”

•  Only transactions containing p are needed
•  Form the p-projected database
   – Starting at entry p of the header table
   – Follow the side-link of frequent item p
   – Accumulate all transformed prefix paths of p

[FP-tree and header table as on the previous slide]

p-projected database TDB|p: fcam: 2, cb: 1
Local frequent item: c:3
Frequent patterns containing p: p: 3, pc: 3

Page 89:

Find Patterns Having Item m But No p

•  Form the m-projected database TDB|m
   – Item p is excluded (why?)
   – Contains fca: 2, fcab: 1
   – Local frequent items: f, c, a
•  Build the FP-tree for TDB|m

[Full FP-tree and header table as on the previous slides]

m-projected FP-tree (header table: f, c, a):
     root → f:3 → c:3 → a:3

Page 90:

Recursive Mining

•  Patterns having m but no p can be mined recursively
•  Optimization: enumerate patterns from a single-branch FP-tree
   – Enumerate all combinations
   – Support = that of the last item
      •  m, fm, cm, am
      •  fcm, fam, cam
      •  fcam

m-projected FP-tree (header table: f, c, a):
     root → f:3 → c:3 → a:3

Page 91:

Enumerate Patterns From a Single Prefix of FP-tree

•  A (projected) FP-tree may have a single prefix
   – Reduce the single prefix into one node
   – Join the mining results of the two parts

[Figure: an FP-tree whose single prefix path a1:n1 → a2:n2 → a3:n3 branches into subtrees b1:m1 and c1:k1, c2:k2, c3:k3; the tree is split into the prefix part (reduced to a single node r1) and the branching part r, and the mining results of the two parts are joined]

Page 92:

FP-growth

•  Pattern-growth: recursively grow frequent patterns by pattern and database partitioning
•  Algorithm
   – For each frequent item, construct its projected database, and then its projected FP-tree
   – Repeat the process on each newly created projected FP-tree
   – Until the resulting FP-tree is empty, or contains only one path – a single path generates all the combinations, each of which is a frequent pattern
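A hedged sketch of the same divide-and-conquer recursion, using plain projected databases instead of conditional FP-trees (a real FP-growth runs this recursion over conditional FP-trees; here items are partitioned in alphabetical order for simplicity, playing the role of the f-list, and names are mine):

```python
from collections import Counter

def pattern_growth(transactions, min_sup, prefix=()):
    """Recursively grow frequent patterns over (conditional) projected
    databases; the input is already the prefix-projected database."""
    freq = Counter(i for t in transactions for i in set(t))
    items = sorted(i for i, c in freq.items() if c >= min_sup)
    results = {}
    # partition the search: patterns ending at item i use only items
    # after i, so every pattern is generated exactly once
    for idx, item in enumerate(items):
        pattern = prefix + (item,)
        results[pattern] = freq[item]
        proj = [[j for j in t if j in items[idx + 1:]]
                for t in transactions if item in t]
        results.update(pattern_growth(proj, min_sup, pattern))
    return results

tdb = ['facdgimp', 'abcflmo', 'bfhjo', 'bcksp', 'afcelpmn']
print(pattern_growth(tdb, 3))   # includes ('a','c','m'): 3, etc.
```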

Page 93:

Scaling up by DB Projection

•  What if an FP-tree cannot fit into memory?
•  Database projection
   – Partition a database into a set of projected databases
   – Construct and mine the FP-tree once the projected database fits into main memory
•  Heuristic: projected databases shrink quickly in many applications

Page 94:

Parallel vs. Partition Projection

•  Parallel projection: form all projected databases at a time
•  Partition projection: propagate projections

     Tran. DB:    fcamp, fcabm, fb, cbp, fcamp
     p-proj DB:   fcam, cb, fcam
     m-proj DB:   fcab, fca, fca
     b-proj DB:   f, cb, …
     a-proj DB:   fc, …
     c-proj DB:   f, …
     f-proj DB:   …
     am-proj DB:  fc, fc, fc
     cm-proj DB:  f, f, f

Page 95:

Why Is FP-growth Efficient?

•  Divide-and-conquer strategy
   – Decompose both the mining task and the DB
   – Leads to focused search of smaller databases
•  Other factors
   – No candidate generation nor candidate test
   – Database compression using the FP-tree
   – No repeated scan of the entire database
   – Basic operations – counting local frequent items and building FP-trees; no pattern search nor pattern matching

Page 96:

Major Costs in FP-growth

•  Poor locality of FP-trees
   – Low hit rate of cache
•  Building FP-trees
   – A stack of FP-trees
•  Redundant information
   – Transaction abcd appears in the a-, ab-, abc-, ac-, …, c-projected databases and FP-trees

Page 97:

Effectiveness of Freq Pat Mining

•  Too many patterns!
   – A pattern a1a2…an contains 2^n − 1 subpatterns
   – Understanding many patterns is difficult or even impossible for human users
•  Non-focused mining
   – A manager may only be interested in patterns involving the items (s)he manages
   – A user is often interested in patterns satisfying some constraints

Page 98:

Itemset Lattice

     Tid   Transaction
     10    ABD
     20    ABC
     30    AD
     40    ABCD
     50    CD

Min_sup = 2

     Length   Frequent itemsets
     1        A, B, C, D
     2        AB, AC, AD, BC, BD, CD
     3        ABC, ABD

Itemset lattice:
     {}
     A    B    C    D
     AB   AC   AD   BC   BD   CD
     ABC  ABD  ACD  BCD
     ABCD

Page 99:

Max-Patterns

     Tid   Transaction
     10    ABD
     20    ABC
     30    AD
     40    ABCD
     50    CD

Min_sup = 2

     Length   Frequent itemsets
     1        A, B, C, D
     2        AB, AC, AD, BC, BD, CD
     3        ABC, ABD

Itemset lattice:
     {}
     A    B    C    D
     AB   AC   AD   BC   BD   CD
     ABC  ABD  ACD  BCD
     ABCD

Page 100:

Borders and Max-patterns

•  Max-patterns: the borders of frequent patterns
   – Any subset of a max-pattern is frequent
   – Any superset of a max-pattern is infrequent
   – Cannot generate rules (the support counts of subsets are not kept)

[Itemset lattice over A, B, C, D as on the previous slides]

Page 101:

Patterns and Support Counts

     Tid   Transaction
     10    ABD
     20    ABC
     30    AD
     40    ABCD
     50    CD

Min_sup = 2

     Len   Frequent itemsets
     1     A:4, B:3, C:3, D:4
     2     AB:3, AC:2, AD:3, BC:2, BD:2, CD:2
     3     ABC:2, ABD:2

Itemset lattice with support counts:
     {}
     A:4    B:3    C:3    D:4
     AB:3   AC:2   AD:3   BC:2   BD:2   CD:2
     ABC:2  ABD:2  ACD    BCD
     ABCD

Page 102:

Frequent Closed Patterns

•  For a frequent itemset X, if there exists no item y ∉ X s.t. every transaction containing X also contains y, then X is a frequent closed pattern
   – “acdf” is a frequent closed pattern
•  Concise representation of frequent patterns
   – Can generate non-redundant rules
•  Reduces the number of patterns and rules
•  N. Pasquier et al., ICDT’99

Min_sup = 2
     TID   Items
     10    a, c, d, e, f
     20    a, b, e
     30    c, e, f
     40    a, c, d, f
     50    c, e, f
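The definition can be checked directly by intersecting the transactions that contain X; a sketch (names are mine, and this is a definition check, not an efficient closed-pattern miner like the ICDT’99 algorithm):

```python
def is_closed(X, transactions, min_sup):
    """X is a frequent closed pattern iff X is frequent and no item
    y outside X occurs in every transaction that contains X."""
    X = frozenset(X)
    cover = [frozenset(t) for t in transactions if X <= frozenset(t)]
    if len(cover) < min_sup:
        return False
    closure = frozenset.intersection(*cover)   # items shared by the cover
    return closure == X

db = ['acdef', 'abe', 'cef', 'acdf', 'cef']   # the TDB above
print(is_closed('acdf', db, 2))   # True, as on the slide
print(is_closed('acd', db, 2))    # False: every transaction with acd has f
```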

Page 103:

Closed and Max-patterns

•  Closed pattern mining algorithms can be adapted to mine max-patterns
   – A max-pattern must be closed
•  Depth-first search methods have advantages over breadth-first search ones
   – Why?

Page 104:

Constraint-based Data Mining

•  Find all the patterns in a database autonomously?
   – The patterns could be too many but not focused!
•  Data mining should be interactive
   – The user directs what is to be mined
•  Constraint-based mining
   – User flexibility: provide constraints on what is to be mined
   – System optimization: push constraints for efficient mining

Page 105:

Constraints in Data Mining

•  Knowledge type constraint
   – classification, association, etc.
•  Data constraint — using SQL-like queries
   – find product pairs sold together in stores in New York
•  Dimension/level constraint
   – in relevance to region, price, brand, customer category
•  Rule (or pattern) constraint
   – small sales (price < $10) triggers big sales (sum > $200)
•  Interestingness constraint
   – strong rules: support and confidence

Page 106:

Constrained Mining vs. Search

•  Constrained mining vs. constraint-based search
   – Both aim at reducing the search space
   – Finding all patterns vs. some (or one) answers satisfying constraints
   – Constraint-pushing vs. heuristic search
   – An interesting research problem on integrating both
•  Constrained mining vs. DBMS query processing
   – Database query processing requires finding all answers
   – Constrained pattern mining shares a similar philosophy as pushing selections deeply into query processing

Page 107:

Optimization

•  Mining frequent patterns with constraint C
   – Sound: only find patterns satisfying the constraint C
   – Complete: find all patterns satisfying the constraint C
•  A naïve solution
   – Constraint test as post-processing
•  More efficient approaches
   – Analyze the properties of constraints
   – Push constraints as deeply as possible into frequent pattern mining

Page 108:

Anti-Monotonicity

•  Anti-monotonicity
   – If an itemset S violates the constraint, so does any of its supersets
   – sum(S.Price) ≤ v is anti-monotone
   – sum(S.Price) ≥ v is not anti-monotone
•  Example
   – C: range(S.profit) ≤ 15
   – Itemset ab violates C
   – So does every superset of ab

TDB (min_sup = 2)
     TID   Transaction
     10    a, b, c, d, f
     20    b, c, d, f, g, h
     30    a, c, d, e, f
     40    c, e, f, g

     Item   Profit
     a      40
     b      0
     c      -20
     d      10
     e      -30
     f      30
     g      20
     h      -10
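A sketch of pushing this constraint into mining (names are mine): the check below, applied right after candidate generation, prunes an itemset and thereby its whole branch of supersets before any support counting.

```python
def satisfies_range(S, profit, v=15):
    """Anti-monotone constraint range(S.profit) <= v: once an itemset
    violates it, every superset violates it too."""
    vals = [profit[i] for i in S]
    return max(vals) - min(vals) <= v

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10,
          'e': -30, 'f': 30, 'g': 20, 'h': -10}
print(satisfies_range('ab', profit))   # False: range = 40 > 15, prune ab
```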

Page 109:

Anti-monotonic Constraints

Constraint                            Anti-monotone
v ∈ S                                 no
S ⊇ V                                 no
S ⊆ V                                 yes
min(S) ≤ v                            no
min(S) ≥ v                            yes
max(S) ≤ v                            yes
max(S) ≥ v                            no
count(S) ≤ v                          yes
count(S) ≥ v                          no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)            yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)            no
range(S) ≤ v                          yes
range(S) ≥ v                          no
avg(S) θ v, θ ∈ {=, ≤, ≥}             convertible
support(S) ≥ ξ                        yes
support(S) ≤ ξ                        no


Monotonicity

•  Monotonicity
   –  If an itemset S satisfies the constraint, so does any of its supersets
   –  sum(S.Price) ≥ v is monotone
   –  min(S.Price) ≤ v is monotone

•  Example: C: range(S.profit) ≥ 15
   –  Itemset ab satisfies C
   –  So does every superset of ab

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10


Monotonic Constraints

Constraint                            Monotone
v ∈ S                                 yes
S ⊇ V                                 yes
S ⊆ V                                 no
min(S) ≤ v                            yes
min(S) ≥ v                            no
max(S) ≤ v                            no
max(S) ≥ v                            yes
count(S) ≤ v                          no
count(S) ≥ v                          yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)            no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)            yes
range(S) ≤ v                          no
range(S) ≥ v                          yes
avg(S) θ v, θ ∈ {=, ≤, ≥}             convertible
support(S) ≥ ξ                        no
support(S) ≤ ξ                        yes


Converting “Tough” Constraints

•  Convert tough constraints into anti-monotone or monotone constraints by properly ordering the items

•  Examine C: avg(S.profit) ≥ 25
   –  Order items in value-descending order: <a, f, g, d, b, h, c, e>
   –  If an itemset afb violates C, so do afbh and afb* (any extension of afb)
   –  The constraint becomes anti-monotone!

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10


Convertible Constraints

•  Let R be an order of items
•  Convertible anti-monotone
   –  If an itemset S violates a constraint C, so does every itemset having S as a prefix w.r.t. R
   –  Ex. avg(S) ≤ v w.r.t. item-value-ascending order
•  Convertible monotone
   –  If an itemset S satisfies constraint C, so does every itemset having S as a prefix w.r.t. R
   –  Ex. avg(S) ≥ v w.r.t. item-value-ascending order
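A minimal Python sketch of why ordering works, using the running profit table: w.r.t. the value-descending order R, each extension appends a value no larger than anything already in the prefix, so the prefix average can only stay the same or drop, making avg(S) ≥ v behave anti-monotonically:

    profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30,
              'f': 30, 'g': 20, 'h': -10}
    R = sorted(profit, key=profit.get, reverse=True)
    print(R)   # ['a', 'f', 'g', 'd', 'b', 'h', 'c', 'e']

    def prefix_avgs(prefix):
        # running average of each prefix of the itemset, listed in R order
        avgs, total = [], 0
        for k, item in enumerate(prefix, 1):
            total += profit[item]
            avgs.append(total / k)
        return avgs

    print(prefix_avgs(['a', 'f', 'b']))   # [40.0, 35.0, 23.33...]; once the average
                                          # falls below 25, no extension w.r.t. R lifts it back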


Strongly Convertible Constraints

•  avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value-descending order R: <a, f, g, d, b, h, c, e>
   –  If an itemset af violates C, so does every itemset with af as a prefix, such as afd
•  avg(X) ≥ 25 is convertible monotone w.r.t. the item-value-ascending order R⁻¹: <e, c, h, b, d, g, f, a>
   –  If an itemset d satisfies C, so do itemsets df and dfa, which have d as a prefix
•  Thus, avg(X) ≥ 25 is strongly convertible

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10


Convertible Constraints

Constraint                                          Convertible     Convertible   Strongly
                                                    anti-monotone   monotone      convertible
avg(S) ≤ v, ≥ v                                     Yes             Yes           Yes
median(S) ≤ v, ≥ v                                  Yes             Yes           Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)     Yes             No            No
sum(S) ≤ v (items could be of any value, v ≤ 0)     No              Yes           No
sum(S) ≥ v (items could be of any value, v ≥ 0)     No              Yes           No
sum(S) ≥ v (items could be of any value, v ≤ 0)     Yes             No            No


Can Apriori Handle Convertible Constraints?

•  A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
   –  Within the level-wise framework, no direct pruning based on the constraint can be made
   –  Itemset df violates constraint C: avg(X) ≥ 25
   –  Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
•  But it can be pushed into the frequent-pattern growth framework!

Item   Value
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10


Mining With Convertible Constraints

•  C: avg(S.profit) ≥ 25
•  List the items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
   –  C is convertible anti-monotone w.r.t. R
•  Scan the transaction DB once
   –  Remove infrequent items
      •  Item h (support 1) in transaction 20 is dropped
   –  Itemsets a and f are good

TDB (min_sup = 2), transactions sorted by R with infrequent items removed
TID   Transaction
10    a, f, d, b, c
20    f, g, d, b, c
30    a, f, d, c, e
40    f, g, c, e

Item   Profit
a      40
f      30
g      20
d      10
b      0
h      -10
c      -20
e      -30


Not Every Pattern Is Interesting!

•  Trivial patterns
   –  Pregnant → Female [100% confidence]

•  Misleading patterns
   –  Play basketball → eat cereal [40%, 66.7%]

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000


Evaluation Criteria

•  Objective interestingness measures
   –  Examples: support; patterns formed by mutually independent items
   –  Domain independent

•  Subjective measures
   –  Examples: domain knowledge, templates/constraints


Correlation and Lift

•  P(B|A)/P(B) is called the lift of rule A → B

   corr(A, B) = P(A ∪ B) / (P(A) P(B)) = P(AB) / (P(A) P(B))

•  Play basketball → eat cereal (lift: 0.89)
•  Play basketball → not eat cereal (lift: 1.33)

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000
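A quick sketch of how the two lift values above fall out of the contingency table (illustrative arithmetic only):

    # lift(A -> B) = P(AB) / (P(A) * P(B)), read off the 2x2 table above
    N = 5000
    def lift(n_ab, n_a, n_b, n=N):
        return (n_ab / n) / ((n_a / n) * (n_b / n))

    print(lift(2000, 3000, 3750))   # basketball -> cereal: ~0.89
    print(lift(1000, 3000, 1250))   # basketball -> not cereal: ~1.33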

Contingency table

Table 6.7. A 2-way contingency table for variables A and B.

        B      ¬B
A       f11    f10     f1+
¬A      f01    f00     f0+
        f+1    f+0     N

counts tabulated in a contingency table. Table 6.7 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation ¬A (¬B) to indicate that A (B) is absent from a transaction. Each entry fij in this 2 × 2 table denotes a frequency count. For example, f11 is the number of times A and B appear together in the same transaction, while f01 is the number of transactions that contain B but not A. The row sum f1+ represents the support count for A, while the column sum f+1 represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.

Limitations of the Support-Confidence Framework. The existing association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support was previously described in Section 6.8, in which many potentially interesting patterns involving low-support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.

Example 6.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a table such as the one shown in Table 6.8.

Table 6.8. Beverage preferences among a group of 1000 people.

        Coffee   ¬Coffee
Tea     150      50        200
¬Tea    650      150       800
        800      200       1000


Property of Lift

•  If A and B are independent, lift = 1
•  If A and B are positively correlated, lift > 1
•  If A and B are negatively correlated, lift < 1
•  Limitation: lift is sensitive to P(A) and P(B)


Table 6.9. Contingency tables for the word pairs {p, q} and {r, s}.

        p     ¬p                   r     ¬r
q       880   50     930    s      20    50     70
¬q      50    20     70     ¬s     50    880    930
        930   70     1000          70    930    1000

This equation follows from the standard approach of using simple fractions as estimates for probabilities. The fraction f11/N is an estimate for the joint probability P(A, B), while f1+/N and f+1/N are the estimates for P(A) and P(B), respectively. If A and B are statistically independent, then P(A, B) = P(A) × P(B), thus leading to the formula shown in Equation 6.6. Using Equations 6.5 and 6.6, we can interpret the measure as follows:

            ⎧ = 1,  if A and B are independent;
    I(A, B) ⎨ > 1,  if A and B are positively correlated;      (6.7)
            ⎩ < 1,  if A and B are negatively correlated.

For the tea-coffee example shown in Table 6.8, I = 0.15 / (0.2 × 0.8) = 0.9375, thus suggesting a slight negative correlation between tea drinkers and coffee drinkers.

Limitations of Interest Factor. We illustrate the limitation of interest factor with an example from the text mining domain. In the text domain, it is reasonable to assume that the association between a pair of words depends on the number of documents that contain both words. For example, because of their stronger association, we expect the words data and mining to appear together more frequently than the words compiler and mining in a collection of computer science articles.

Table 6.9 shows the frequency of occurrences between two pairs of words, {p, q} and {r, s}. Using the formula given in Equation 6.5, the interest factor for {p, q} is 1.02 and for {r, s} is 4.08. These results are somewhat troubling for the following reasons. Although p and q appear together in 88% of the documents, their interest factor is close to 1, which is the value when p and q are statistically independent. On the other hand, the interest factor for {r, s} is higher than that for {p, q} even though r and s seldom appear together in the same document. Confidence is perhaps the better choice in this situation because it considers the association between p and q (94.6%) to be much stronger than that between r and s (28.6%).

lift(p, q) < lift(r, s)!


From Itemsets to Sequences

•  Itemsets: combinations of items, no temporal order
•  Temporal order is important in many situations
   –  Time-series databases and sequence databases
   –  Frequent patterns → (frequent) sequential patterns
•  Applications of sequential pattern mining
   –  Customer shopping sequences: first buy a computer, then an iPod, and then a digital camera, within 3 months
   –  Medical treatment, natural disasters, science and engineering processes, stocks and markets, telephone calling patterns, Web log clickthrough streams, DNA sequences and gene structures


What Is Sequential Pattern Mining?

•  Given a set of sequences, find the complete set of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items; items within an element are unordered and are listed alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern

A sequence database
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
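The core test behind the definition is subsequence containment: every element of the candidate must match, in order, a superset element of the data sequence. A minimal Python sketch (greedy earliest-match, which is safe for this test):

    # Is s a subsequence of t?  Elements are itemsets; s[j] must be a
    # subset of some element of t, with matches in left-to-right order.
    def is_subsequence(s, t):
        j = 0
        for element in t:
            if j < len(s) and s[j] <= element:   # subset test on Python sets
                j += 1
        return j == len(s)

    t = [{'a'}, {'a','b','c'}, {'a','c'}, {'d'}, {'c','f'}]   # <a(abc)(ac)d(cf)>
    s = [{'a'}, {'b','c'}, {'d'}, {'c'}]                      # <a(bc)dc>
    print(is_subsequence(s, t))                               # True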


Challenges in Seq Pat Mining

•  A huge number of possible sequential patterns are hidden in databases
•  A mining algorithm should
   –  Find the complete set of patterns satisfying the minimum support (frequency) threshold
   –  Be highly efficient and scalable, involving only a small number of database scans
   –  Be able to incorporate various kinds of user-specific constraints


Apriori Property of Seq Patterns

•  Apriori property of sequential patterns
   –  If a sequence S is infrequent, then none of the super-sequences of S is frequent
   –  E.g., if <hb> is infrequent, so are <hab> and <(ah)b>

Given support threshold min_sup = 2

Seq-id   Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>


GSP

•  GSP (Generalized Sequential Pattern) mining
•  Outline of the method
   –  Initially, every item in the DB is a length-1 candidate
   –  For each level (i.e., sequences of length k):
      •  Scan the database to collect the support count for each candidate sequence
      •  Generate length-(k+1) candidate sequences from the length-k frequent sequences using the Apriori property
   –  Repeat until no frequent sequence or no candidate can be found
•  Major strength: candidate pruning by the Apriori property


Finding Len-1 Seq Patterns

•  Initial candidates: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
•  Scan the database once to count support for the candidates

min_sup = 2

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1

Seq-id   Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>


Generating Length-2 Candidates

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

       <a>    <b>      <c>      <d>      <e>      <f>
<a>           <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>                    <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                             <(cd)>   <(ce)>   <(cf)>
<d>                                      <(de)>   <(df)>
<e>                                               <(ef)>
<f>

51 length-2 candidates in total
Without the Apriori property there would be 8×8 + 8×7/2 = 92 candidates; Apriori prunes 44.57% of them


Finding Len-2 Seq Patterns

•  Scan the database one more time to collect the support count for each length-2 candidate
•  19 length-2 candidates pass the minimum support threshold
   –  They are the length-2 sequential patterns


Generating Length-3 Candidates and Finding Length-3 Patterns

•  Generate length-3 candidates
   –  Self-join the length-2 sequential patterns
      •  <ab>, <aa> and <ba> are all length-2 sequential patterns → <aba> is a length-3 candidate
      •  <(bd)>, <bb> and <db> are all length-2 sequential patterns → <(bd)b> is a length-3 candidate
   –  46 candidates are generated
•  Find length-3 sequential patterns
   –  Scan the database once more to collect support counts for the candidates
   –  19 of the 46 candidates pass the support threshold


The GSP Mining Process

[Figure: the candidate lattice, growing from <a> … <h> at level 1, through length-2 candidates such as <aa>, <ab>, …, <(ab)>, …, length-3 candidates such as <abb>, <aba>, <(bd)bc>, and length-4 candidates such as <abba>, up to the single length-5 candidate <(bd)cba>; candidates either fail the support threshold or do not appear in the DB at all]

•  1st scan: 8 candidates, 6 length-1 sequential patterns
•  2nd scan: 51 candidates, 19 length-2 sequential patterns, 10 candidates not in the DB at all
•  3rd scan: 46 candidates, 19 length-3 sequential patterns, 20 candidates not in the DB at all
•  4th scan: 8 candidates, 6 length-4 sequential patterns
•  5th scan: 1 candidate, 1 length-5 sequential pattern

min_sup = 2

Seq-id   Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>


The GSP Algorithm

•  Take all sequences of the form <x> as length-1 candidates
•  Scan the database once to find F1, the set of length-1 sequential patterns
•  Let k = 1; while Fk is not empty do
   –  Form Ck+1, the set of length-(k+1) candidates, from Fk
   –  If Ck+1 is not empty, scan the database once to find Fk+1, the set of length-(k+1) sequential patterns
   –  Let k = k + 1
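As a rough illustration, here is the level-wise loop in Python, simplified to sequences of single items (the full GSP additionally joins itemset elements; is_subseq and the example database are assumptions made for this sketch):

    # Simplified GSP loop over sequences of single items.
    def is_subseq(s, t):
        it = iter(t)
        return all(x in it for x in s)   # consumes t left to right

    def gsp(db, min_sup):
        sup = lambda c: sum(is_subseq(c, t) for t in db)
        F = [(x,) for x in sorted({x for t in db for x in t})
             if sup((x,)) >= min_sup]
        patterns = list(F)
        while F:
            # join: a's (k-1)-suffix must equal b's (k-1)-prefix
            cands = {a + (b[-1],) for a in F for b in F if a[1:] == b[:-1]}
            F = [c for c in sorted(cands) if sup(c) >= min_sup]
            patterns.extend(F)
        return patterns

    db = [list('bdcba'), list('bcbf'), list('ababf')]
    print(gsp(db, 2))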


Bottlenecks of GSP

•  A huge set of candidates
   –  1,000 frequent length-1 sequences generate 1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates!
•  Multiple scans of the database during mining
•  The real challenge: mining long sequential patterns
   –  An exponential number of short candidates
   –  A length-100 sequential pattern needs

        sum_{i=1}^{100} C(100, i) = 2^100 − 1 ≈ 10^30

     candidate sequences!


FreeSpan: Frequent-Pattern-Projected Sequential Pattern Mining

•  The itemset of a sequential pattern must be frequent
   –  Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns
   –  Mine each projected database to find its patterns

f_list: b:5, c:4, a:3, d:3, e:3, f:2

All sequential patterns can be divided into 6 subsets:
•  Those containing item f
•  Those containing e but no f
•  Those containing d but no e or f
•  Those containing a but no d, e or f
•  Those containing c but no a, d, e or f
•  Those containing only item b

Sequence Database SDB
<(bd)cb(ac)>
<(bf)(ce)b(fg)>
<(ah)(bf)abf>
<(be)(ce)d>
<a(bd)bcb(ade)>


From FreeSpan to PrefixSpan

•  FreeSpan
   –  Projection-based: no candidate sequence needs to be generated
   –  But projection can be performed at any point in the sequence, and the projected sequences may not shrink much
•  PrefixSpan
   –  Also projection-based
   –  But only prefix-based projection: fewer projections and quickly shrinking sequences


Prefix and Suffix (Projection)

•  <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
•  Given the sequence <a(abc)(ac)d(cf)>:

Prefix   Suffix (prefix-based projection)
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
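A minimal sketch of the projection step, simplified to single-item sequences (the slides' "(_x)" notation additionally tracks partially matched elements, which this sketch ignores):

    # Project a database on a prefix item: keep the suffix after the
    # first occurrence of the item in each sequence that contains it.
    def project(db, item):
        return [seq[seq.index(item) + 1:] for seq in db if item in seq]

    db = [list('abcacdcf'), list('adcbcae'), list('efabdfcb'), list('egafcbc')]
    print(project(db, 'a'))
    # Recursive mining would now find the frequent items in this projected
    # database, extend the prefix, and project again.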


Mining Sequential Patterns by Prefix Projections

•  Step 1: find the length-1 sequential patterns
   –  <a>, <b>, <c>, <d>, <e>, <f>
•  Step 2: divide the search space; the complete set of sequential patterns can be partitioned into 6 subsets:
   –  Those having prefix <a>
   –  Those having prefix <b>
   –  …
   –  Those having prefix <f>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>


Finding Seq. Pat. with Prefix <a>

•  Only need to consider projections w.r.t. <a>
   –  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
•  Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
   –  Further partition into 6 subsets
      •  Those having prefix <aa>
      •  …
      •  Those having prefix <af>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>


Completeness of PrefixSpan

SDB
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

Having prefix <a> → <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
   Length-2 sequential patterns with prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      Having prefix <aa> → <aa>-projected database; …; having prefix <af> → <af>-projected database
Having prefix <b> → <b>-projected database; …; and likewise for prefixes <c>, …, <f>


Efficiency of PrefixSpan

•  No candidate sequence needs to be generated
•  Projected databases keep shrinking
•  Major cost of PrefixSpan: constructing the projected databases
   –  Can be improved by bi-level projections


Effectiveness

•  Redundancy due to anti-monotonicity
   –  {<abcd>} leads to 15 sequential patterns of the same support
   –  Remedies: closed sequential patterns and sequential generators
•  Constraints on sequential patterns
   –  Gap
   –  Length
   –  More sophisticated, application-oriented constraints


Data Warehousing & OLAP


Motivation: Business Intelligence


Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)

Product information (Product-id, category, manufacturer, made-in, stock-price, …)

Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)

Business queries:
•  Which categories of products are most popular with customers in Vancouver?
•  Find pairs (customer group, most popular products)


Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat …

In what aspects is he most similar to cases of coronary artery disease and, at the same time, dissimilar to adiposity?


Don’t You Ever Google Yourself?

•  Big data makes one know oneself better
•  57% of American adults have searched for themselves on the Internet
   –  Good news: those people are better paid than those who haven’t done so! (Investors.com)
•  Egocentric analysis becomes more and more important with big data


Egocentric Analysis

•  How am I different from (more often than not, better than) others?

•  In what aspects am I good?


http://img03.deviantart.net/a670/i/2010/219/a/e/glee___egocentric_by_gleeondoodles.jpg


Dimensions

•  “An aspect or feature of a situation, problem, or thing; a measurable extent of some kind” – dictionary definition
•  Dimensions/attributes are used to model complex objects in a divide-and-conquer manner
   –  Objects are compared in selected dimensions/attributes
•  More often than not, objects have more dimensions/attributes than one is interested in or can handle


Multi-dimensional Analysis

•  Find interesting patterns in multi-dimensional subspaces
   –  “Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)”
•  Different patterns may be manifested in different subspaces
   –  Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction, i.e., one set of features for all objects
   –  Different subspaces may manifest different patterns


OLAP

•  Conceptually, we may explore all possible subspaces for interesting patterns
•  What patterns are interesting?
•  How can we explore all possible subspaces systematically and efficiently?
•  These are fundamental problems in analytics and data mining


OLAP

•  Aggregates and group-bys are frequently used in data analysis and summarization

   SELECT time, altitude, AVG(temp)
   FROM weather
   GROUP BY time, altitude;

   –  In TPC, the 6 standard benchmarks have 83 queries; aggregates are used 59 times and group-bys 20 times
•  Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently


OLAP Operations

•  Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
   –  (Day, Store, Product type, SUM(sales)) → (Month, City, *, SUM(sales))
•  Drill down (roll down): the reverse of roll-up, going from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions


Roll Up


http://www.tutorialspoint.com/dwh/images/rollup.jpg


Drill Down


http://www.tutorialspoint.com/dwh/images/drill_down.jpg


Other Operations

•  Dice: pick specific values or ranges on some dimensions

•  Pivot: “rotate” a cube – changing the order of dimensions in visual analysis


http://en.wikipedia.org/wiki/File:OLAP_pivoting.png


Dice


http://www.tutorialspoint.com/dwh/images/dice.jpg


Relational Representation

•  If there are n dimensions, there are 2^n possible aggregation columns
•  Roll up by model, by year, and by color in a single table


Difficulties

•  Many group-bys are needed
   –  6 dimensions → 2^6 = 64 group-bys
•  In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!


Dummy Value “ALL”


CUBE

SALES
Model   Year   Color   Sales
Chevy   1990   red     5
Chevy   1990   white   87
Chevy   1990   blue    62
Chevy   1991   red     54
Chevy   1991   white   95
Chevy   1991   blue    49
Chevy   1992   red     31
Chevy   1992   white   54
Chevy   1992   blue    71
Ford    1990   red     64
Ford    1990   white   62
Ford    1990   blue    63
Ford    1991   red     52
Ford    1991   white   9
Ford    1991   blue    55
Ford    1992   red     27
Ford    1992   white   62
Ford    1992   blue    39

SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
  AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);

DATA CUBE
Model   Year   Color   Sales
Chevy   1990   blue    62
Chevy   1990   red     5
Chevy   1990   white   87
Chevy   1990   ALL     154
Chevy   1991   blue    49
Chevy   1991   red     54
Chevy   1991   white   95
Chevy   1991   ALL     198
Chevy   1992   blue    71
Chevy   1992   red     31
Chevy   1992   white   54
Chevy   1992   ALL     156
Chevy   ALL    blue    182
Chevy   ALL    red     90
Chevy   ALL    white   236
Chevy   ALL    ALL     508
Ford    1990   blue    63
Ford    1990   red     64
Ford    1990   white   62
Ford    1990   ALL     189
Ford    1991   blue    55
Ford    1991   red     52
Ford    1991   white   9
Ford    1991   ALL     116
Ford    1992   blue    39
Ford    1992   red     27
Ford    1992   white   62
Ford    1992   ALL     128
Ford    ALL    blue    157
Ford    ALL    red     143
Ford    ALL    white   133
Ford    ALL    ALL     433
ALL     1990   blue    125
ALL     1990   red     69
ALL     1990   white   149
ALL     1990   ALL     343
ALL     1991   blue    106
ALL     1991   red     104
ALL     1991   white   110
ALL     1991   ALL     314
ALL     1992   blue    110
ALL     1992   red     58
ALL     1992   white   116
ALL     1992   ALL     284
ALL     ALL    blue    339
ALL     ALL    red     233
ALL     ALL    white   369
ALL     ALL    ALL     941


Semantics of ALL

•  ALL is a set
   –  Model.ALL = ALL(Model) = {Chevy, Ford}
   –  Year.ALL = ALL(Year) = {1990, 1991, 1992}
   –  Color.ALL = ALL(Color) = {red, white, blue}


OLTP Versus OLAP

                     OLTP                             OLAP
users                clerk, IT professional           knowledge worker
function             day-to-day operations            decision support
DB design            application-oriented             subject-oriented
data                 current, up-to-date, detailed,   historical, summarized,
                     flat relational, isolated        multidimensional, integrated,
                                                      consolidated
usage                repetitive                       ad-hoc
access               read/write, index/hash on        lots of scans
                     primary key
unit of work         short, simple transaction        complex query
# records accessed   tens                             millions
# users              thousands                        hundreds
DB size              100MB-GB                         100GB-TB
metric               transaction throughput           query throughput, response time


What Is a Data Warehouse?

•  “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” – W. H. Inmon
•  Data warehousing: the process of constructing and using data warehouses


Subject-Oriented

•  Organized around major subjects, such as customer, product, sales

•  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

•  Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process


Integrated

•  Integrating multiple, heterogeneous data sources
   –  Relational databases, flat files, on-line transaction records
•  Data cleaning and data integration
   –  Ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
      •  E.g., hotel price: currency, tax, whether breakfast is covered, etc.
   –  When data is moved to the warehouse, it is converted


Time Variant

•  The time horizon for the data warehouse is significantly longer than that of operational systems
   –  Operational databases: current-value data
   –  Data warehouse data: provide information from a historical perspective (e.g., the past 5-10 years)
•  Every key structure in the data warehouse contains an element of time, explicitly or implicitly
   –  But the key of operational data may or may not contain a “time element”


Nonvolatile

•  A physically separate store of data transformed from the operational environment
•  Operational updates of data do not occur in the data warehouse environment
   –  It does not require transaction processing, recovery, or concurrency control mechanisms
   –  It requires only two operations in data accessing:
      •  Initial loading of data
      •  Access of data


Why Separate Data Warehouse?

•  High performance for both
   –  Operational DBMS: tuned for OLTP
   –  Warehouse: tuned for OLAP
•  Different functions and different data
   –  Historical data: data analysis often uses historical data that operational databases do not typically maintain
   –  Data consolidation: data analysis requires consolidation (aggregation, summarization) of data from heterogeneous sources


Data Warehouse Schema Design

•  Query answering efficiency
   –  Subject orientation
   –  Integration
•  Tradeoff between time and space
   –  Universal table versus fully normalized schema


Star Schema

Dimension tables:
   time (time_key, day, day_of_the_week, month, quarter, year)
   item (item_key, item_name, brand, type, supplier_type)
   branch (branch_key, branch_name, branch_type)
   location (location_key, street, city, state_or_province, country)

Sales fact table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales


Snowflake Schema

Dimension tables (partially normalized):
   time (time_key, day, day_of_the_week, month, quarter, year)
   item (item_key, item_name, brand, type, supplier_key), with supplier (supplier_key, supplier_type)
   branch (branch_key, branch_name, branch_type)
   location (location_key, street, city_key), with city (city_key, city, state_or_province, country)

Sales fact table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales


Fact Constellation

Dimension tables (shared across fact tables):
   time (time_key, day, day_of_the_week, month, quarter, year)
   item (item_key, item_name, brand, type, supplier_type)
   branch (branch_key, branch_name, branch_type)
   location (location_key, street, city, province_or_state, country)
   shipper (shipper_key, shipper_name, location_key, shipper_type)

Sales fact table: time_key, item_key, branch_key, location_key
   Measures: units_sold, dollars_sold, avg_sales
Shipping fact table: time_key, item_key, shipper_key, from_location, to_location
   Measures: dollars_cost, units_shipped


(Good) Aggregate Functions

•  Distributive: there is a function G() such that
      F({Xi,j}) = G({ F({Xi,j | i = 1, …, I}) | j = 1, …, n })
   –  Examples: COUNT(), MIN(), MAX(), SUM()
   –  G = SUM() for COUNT()
•  Algebraic: there is an M-tuple-valued function G() and a function H() such that
      F({Xi,j}) = H({ G({Xi,j | i = 1, …, I}) | j = 1, …, n })
   –  Examples: AVG(), standard deviation, MaxN(), MinN()
   –  For AVG(), G() records the sum and the count; H() adds these components across partitions and divides, producing the global average
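For instance, AVG() is algebraic with M = 2: a minimal sketch where each partition contributes the pair (sum, count) and H() merges the pairs into the global average:

    # AVG as an algebraic aggregate: G() returns a 2-tuple per partition,
    # H() combines the tuples into the global answer.
    def g(chunk):
        return (sum(chunk), len(chunk))

    def h(subaggs):
        total = sum(s for s, _ in subaggs)
        count = sum(c for _, c in subaggs)
        return total / count

    chunks = [[1, 2, 3], [4, 5], [6]]
    print(h([g(c) for c in chunks]))   # 3.5, identical to AVG over all values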


Holistic Aggregate Functions

•  There is no constant bound on the size of the storage needed to describe a sub-aggregate
   –  There is no constant M such that an M-tuple characterizes the computation F({Xi,j | i = 1, …, I})
•  Examples: Median(), MostFrequent() (also called Mode()), and Rank()


Index Requirements in OLAP

•  Data is read only
   –  (Almost) no insertions or deletions
•  Query types
   –  Point query: looking up one specific tuple (rare)
   –  Range query: returning the aggregate of a (large) set of tuples, with group-by
   –  Complex queries: need specific algorithms and index structures, discussed later


OLAP Query Example

•  In a table (cust, gender, …), find the total number of male customers
•  Method 1: scan the table once
•  Method 2: build a B+-tree index on attribute gender; this still needs to access all tuples of male customers
•  Can we get the count without scanning many tuples, not even all the tuples of male customers?


Bitmap Index

•  For n tuples, a bitmap index has n bits and can be packed into ⌈n/8⌉ bytes or ⌈n/32⌉ words
•  From a bit to the row-id: the j-th bit of the p-th byte → row-id = p*8 + j

cust    gender   …     bit
Jack    M        …     1
Cathy   F        …     0
…       …        …     …
Nancy   F        …     0
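A minimal sketch of building such a packed bitmap, here for the predicate gender = 'M' over illustrative data:

    # One bit per tuple; bit j of byte p encodes row-id p*8 + j.
    customers = [('Jack', 'M'), ('Cathy', 'F'), ('Bob', 'M'), ('Nancy', 'F')]

    bitmap = bytearray((len(customers) + 7) // 8)
    for rowid, (_, gender) in enumerate(customers):
        if gender == 'M':
            bitmap[rowid // 8] |= 1 << (rowid % 8)

    print(bin(bitmap[0]))   # 0b101: bits set for row-ids 0 and 2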


Using Bitmap to Count

•  shcount[] contains, for each possible byte value, the number of set bits in that value
   –  Example: shcount[01100101] = 4

   count = 0;
   for (i = 0; i < SHNUM; i++)       /* loop over the bytes of bitmap B */
       count += shcount[B[i]];
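The same byte-at-a-time popcount idea in Python, with the shcount table built once for all 256 byte values:

    # Precompute the number of set bits for every byte value, then count
    # a bitmap one byte at a time.
    shcount = [bin(i).count('1') for i in range(256)]

    def count_bits(bitmap):
        return sum(shcount[byte] for byte in bitmap)

    print(count_bits(bytearray([0b01100101])))   # 4, as in shcount[01100101] = 4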


Advantages of Bitmap Index

•  Efficient in space
•  Ready for logical composition
   –  C = C1 AND C2
   –  Bitmap operations can be used
•  A bitmap index only works for categorical data with low cardinality
   –  Naively, we need 50 bits per entry to represent the state of a customer in the US
   –  How can we represent a sale in dollars?


Bit-Sliced Index

•  A sale amount can be written as an integer number of pennies and then represented as a binary number of N bits
   –  24 bits is good for up to $167,772.15, appropriate for many stores
•  A bit-sliced index is N bitmaps
   –  Tuple j sets its bit in bitmap k if the k-th bit of its binary representation is on
   –  The space cost of a bit-sliced index is the same as storing the data directly


Using Indexes

SELECT SUM(sales) FROM Sales WHERE C;
   –  The tuples satisfying C are identified by a bitmap B

•  Direct access to rows to calculate SUM: scan the whole table once
•  B+ tree: find the tuples from the tree
•  Projection index: only scan the attribute sales
•  Bit-sliced index: get the sum as Σk COUNT(B AND Bk) × 2^k
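A minimal sketch of the bit-sliced SUM, with bitmaps modeled as Python ints (bit j stands for row j; the amounts are illustrative):

    amounts = [5, 3, 12, 7]             # e.g., sales in pennies
    N_BITS = 4

    # Build the N bit-slices: slice k holds bit k of every amount.
    slices = [0] * N_BITS
    for row, amt in enumerate(amounts):
        for k in range(N_BITS):
            if amt >> k & 1:
                slices[k] |= 1 << row

    B = 0b1011                          # rows 0, 1, 3 satisfy condition C
    total = sum(bin(B & slices[k]).count('1') << k for k in range(N_BITS))
    print(total)                        # 5 + 3 + 7 = 15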


Cost Comparison

•  The traditional value-list index (B+ tree) is costly in both I/O and CPU time
   –  Not good for OLAP
•  The bit-sliced index is efficient in I/O
•  Other case studies appear in [O’Neil and Quass, SIGMOD’97]


Horizontal or Vertical Storage

•  A fact table for data warehousing is often fat
   –  Tens or even hundreds of dimensions/attributes
•  A query is often about only a few attributes
•  Horizontal storage: tuples are stored one by one
•  Vertical storage: tuples are stored attribute by attribute



Horizontal Versus Vertical

•  Find the information of tuple t
   –  Typical in OLTP
   –  Horizontal storage: get the whole tuple in one search
   –  Vertical storage: search 100 lists
•  Find SUM(a100) GROUP BY {a22, a83}
   –  Typical in OLAP
   –  Horizontal storage (no index): search all tuples, O(100n), where n is the number of tuples
   –  Vertical storage: search 3 lists, O(3n), 3% of the cost of the horizontal storage method
•  A projection index is vertical storage


MOLAP

[Figure: a 3-D data cube shown as an array with dimensions Date (1Qtr-4Qtr), Product (TV, VCR, PC), and Country (U.S.A, Canada, Mexico), including sum cells along each dimension]


Pros and Cons

•  Easy to implement
•  Fast retrieval
•  Many entries may be empty if data is sparse
•  Costly in space


ROLAP – Data Cube in Table

•  A multi-dimensional database

Base table
Store   Product   Season   Sales
S1      P1        Spring   6
S1      P2        Spring   12
S2      P1        Fall     9

   ↓ Cubing

Store   Product   Season   AVG(Sales)
S1      P1        Spring   6
S1      P2        Spring   12
S2      P1        Fall     9
S1      *         Spring   9
…       …         …        …
*       *         *        9


Data Cube: A Lattice of Cuboids

•  0-D (apex) cuboid: all
•  1-D cuboids: time, item, location, supplier
•  2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
•  3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
•  4-D (base) cuboid: (time, item, location, supplier)
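The lattice is simply the power set of the dimension set; a tiny sketch that enumerates all 2^4 = 16 cuboids:

    from itertools import combinations

    dims = ('time', 'item', 'location', 'supplier')
    for k in range(len(dims) + 1):            # 0-D apex ... 4-D base cuboid
        for cuboid in combinations(dims, k):
            print(k, cuboid)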


Data Cube: A Lattice of Cuboids


•  Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
   –  Examples: (9/15, milk, Urbana, Dairy_land), (9/15, milk, Urbana, *), (*, milk, Urbana, *), (*, milk, Chicago, *), (*, milk, *, *)



Full Cube vs. Iceberg Cube

•  Full cube vs. iceberg cube

   COMPUTE CUBE sales_iceberg AS
   SELECT month, city, customer_group, COUNT(*)
   FROM salesInfo
   CUBE BY month, city, customer_group
   HAVING COUNT(*) >= min_support        -- the iceberg condition

•  Avoid explosive growth: consider a cube with 100 dimensions
   –  2 base cells: (a1, a2, …, a100) and (b1, b2, …, b100)
   –  How many aggregate cells if “having count >= 1”?
   –  What about “having count >= 2”?


Multi-Way Array Aggregation

•  Array-based “bottom-up” algorithm
•  Uses multi-dimensional chunks
•  No direct tuple comparisons
•  Simultaneous aggregation on multiple dimensions
•  Intermediate aggregate values are re-used for computing ancestor cuboids
•  Cannot do Apriori pruning: no iceberg optimization


[Figure: the cuboid lattice for dimensions A, B, C: All; A, B, C; AB, AC, BC; ABC]

Multi-way Array Aggregation for Cube Computation (MOLAP)


•  Partition the array into chunks (small sub-cubes that fit in memory)
•  Compressed sparse array addressing: (chunk_id, offset)
•  Compute aggregates in a “multi-way” fashion by visiting cube cells in the order that minimizes the number of times each cell must be visited, reducing memory access and storage cost

What is the best traversal order for multi-way aggregation?

[Figure: a 3-D array partitioned into 64 chunks, numbered 1-64, along dimensions A (a0-a3), B (b0-b3), and C (c0-c3)]


Multi-way Array Aggregation for Cube Computation (3-D to 2-D)

[Figure: aggregating the 3-D cuboid ABC down to the 2-D cuboids AB, AC, and BC in the lattice All; A, B, C; AB, AC, BC; ABC]

•  The best order is the one that minimizes the memory requirement and reduces I/O


Multi-way Array Aggregation for Cube Computation (2-D to 1-D)

[Figure: aggregating the 2-D cuboids AB, AC, and BC down to the 1-D cuboids A, B, and C]


Multi-Way Array Aggregation for Cube Computation

•  Method: the planes should be sorted and computed according to their size in ascending order
   –  Idea: keep the smallest plane in main memory, and fetch and compute only one chunk at a time for the largest plane
•  Limitation of the method: it computes well only for a small number of dimensions
   –  If there are a large number of dimensions, “top-down” computation and iceberg cube computation methods can be explored


Iceberg Cube

•  In a data cube, many aggregate cells are trivial
   –  Their aggregate values are too small to be interesting
•  Iceberg query: compute only the cells whose aggregate value passes a threshold


Monotonic Iceberg Condition

•  If COUNT(a, b, *) < 100, then COUNT(a, b, c) < 100 for any c
•  For cells c1 and c2, c1 is called an ancestor of c2 if, in every dimension where c1 takes a non-* value, c2 agrees with c1
   –  (a, b, *) is an ancestor of (a, b, c)
•  An iceberg condition P is monotonic if, for any aggregate cell c failing P, no descendant of c can honor P


BUC

•  Once a base table (A, B, C) is sorted by A-B-C, the aggregates (*,*,*), (A,*,*), (A,B,*) and (A,B,C) can all be computed with one scan and 4 counters (see the sketch below)
•  To compute the other aggregates, we can sort the base table in some other orders
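A minimal sketch of that one-scan computation on illustrative data: one running counter per level, flushed whenever its group's prefix changes:

    rows = [('a1','b1','c1',3), ('a1','b1','c2',5),
            ('a1','b2','c1',2), ('a2','b1','c1',7)]   # already sorted by A-B-C

    out, prev, counters = [], None, [0, 0, 0, 0]      # levels (), A, AB, ABC
    for a, b, c, m in rows + [(None, None, None, 0)]: # sentinel flushes the tail
        key = (a, b, c)
        if prev is not None:
            for lvl in (3, 2, 1):                     # flush deepest groups first
                if key[:lvl] != prev[:lvl]:
                    out.append((prev[:lvl], counters[lvl]))
                    counters[lvl] = 0
        for lvl in range(4):
            counters[lvl] += m
        prev = key
    out.append(((), counters[0]))                     # the (*,*,*) aggregate
    print(out)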


Example


Location Year Color Amount Vancouver 2015 Yellow 300 Victoria 2014 Red 400 Seattle 2015 Green 120 Vancouver 2014 Green 260 Seattle 2015 Red 160 Vancouver 2014 Yellow 280 Vancouver 2015 Red 160

Iceberg threshold: SUM(Amount) >= 300
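To make the iceberg computation concrete, the sketch below brute-forces all eight cuboids of this toy table and keeps only the cells passing the threshold. It is a teaching aid only; BUC itself avoids enumerating every cuboid by pruning with the monotonic condition:

```python
from collections import defaultdict
from itertools import combinations

rows = [
    ("Vancouver", 2015, "Yellow", 300), ("Victoria", 2014, "Red", 400),
    ("Seattle", 2015, "Green", 120), ("Vancouver", 2014, "Green", 260),
    ("Seattle", 2015, "Red", 160), ("Vancouver", 2014, "Yellow", 280),
    ("Vancouver", 2015, "Red", 160),
]
THRESHOLD = 300

iceberg = {}
for k in range(4):                        # choose which dimensions stay grouped
    for dims in combinations(range(3), k):
        sums = defaultdict(int)
        for r in rows:
            cell = tuple(r[i] if i in dims else "*" for i in range(3))
            sums[cell] += r[3]            # SUM(Amount) per cell
        iceberg.update({c: s for c, s in sums.items() if s >= THRESHOLD})

for cell, s in sorted(iceberg.items(), key=str):
    print(cell, s)                        # e.g. ('Vancouver', '*', '*') 1000
```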


Example: Sorting on Location


Location    Year   Color    Amount
Seattle     2015   Green    120
Seattle     2015   Red      160
Vancouver   2015   Yellow   300
Vancouver   2014   Yellow   280
Vancouver   2015   Red      160
Vancouver   2014   Green    260
Victoria    2014   Red      400

Sum(Seattle, *, *) = 280 ✗
Sum(Vancouver, *, *) = 1000 ✓
Sum(Victoria, *, *) = 400 ✓


Sorting on Year for Vancouver


Location    Year   Color    Amount
Seattle     2015   Green    120
Seattle     2015   Red      160
Vancouver   2014   Yellow   280
Vancouver   2014   Green    260
Vancouver   2015   Yellow   300
Vancouver   2015   Red      160
Victoria    2014   Red      400

Sum(Vancouver, 2014, *) = 540 ✓
Sum(Vancouver, 2015, *) = 460 ✓


Color on Vancouver & 2014/2015


Location    Year   Color    Amount
Seattle     2015   Green    120
Seattle     2015   Red      160
Vancouver   2014   Green    260
Vancouver   2014   Yellow   280
Vancouver   2015   Red      160
Vancouver   2015   Yellow   300
Victoria    2014   Red      400

Sum(Vancouver, 2014, Yellow) = 280 ✗
Sum(Vancouver, 2014, Green) = 260 ✗
Sum(Vancouver, 2015, Yellow) = 300 ✓
Sum(Vancouver, 2015, Red) = 160 ✗


Sort on Color for Vancouver


Location    Year   Color    Amount
Seattle     2015   Green    120
Seattle     2015   Red      160
Vancouver   2014   Green    260
Vancouver   2015   Red      160
Vancouver   2014   Yellow   280
Vancouver   2015   Yellow   300
Victoria    2014   Red      400

Sum(Vancouver, *, Green) = 260 ✗
Sum(Vancouver, *, Red) = 160 ✗
Sum(Vancouver, *, Yellow) = 580 ✓


How to Sort the Base Table?

•  General sorting in main memory: O(n log n)

•  Counting in main memory: O(n), linear in the number of tuples in the base table
– How to sort 1 million integers in the range 0 to 100?
– Set up 101 counters (one per value) and initialize them to 0
– Scan the integers once, counting the occurrences of each value
– Scan the integers again, putting each integer in its right place (see the sketch below)
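A minimal counting-sort sketch of this idea, assuming integer values in a known small range:

```python
def counting_sort(values, lo=0, hi=100):
    """Sort integers in [lo, hi] in O(n + k) time using one counter per value."""
    counts = [0] * (hi - lo + 1)
    for v in values:                       # first scan: count occurrences
        counts[v - lo] += 1
    out = []
    for v, c in enumerate(counts):         # second pass: emit each value c times
        out.extend([v + lo] * c)
    return out

print(counting_sort([5, 3, 100, 0, 3]))    # [0, 3, 3, 5, 100]
```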


Pushing Monotonic Conditions

•  BUC searches the aggregates bottom-up in a depth-first manner

•  The descendants of the current node are expanded only when the monotonic condition holds at that node


Clustering


Community Detection


http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811


Customer Relationship Management

•  Partitioning customers into groups such that customers within a group are similar in some aspects

•  A manager can be assigned to a group

•  Customized products and services can be developed


What Is Clustering?

•  Group data into clusters
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
– Unsupervised learning: no predefined classes

[Figure: a 2-D point set forming Cluster 1 and Cluster 2, with a few outliers]


Requirements of Clustering

•  Scalability
•  Ability to deal with various types of attributes
•  Discovery of clusters with arbitrary shape
•  Minimal requirements for domain knowledge to determine input parameters


Data Matrix

•  For memory-based clustering
– Also called object-by-variable structure

•  Represents n objects with p variables (attributes, measures)
– A relational table

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$


Dissimilarity Matrix

•  For memory-based clustering
– Also called object-by-object structure
– Proximities of pairs of objects
– d(i, j): dissimilarity between objects i and j
– Nonnegative
– Close to 0: similar

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$


How Good Is Clustering?

•  Dissimilarity/similarity depends on the distance function
– Different applications have different functions

•  Judgment of clustering quality is typically highly subjective


Types of Data in Clustering

•  Interval-scaled variables
•  Binary variables
•  Nominal, ordinal, and ratio variables
•  Variables of mixed types


Interval-valued Variables

•  Continuous measurements on a roughly linear scale
– Weight, height, latitude and longitude coordinates, temperature, etc.

•  Effect of measurement units on attributes
– Smaller unit → larger variable range → larger effect on the result
– Remedy: standardization + background knowledge


Standardization

•  Calculate the mean absolute deviation

•  Calculate the standardized measurement (z-score)

•  The mean absolute deviation is more robust than the standard deviation
– The effect of outliers is reduced but remains detectable

$$m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$$

$$s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$$

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
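A small Python sketch of this standardization (the function name is an assumption of this sketch):

```python
def z_scores(column):
    """Standardize one attribute f using the mean absolute deviation."""
    n = len(column)
    m = sum(column) / n                           # mean m_f
    s = sum(abs(x - m) for x in column) / n       # mean absolute deviation s_f
    return [(x - m) / s for x in column]

print(z_scores([170.0, 180.0, 165.0, 185.0]))     # heights standardized
```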


Similarity and Dissimilarity

•  Distances are the normally used measures

•  Minkowski distance: a generalization
– If q = 2, d is the Euclidean distance
– If q = 1, d is the Manhattan distance
– If q = ∞, d is the Chebyshev distance

•  Weighted distance

$$d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q \right)^{1/q} \quad (q > 0)$$

$$d(i,j) = \left( w_1|x_{i1}-x_{j1}|^q + w_2|x_{i2}-x_{j2}|^q + \cdots + w_p|x_{ip}-x_{jp}|^q \right)^{1/q} \quad (q > 0)$$
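The three special cases can be checked with a few lines of Python (a sketch; a vectorized library such as scipy.spatial.distance would be used in practice):

```python
def minkowski(x, y, q=2.0):
    """Minkowski distance: q=1 Manhattan, q=2 Euclidean, q=inf Chebyshev."""
    if q == float("inf"):                          # limit case: max coordinate gap
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

print(minkowski((0, 0), (3, 4)))                   # 5.0 (Euclidean)
print(minkowski((0, 0), (3, 4), q=1))              # 7.0 (Manhattan)
print(minkowski((0, 0), (3, 4), q=float("inf")))   # 4.0 (Chebyshev)
```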


Manhattan and Chebyshev Distance

[Figures: Manhattan distance (picture from Wikipedia) and Chebyshev distance (http://brainking.com/images/rules/chess/02.gif); in two dimensions, the Chebyshev distance is the chessboard (king-move) distance]


Properties of Minkowski Distance

•  Nonnegative: d(i,j) ≥ 0
•  The distance of an object to itself is 0: d(i,i) = 0
•  Symmetric: d(i,j) = d(j,i)
•  Triangle inequality: d(i,j) ≤ d(i,k) + d(k,j)

[Figure: a triangle on vertices i, j, k illustrating the triangle inequality]


Binary Variables

•  A contingency table for binary data

•  Symmetric variable: each state carries the same weight
– Invariant similarity

•  Asymmetric variable: the positive value carries more weight
– Noninvariant similarity (Jaccard)

              Object j
               1     0     Sum
Object i   1   q     r     q+r
           0   s     t     s+t
         Sum  q+s   r+t     p

Symmetric (invariant): $$d(i,j) = \frac{r+s}{q+r+s+t}$$

Asymmetric (Jaccard): $$d(i,j) = \frac{r+s}{q+r+s}$$


Nominal Variables

•  A generalization of the binary variable: it can take more than two states, e.g., red, yellow, blue, green

•  Method 1: simple matching
– m: # of matches, p: total # of variables

$$d(i,j) = \frac{p - m}{p}$$

•  Method 2: use a large number of binary variables
– Create a new binary variable for each of the M nominal states


Ordinal Variables

•  An ordinal variable can be discrete or continuous

•  Order is important, e.g., rank

•  Can be treated like an interval-scaled variable
– Replace x_if by its rank $r_{if} \in \{1, \ldots, M_f\}$
– Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

– Compute the dissimilarity using methods for interval-scaled variables


Ratio-scaled Variables

•  Ratio-scaled variable: a positive measurement on a nonlinear scale
– E.g., approximately exponential, such as $Ae^{Bt}$

•  Treat them like interval-scaled variables?
– Not a good choice: the scale can be distorted!

•  Apply a logarithmic transformation, $y_{if} = \log(x_{if})$

•  Or treat them as continuous ordinal data and treat their ranks as interval-scaled


Variables of Mixed Types

•  A database may contain all six types of variables
– Symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio

•  One may use a weighted formula to combine their effects

$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$


Clustering Methods

•  K-means and partitioning methods
•  Hierarchical clustering
•  Density-based clustering
•  Grid-based clustering
•  Pattern-based clustering
•  Other clustering methods


Partitioning Algorithms: Ideas

•  Partition n objects into k clusters
– Optimize the chosen partitioning criterion

•  Global optimum: examine all possible partitions
– $(k^n - (k-1)^n - \cdots - 1)$ possible partitions, too expensive!

•  Heuristic methods: k-means and k-medoids
– K-means: each cluster is represented by its center
– K-medoids or PAM (partitioning around medoids): each cluster is represented by one of the objects in the cluster


K-means

•  Arbitrarily choose k objects as the initial cluster centers

•  Until no change, do
– (Re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster
– Update the cluster means, i.e., calculate the mean value of the objects in each cluster (see the sketch below)
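A runnable sketch of this loop on points represented as tuples (plain Python for illustration; production code would use a vectorized implementation such as scikit-learn's KMeans):

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's k-means; returns (centers, assignment)."""
    centers = random.Random(seed).sample(points, k)    # arbitrary initial centers
    assign = []
    for _ in range(iters):
        # (Re)assign each object to the nearest current center
        assign = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        # Update each center to the mean of the objects assigned to it
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centers[j])            # keep an empty cluster's center
        if new_centers == centers:                     # converged: no change
            break
        centers = new_centers
    return centers, assign
```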


K-Means: Example

[Figure: K-means with K = 2 on a 2-D point set: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; keep updating and reassigning until no change]


Pros and Cons of K-means

•  Relatively efficient: O(tkn)
– n: # objects, k: # clusters, t: # iterations; normally k, t ≪ n

•  Often terminates at a local optimum

•  Applicable only when the mean is defined
– What about categorical data?

•  Need to specify the number of clusters

•  Unable to handle noisy data and outliers

•  Unsuitable for discovering non-convex clusters


Variations of the K-means

•  Aspects of variation
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means

•  Handling categorical data: k-modes
– Use the mode instead of the mean (mode: the most frequent item(s))
– For a mixture of categorical and numerical data: the k-prototype method

•  EM (expectation maximization): assign each object to each cluster with a probability (will be discussed later)


A Problem of K-means

•  Sensitive to outliers
– Outlier: objects with extremely large values

•  Outliers may substantially distort the distribution of the data

•  K-medoids: represent each cluster by its most centrally located object

[Figure: a 2-D point set where an outlier pulls the cluster mean (+) away from the natural center of its cluster]


PAM: A K-medoids Method

•  PAM: Partitioning Around Medoids

•  Arbitrarily choose k objects as the initial medoids

•  Until no change, do
– (Re)assign each object to the cluster of its nearest medoid
– Randomly select a non-medoid object o’, and compute the total cost S of swapping a medoid o with o’
– If S < 0, swap o with o’ to form the new set of k medoids


Swapping Cost

•  Measure whether o’ is better than o as a medoid

•  Use the squared-error criterion

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, o_i)^2$$

– Compute $E_{o'} - E_o$
– Negative: swapping brings benefit
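A sketch of the swapping test (dist is any dissimilarity function; the helper names are assumptions of this sketch):

```python
def total_cost(points, medoids, dist):
    """Squared-error E: each point is charged its squared distance
    to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) ** 2 for p in points)

def swap_gain(points, medoids, o, o_prime, dist):
    """E_{o'} - E_o; a negative value means the swap improves the clustering."""
    swapped = [o_prime if m == o else m for m in medoids]
    return total_cost(points, swapped, dist) - total_cost(points, medoids, dist)
```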


PAM: Example

[Figure: PAM with K = 2 on a 2-D point set: arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object O_random and compute the total cost of swapping (total cost = 26); swap O and O_random if quality is improved; loop until no change]


Pros and Cons of PAM

•  PAM is more robust than k-means in the presence of noise and outliers
– Medoids are less influenced by outliers

•  PAM is efficient for small data sets but does not scale well to large data sets
– O(k(n−k)²) per iteration


Hierarchy

•  An arrangement or classification of things according to inclusiveness

•  A natural way of abstraction, summarization, compression, and simplification for understanding

•  Typical setting: organize a given set of objects into a hierarchy
– No or very little supervision
– Some heuristic guidance on the quality of the hierarchy


Hierarchical Clustering

•  Group data objects into a tree of clusters
•  Top-down versus bottom-up

[Figure: agglomerative clustering (AGNES) merges objects a, b, c, d, e step by step ({a,b} and {d,e} first, then {c,d,e}, finally {a,b,c,d,e}); divisive clustering (DIANA) runs the same steps in reverse]


AGNES (Agglomerative Nesting)

•  Initially, each object is a cluster

•  Step-by-step cluster merging, until all objects form one cluster
– Single-link approach
– Each cluster is represented by all of its objects
– The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters


Dendrogram

•  Shows how clusters are merged hierarchically

•  Decomposes the data objects into a multi-level nested partitioning (a tree of clusters)

•  A clustering of the data objects: cut the dendrogram at the desired level
– Each connected component forms a cluster


DIANA (Divisive ANAlysis)

•  Initially, all objects are in one cluster

•  Step-by-step splitting of clusters until each cluster contains only one object

[Figure: DIANA progressively splitting a 2-D point set into smaller clusters]


Distance Measures

•  Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$

•  Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$

•  Mean distance: $d_{\mathrm{mean}}(C_i, C_j) = d(m_i, m_j)$

•  Average distance: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$

m: the mean of a cluster; C: a cluster; n: the number of objects in a cluster


Challenges

•  Hard to choose merge/split points
– Merging/splitting can never be undone
– Merging/splitting decisions are critical

•  High complexity: O(n²)

•  Integrating hierarchical clustering with other techniques
– BIRCH, CURE, CHAMELEON, ROCK


BIRCH

•  Balanced Iterative Reducing and Clustering using Hierarchies

•  CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
– Clustering the objects → clustering the leaf nodes of the CF tree


Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)
– N: the number of data points
– LS: the linear sum of the N points, $\sum_{i=1}^{N} o_i$
– SS: the square sum of the N points, $\sum_{i=1}^{N} o_i^2$

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))
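A sketch of computing a CF vector and of the additivity property used when merging clusters (the function names are assumptions of this sketch):

```python
def cf(points):
    """Clustering feature (N, LS, SS) of a set of points."""
    n = len(points)
    ls = tuple(sum(col) for col in zip(*points))                 # linear sum
    ss = tuple(sum(x * x for x in col) for col in zip(*points))  # square sum
    return n, ls, ss

def cf_merge(a, b):
    """Additivity: the CF of a merged cluster is the component-wise sum."""
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

print(cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))  # (5, (16, 30), (54, 190))
```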


CF-tree in BIRCH

•  Clustering features
– Summarize the statistics for a cluster
– Many cluster quality measures (e.g., radius, distance) can be derived
– Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

•  A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
– A nonleaf node in the tree has descendants or “children”
– The nonleaf nodes store the sums of the CFs of their children


CF Tree

[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6; the root and nonleaf nodes hold entries CF_i with child_i pointers, and leaf nodes hold CF entries chained by prev/next pointers]


Parameters of a CF-tree

•  Branching factor: the maximum number of children of a node

•  Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes


BIRCH Clustering

•  Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)

•  Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree


Pros & Cons of BIRCH

•  Linear scalability
– Good clustering with a single scan
– Quality can be further improved by a few additional scans

•  Can handle only numeric data

•  Sensitive to the order of the data records


Distance-based Methods: Drawbacks

•  Hard to find clusters with irregular shapes
•  Hard to specify the number of clusters
•  Heuristic: a cluster must be dense


How to Find Irregular Clusters?

•  Divide the whole space into many small areas
– The density of an area can be estimated
– Areas may or may not be exclusive
– A dense area is likely part of a cluster

•  Start from a dense area, traverse connected dense areas, and discover clusters of irregular shape


Directly Density Reachable

•  Parameters
– Eps: the maximum radius of the neighborhood
– MinPts: the minimum number of points in an Eps-neighborhood of a point
– $N_{Eps}(p) = \{q \mid dist(p,q) \le Eps\}$

•  Core object p: $|N_{Eps}(p)| \ge MinPts$
– A core object is in a dense area

•  Point q is directly density-reachable from p iff $q \in N_{Eps}(p)$ and p is a core object

[Figure: points p and q with MinPts = 3, Eps = 1 cm]


Density-Based Clustering

•  Density-reachable
– If p1 → p2, p2 → p3, …, p_{n−1} → p_n are each directly density-reachable steps, then p_n is density-reachable from p1

•  Density-connected
– If points p and q are both density-reachable from some object o, then p and q are density-connected

[Figure: a chain of points from p1 to q illustrating density-reachability, and points p, q both reachable from o illustrating density-connectivity]


DBSCAN

•  A cluster: a maximal set of density-connected points
– Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points under Eps = 1 cm, MinPts = 5]


DBSCAN: the Algorithm

•  Arbitrarily select a point p

•  Retrieve all points density-reachable from p w.r.t. Eps and MinPts

•  If p is a core point, a cluster is formed

•  If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database

•  Continue the process until all of the points have been processed (see the sketch below)


Challenges for DBSCAN

•  Different clusters may have very different densities

•  Clusters may form hierarchies


Biclustering

•  Clustering both objects and attributes simultaneously

•  Four requirements
– Only a small set of objects participates in a cluster (bicluster)
– A bicluster involves only a small number of attributes
– An object may participate in multiple biclusters or in no bicluster
– An attribute may be involved in multiple biclusters or in no bicluster


Application Examples

•  Recommender systems
– Objects: users
– Attributes: items
– Values: user ratings

•  Microarray data
– Objects: genes
– Attributes: samples/conditions
– Values: expression levels

[Figure: a gene × sample/condition matrix W = [w_ij] with n genes and m samples]


Biclusters with Constant Values


Let A = {a1, …, an} be a set of genes, B = {b1, …, bm} a set of conditions, and E = [e_ij] a gene-condition matrix; a submatrix I × J is defined by a subset I ⊆ A of genes and a subset J ⊆ B of conditions.

•  The simplest case: I × J is a bicluster with constant values if e_ij = c for all i ∈ I, j ∈ J
– e.g., {a1, a33, a86} × {b6, b12, b36, b99} with every entry 60

•  A bicluster with constant values on rows satisfies e_ij = c + α_i, where α_i is the adjustment for row i
– e.g., the rows (10,10,10,10,10), (20,20,20,20,20), (50,50,50,50,50), (0,0,0,0,0)

•  Symmetrically, a bicluster with constant values on columns satisfies e_ij = c + β_j


Biclusters with Coherent Values

•  Also known as pattern-based clusters

•  A submatrix I × J has coherent values if e_ij = c + α_i + β_j, where α_i and β_j are the adjustments for row i and column j
– Equivalently: $e_{i_1 j_1} - e_{i_2 j_1} = e_{i_1 j_2} - e_{i_2 j_2}$ for any $i_1, i_2 \in I$ and $j_1, j_2 \in J$


Biclusters with Coherent Evolutions

•  Only up- or down-regulated changes over rows or columns


•  A bicluster with coherent evolutions on rows is a submatrix I × J such that $(e_{i_1 j_1} - e_{i_1 j_2})(e_{i_2 j_1} - e_{i_2 j_2}) \ge 0$ for any $i_1, i_2 \in I$ and $j_1, j_2 \in J$; biclusters with coherent evolutions on columns are defined symmetrically

•  In real data, perfect biclusters rarely exist: random noise affects the readings e_ij

•  Two major types of mining methods
– Optimization-based methods (e.g., the δ-Cluster algorithm): iteratively and greedily search for the submatrix with the highest significance score
– Enumeration methods (e.g., MaPle): use a tolerance threshold for the allowed noise and enumerate all qualifying submatrices

[Figure: a bicluster with coherent evolutions on rows]


Differences from Subspace Clustering

•  Subspace clustering uses a global distance/similarity measure

•  Pattern-based clustering looks at patterns

•  A subspace cluster according to a globally defined similarity measure may not follow the same pattern


Objects Follow the Same Pattern?

[Figure: two objects (blue and green) plotted on attributes D1 and D2; pScore measures how far their changes across D1 and D2 differ]

The smaller the pScore, the more consistent the two objects


Pattern-based Clusters

•  pScore: the similarity between two objects r_x, r_y on two attributes a_u, a_v

$$\mathrm{pScore}\left(\begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix}\right) = \left| (r_x.a_u - r_y.a_u) - (r_x.a_v - r_y.a_v) \right|$$

•  δ-pCluster (R, D): for any objects $r_x, r_y \in R$ and any attributes $a_u, a_v \in D$,

$$\mathrm{pScore}\left(\begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix}\right) \le \delta \quad (\delta \ge 0)$$
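A direct transcription of these definitions (objects represented as dicts mapping attribute names to values; the names are assumptions of this sketch):

```python
from itertools import combinations

def p_score(rx, ry, au, av):
    """pScore of objects rx, ry on attributes au, av."""
    return abs((rx[au] - ry[au]) - (rx[av] - ry[av]))

def is_delta_pcluster(R, D, delta):
    """(R, D) is a delta-pCluster iff every pair of objects in R and
    every pair of attributes in D has pScore <= delta."""
    return all(p_score(rx, ry, au, av) <= delta
               for rx, ry in combinations(R, 2)
               for au, av in combinations(D, 2))
```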


Maximal pCluster

•  If (R, D) is a δ-pCluster, then every sub-cluster (R’, D’) with R’ ⊆ R and D’ ⊆ D is a δ-pCluster
– An anti-monotonic property
– A large pCluster is accompanied by many small pClusters, which is inefficient!

•  Idea: mine only the maximal pClusters
– A δ-pCluster is maximal if there exists no proper super-cluster that is also a δ-pCluster


Mining Maximal pClusters

•  Given
– A cluster threshold δ
– An attribute threshold min_a
– An object threshold min_o

•  Task: mine the complete set of significant maximal δ-pClusters
– A significant δ-pCluster has at least min_o objects on at least min_a attributes


Grid-based Clustering Methods

•  Ideas
– Use multi-resolution grid data structures
– Use dense grid cells to form clusters

•  Several interesting methods
– CLIQUE
– STING
– WaveCluster


CLIQUE

•  Clustering In QUEst

•  Automatically identifies subspaces of a high-dimensional data space in which clusters exist

•  Both density-based and grid-based


CLIQUE: the Ideas

•  Partition each dimension into the same number of equal-length intervals
– This partitions an m-dimensional data space into non-overlapping rectangular units

•  A unit is dense if the number of data points in it exceeds a threshold

•  A cluster is a maximal set of connected dense units within a subspace


CLIQUE: the Method

•  Partition the data space and count the points in each cell of the partition
– Apriori property: a k-d cell cannot be dense if one of its (k−1)-d projections is not dense

•  Identify clusters
– Determine dense units in all subspaces of interest, and connected dense units in all subspaces of interest

•  Generate a minimal description for the clusters
– Determine the minimal cover for each cluster (see the sketch below)


CLIQUE: An Example

[Figure: dense units in the (age, salary) and (age, vacation) grids (salary in units of $10,000, vacation in weeks, age from 20 to 60) combine to form candidate dense units in the 3-D (age, salary, vacation) subspace]


CLIQUE: Pros and Cons

•  Automatically finds the subspaces of the highest dimensionality containing high-density clusters

•  Insensitive to the order of input
– Does not presume any canonical data distribution

•  Scales linearly with the size of the input

•  Scales well with the number of dimensions

•  The clustering result may be degraded at the expense of the simplicity of the method


Bad Cases for CLIQUE

[Figures: two bad cases for CLIQUE: parts of a cluster may be missed, and a cluster from CLIQUE may contain noise]


Fuzzy Clustering

•  Each point x_i takes a probability w_ij to belong to a cluster C_j

•  Requirements
– For each point x_i: $\sum_{j=1}^{k} w_{ij} = 1$
– For each cluster C_j: $0 < \sum_{i=1}^{m} w_{ij} < m$


Fuzzy C-Means (FCM)

Select an initial fuzzy pseudo-partition, i.e., assign values to all the w_ij
Repeat
    Compute the centroid of each cluster using the fuzzy pseudo-partition
    Recompute the fuzzy pseudo-partition, i.e., the w_ij
Until the centroids do not change (or the change is below some threshold)


Critical Details

•  Optimization on the sum of the squared error (SSE):

$$SSE(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^p \, dist(x_i, c_j)^2$$

•  Computing centroids:

$$c_j = \sum_{i=1}^{m} w_{ij}^p x_i \Big/ \sum_{i=1}^{m} w_{ij}^p$$

•  Updating the fuzzy pseudo-partition:

$$w_{ij} = \frac{\left( 1/dist(x_i, c_j)^2 \right)^{\frac{1}{p-1}}}{\sum_{q=1}^{k} \left( 1/dist(x_i, c_q)^2 \right)^{\frac{1}{p-1}}}$$

– When p = 2:

$$w_{ij} = \frac{1/dist(x_i, c_j)^2}{\sum_{q=1}^{k} 1/dist(x_i, c_q)^2}$$


Choice of p

•  When p → 1, FCM behaves like traditional k-means

•  When p is larger, the cluster centroids approach the global centroid of all data points

•  The partition becomes fuzzier as p increases


Effectiveness


Is a Clustering Good?

•  Feasibility
– Applying any clustering method to a uniformly distributed data set is meaningless

•  Quality
– Do the clustering results meet the users’ interest?
– Clustering patients into clusters corresponding to various diseases or sub-phenotypes is meaningful
– Clustering patients into clusters corresponding to male or female is not meaningful


Major Tasks

•  Assessing clustering tendency
– Are there non-random structures in the data?

•  Determining the number of clusters or other critical parameters

•  Measuring clustering quality


Uniformly Distributed Data

•  Clustering uniformly distributed data is meaningless

•  A uniformly distributed data set is generated by a uniform data distribution

[Figure 10.21 from the textbook: a data set uniformly distributed in 2-D space; a clustering algorithm may still artificially partition the points into groups, but the groups are unlikely to mean anything significant]


Hopkins Statistic

•  Hypothesis: the data is generated by a uniform distribution in a space

•  Sample n points, p1, …, pn, uniformly from the space containing D

•  For each point p_i, find the nearest neighbor of p_i in D, and let x_i be the distance between p_i and its nearest neighbor in D:

$$x_i = \min_{v \in D} \{ dist(p_i, v) \}$$


Hopkins Statistic

•  Sample n points, q1, …, qn, uniformly from D

•  For each q_i, find the nearest neighbor of q_i in D − {q_i}, and let y_i be the distance between q_i and its nearest neighbor in D − {q_i}:

$$y_i = \min_{v \in D,\, v \ne q_i} \{ dist(q_i, v) \}$$

•  Calculate the Hopkins statistic H:

$$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$$
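A sketch of the statistic (the caller supplies dist and a sample_space function that draws a uniform random point from the space containing D; both names are assumptions of this sketch):

```python
import random

def hopkins(D, n, dist, sample_space, seed=0):
    """Hopkins statistic of the data set D (a list of points)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        p = sample_space(rng)                    # uniform point from the space
        xs.append(min(dist(p, v) for v in D))    # x_i: nearest neighbor in D
        q = rng.choice(D)                        # point drawn from D itself
        ys.append(min(dist(q, v)                 # y_i: nearest neighbor in D - {q}
                      for v in D if v is not q))
    return sum(ys) / (sum(xs) + sum(ys))
```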


Explanation

•  If D is uniformly distributed, then $\sum_{i=1}^{n} y_i$ and $\sum_{i=1}^{n} x_i$ would be close to each other, and thus H would be around 0.5

•  If D is skewed, then $\sum_{i=1}^{n} y_i$ would be substantially smaller, and thus H would be close to 0

•  If H > 0.5, then it is unlikely that D has statistically significant clusters


Finding the Number of Clusters

•  Depends on many factors
– The shape and scale of the distribution in the data set
– The clustering resolution required by the user

•  Many methods exist
– Rule of thumb: set $k = \sqrt{n/2}$, so that each cluster has $\sqrt{2n}$ points on average
– Plot the sum of within-cluster variances against k, and find the first (or the most significant) turning point


A Cross-Validation Method

•  Divide the data set D into m parts

•  Use m − 1 parts to find a clustering

•  Use the remaining part as the test set to test the quality of the clustering
– For each point in the test set, find the closest centroid or cluster center
– Use the squared distances between all points in the test set and their corresponding centroids to measure how well the clustering model fits the test set

•  Repeat m times for each value of k, and use the average as the quality measure


Measuring Clustering Quality

•  Ground truth: the ideal clustering determined by human experts

•  Two situations
– A ground truth is known: extrinsic (supervised) methods compare the clustering against the ground truth
– The ground truth is unavailable: intrinsic (unsupervised) methods measure how well the clusters are separated


Quality in Extrinsic Methods

•  Cluster homogeneity: the purer the clusters in a clustering, the better the clustering

•  Cluster completeness: objects in the same cluster in the ground truth should be clustered together

•  Rag bag: putting a heterogeneous object into a pure cluster is worse than putting it into a rag bag

•  Small cluster preservation: splitting a small cluster in the ground truth into pieces is worse than splitting a bigger one


BCubed Precision and Recall

•  D = {o1, …, on}
   – L(oi) is the category of oi given by the ground truth

•  C is a clustering on D
   – C(oi) is the cluster-id of oi in C

•  For two objects oi and oj, the correctness is 1 if L(oi) = L(oj) ⇔ C(oi) = C(oj), and 0 otherwise

BCubed Precision and Recall

•  Precision

$$\text{BCubed precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{o_j:\, i \neq j,\, C(o_i) = C(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, C(o_i) = C(o_j)\}\|}$$

•  Recall

$$\text{BCubed recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{o_j:\, i \neq j,\, L(o_i) = L(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, L(o_i) = L(o_j)\}\|}$$

From the course textbook (Chapter 10, Cluster Analysis: Basic Concepts and Methods, Section 10.6, Evaluation of Clustering):

… one, denoted by o, belong to the same category according to ground truth. Consider a clustering C2 identical to C1 except that o is assigned to a cluster C′ ≠ C in C2 such that C′ contains objects from various categories according to ground truth, and thus is noisy. In other words, C′ in C2 is a rag bag. Then, a clustering quality measure Q respecting the rag bag criterion should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).

•  Small cluster preservation. If a small category is split into small pieces in a clustering, those small pieces may likely become noise and thus the small category cannot be discovered from the clustering. The small cluster preservation criterion states that splitting a small category into pieces is more harmful than splitting a large category into pieces. Consider an extreme case. Let D be a data set of n + 2 objects such that, according to the ground truth, n objects, denoted by o1, . . . , on, belong to one category and the other 2 objects, denoted by on+1, on+2, belong to another category. Suppose clustering C1 has three clusters, C1 = {o1, . . . , on}, C2 = {on+1}, and C3 = {on+2}. Let clustering C2 have three clusters, too, namely C1 = {o1, . . . , on−1}, C2 = {on}, and C3 = {on+1, on+2}. In other words, C1 splits the small category and C2 splits the big category. A clustering quality measure Q preserving small clusters should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).

Many clustering quality measures satisfy some of the above four criteria. Here, we introduce the BCubed precision and recall metrics, which satisfy all of the above criteria.

BCubed evaluates the precision and recall for every object in a clustering on a given data set according to the ground truth. The precision of an object indicates how many other objects in the same cluster belong to the same category as the object. The recall of an object reflects how many objects of the same category are assigned to the same cluster.

Formally, let D = {o1, . . . , on} be a set of objects, and C be a clustering on D. Let L(oi) (1 ≤ i ≤ n) be the category of oi given by ground truth, and C(oi) be the cluster ID of oi in C. Then, for two objects, oi and oj (1 ≤ i, j ≤ n, i ≠ j), the correctness of the relation between oi and oj in clustering C is given by

$$Correctness(o_i, o_j) = \begin{cases} 1 & \text{if } L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j) \\ 0 & \text{otherwise.} \end{cases} \tag{10.28}$$

BCubed precision is defined as

$$\text{Precision BCubed} = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \neq j,\, C(o_i)=C(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, C(o_i) = C(o_j)\}\|}}{n}. \tag{10.29}$$

BCubed recall is defined as

$$\text{Recall BCubed} = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \neq j,\, L(o_i)=L(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, L(o_i) = L(o_j)\}\|}}{n}. \tag{10.30}$$

Intrinsic Methods

When the ground truth of a data set is not available, we have to use an intrinsic method to assess the clustering quality. In general, intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are. Many intrinsic methods take advantage of a similarity metric between objects in the data set.

The silhouette coefficient is such a measure. For a data set D of n objects, suppose D is partitioned into k clusters, C1, . . . , Ck. For each object o ∈ D, we calculate a(o) as the average distance between o and all other objects in the cluster to which o belongs. Similarly, b(o) is the minimum average distance from o to all clusters to which o does not belong. Formally, suppose o ∈ Ci (1 ≤ i ≤ k); then

$$a(o) = \frac{\sum_{o' \in C_i,\, o \neq o'} dist(o, o')}{|C_i| - 1} \tag{10.31}$$

and

$$b(o) = \min_{C_j:\, 1 \leq j \leq k,\, j \neq i} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}. \tag{10.32}$$

The silhouette coefficient of o is then defined as

$$s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}. \tag{10.33}$$

The value of the silhouette coefficient is between −1 and 1. The value of a(o) reflects the compactness of the cluster to which o belongs. The smaller the value is, the more compact the cluster is. The value of b(o) captures the degree to which o is separated from other clusters. The larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient value of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case. However, when the silhouette coefficient value is negative (that is, b(o) < a(o)), this means that, in expectation, o is closer to the objects in another cluster than to the objects in the same cluster as o. In many cases, this is a bad case, and should be avoided.

To measure the fitness of a cluster within a clustering, we can compute the average silhouette coefficient value of all objects in the cluster. To measure the quality of a clustering, we can use the average silhouette coefficient value of all objects in the data set.
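A minimal sketch of BCubed precision and recall as defined in (10.29) and (10.30), in plain Python over label lists; singleton clusters and categories, where the definition gives 0/0, contribute 0 here, which is an assumption of the sketch.

def bcubed(truth, clustering):
    # truth[i] = L(o_i), the ground-truth category; clustering[i] = C(o_i)
    n = len(truth)
    def correct(i, j):
        return (truth[i] == truth[j]) == (clustering[i] == clustering[j])
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if j != i and clustering[j] == clustering[i]]
        same_category = [j for j in range(n) if j != i and truth[j] == truth[i]]
        if same_cluster:
            precision += sum(correct(i, j) for j in same_cluster) / len(same_cluster)
        if same_category:
            recall += sum(correct(i, j) for j in same_category) / len(same_category)
    return precision / n, recall / n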


Silhouette Coefficient

•  No ground truth is assumed
•  Suppose a data set D of n objects is partitioned into k clusters, C1, …, Ck
•  For each object o,
   – Calculate a(o), the average distance between o and every other object in the same cluster – compactness of a cluster: the smaller, the better
   – Calculate b(o), the minimum average distance from o to all objects in a cluster that o does not belong to – degree of separation from other clusters: the larger, the better

Silhouette Coefficient

•  Then

$$a(o) = \frac{\sum_{o' \in C_i,\, o' \neq o} dist(o, o')}{|C_i| - 1} \qquad b(o) = \min_{C_j:\, o \notin C_j} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\} \qquad s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$$

•  Use the average silhouette coefficient of all objects as the overall measure
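A minimal NumPy sketch of these three formulas (brute-force pairwise distances, so only suitable for small data sets; every cluster is assumed to have at least two objects):

import numpy as np

def silhouette(D, labels):
    # D: n x d data matrix; labels: cluster id per object
    n = len(D)
    labels = np.asarray(labels)
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)  # all pairwise distances
    scores = []
    for o in range(n):
        in_cluster = (labels == labels[o]) & (np.arange(n) != o)
        a = dist[o, in_cluster].mean()                    # compactness: within-cluster
        b = min(dist[o, labels == c].mean()               # separation: closest other cluster
                for c in np.unique(labels) if c != labels[o])
        scores.append((b - a) / max(a, b))
    return np.mean(scores)                                # overall clustering quality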

Classification


Classification and Prediction

•  Classification: predict categorical class labels
   – Build a model for a set of classes/concepts
   – Example: classify loan applications (approve/decline)

•  Prediction: model continuous-valued functions
   – Example: predict the economic growth in 2015


Classification: A 2-step Process

•  Model construction: describe a set of predetermined classes
   – Training dataset: tuples for model construction
      •  Each tuple/sample belongs to a predefined class
   – The model: classification rules, decision trees, or math formulae

•  Model application: classify unseen objects
   – Estimate the accuracy of the model using an independent test set
   – Acceptable accuracy → apply the model to classify tuples with unknown class labels

Model Construction

Training Data → Classification Algorithms → Classifier (Model)

Name   Rank        Years   Tenured
Mike   Ass. Prof   3       No
Mary   Ass. Prof   7       Yes
Bill   Prof        2       Yes
Jim    Asso. Prof  7       Yes
Dave   Ass. Prof   6       No
Anne   Asso. Prof  3       No

Resulting model:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Model Application

The classifier is applied to the testing data (below), and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?

Name     Rank        Years   Tenured
Tom      Ass. Prof   2       No
Merlisa  Asso. Prof  7       No
George   Prof        5       Yes
Joseph   Ass. Prof   7       Yes


Supervised/Unsupervised Learning

•  Supervised learning (classification)
   – Supervision: objects in the training data set have labels
   – New data is classified based on the training set

•  Unsupervised learning (clustering)
   – The class labels of training data are unknown
   – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


Data Preparation

•  Data cleaning
   – Preprocess data in order to reduce noise and handle missing values

•  Relevance analysis (feature selection)
   – Remove the irrelevant or redundant attributes

•  Data transformation
   – Generalize and/or normalize data


Measurements of Quality

•  Prediction accuracy
•  Speed and scalability
   – Construction speed and application speed
•  Robustness: handle noise and missing values
•  Scalability: build the model for large training data sets
•  Interpretability: understandability of models


Decision Tree Induction

•  Decision tree representation
•  Construction of a decision tree
•  Inductive bias and overfitting
•  Scalable enhancements for large databases


Decision Tree

•  A node in the tree – a test of some attribute
•  A branch: a possible value of the attribute
•  Classification
   – Start at the root
   – Test the attribute
   – Move down the tree branch

Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes

Training Dataset

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No


Appropriate Problems

•  Instances are represented by attribute-value pairs
   – Extensions of decision trees can handle real-valued attributes
•  Disjunctive descriptions may be required
•  The training data may contain errors or missing values


Basic Algorithm ID3

•  Construct a tree in a top-down recursive divide-and-conquer manner
   – Which attribute is the best at the current node?
   – Create a node for each possible attribute value
   – Partition training data into descendant nodes

•  Conditions for stopping recursion
   – All samples at a given node belong to the same class
   – No attribute remains for further partitioning
      •  Majority voting is employed for classifying the leaf
   – There is no sample at the node


Which Attribute Is the Best?

•  The attribute most useful for classifying examples
•  Information gain and gini index
   – Statistical properties
   – Measure how well an attribute separates the training examples


Entropy

•  Measure homogeneity of examples

$$Entropy(S) \equiv \sum_{i=1}^{c} -p_i \log_2 p_i$$

   – S is the training data set, and pi is the proportion of S belonging to class i

•  The smaller the entropy, the purer the data set


Information Gain

•  The expected reduction in entropy caused by partitioning the examples according to an attribute

$$Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$

Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v

Example

(Using the PlayTennis training dataset above: 9 Yes and 5 No; Wind = Weak in 8 examples with 6 Yes, Wind = Strong in 6 examples with 3 Yes)

$$Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$$

$$Gain(S, Wind) = Entropy(S) - \sum_{v \in \{Weak, Strong\}} \frac{|S_v|}{|S|} Entropy(S_v) = 0.94 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.00 = 0.048$$
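A minimal sketch in plain Python that reproduces these numbers from the Wind column of the table above:

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = sum over classes of -p_i * log2(p_i)
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(attr_values, labels):
    # Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)
    n = len(labels)
    g = entropy(labels)
    for v in set(attr_values):
        subset = [lab for a, lab in zip(attr_values, labels) if a == v]
        g -= len(subset) / n * entropy(subset)
    return g

wind = ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
        'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
print(round(entropy(play), 2), round(gain(wind, play), 3))  # 0.94 0.048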

Hypothesis Space Search in Decision Tree Building

•  Hypothesis space: the set of possible decision trees
•  ID3: simple-to-complex, hill-climbing search
   – Evaluation function: information gain


Capabilities and Limitations

•  The hypothesis space is complete
•  Maintains only a single current hypothesis
•  No backtracking
   – May converge to a locally optimal solution
•  Uses all training examples at each step
   – Makes statistics-based decisions
   – Not sensitive to errors in individual examples


Natural Bias

•  The information gain measure favors attributes with many values
•  An extreme example
   – Attribute "date" may have the highest information gain
   – A very broad decision tree of depth one
   – Inapplicable to any future data


Alternative Measures

•  Gain ratio: penalize attributes like date by incorporating split information

$$SplitInformation(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

•  Split information is sensitive to how broadly and uniformly the attribute splits the data

$$GainRatio(S, A) \equiv \frac{Gain(S, A)}{SplitInformation(S, A)}$$

•  Gain ratio can be undefined or very large
   – Only test attributes with above-average gain


Measuring Inequality

Lorenz Curve
   – X-axis: quintiles
   – Y-axis: cumulative share of income earned by the plotted quintile
   – Gap between the actual line and the line of perfect equality: the degree of inequality

Gini index
   – Gini = 0: even distribution
   – Gini = 1: perfectly unequal
   – The greater the gap, the more unequal the distribution


Gini Index (Adjusted)

•  A data set S contains examples from n classes
   – pj is the relative frequency of class j in S

$$gini(S) = 1 - \sum_{j=1}^{n} p_j^2$$

•  A data set S is split into two subsets S1 and S2 with sizes N1 and N2 respectively

$$gini_{split}(S) = \frac{N_1}{N} gini(S_1) + \frac{N_2}{N} gini(S_2)$$

•  The attribute providing the smallest gini_split(S) is chosen to split the node
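A minimal sketch of these two formulas in plain Python, using the Wind split of the PlayTennis labels as the example:

from collections import Counter

def gini(labels):
    # gini(S) = 1 - sum over classes of p_j^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    # weighted gini of a binary split into S1 (left) and S2 (right)
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Wind = Weak (6 Yes, 2 No) vs. Wind = Strong (3 Yes, 3 No):
print(gini_split(['Yes'] * 6 + ['No'] * 2, ['Yes'] * 3 + ['No'] * 3))  # ~0.43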


Extracting Classification Rules

•  Classification rules can be extracted from a decision tree
•  Each path from the root to a leaf → an IF-THEN rule
   – All attribute-value pairs along a path form a conjunctive condition
   – The leaf node holds the class prediction
   – IF age = “<=30” AND student = “no” THEN buys_computer = “no”
•  Rules are easy to understand


Inductive Bias

•  The set of assumptions that, together with the training data, deductively justifies the classification of future instances
   – Preferences of the classifier construction
•  Shorter trees are preferred over longer trees
•  Trees that place high information gain attributes close to the root are preferred


Why Prefer Short Trees?

•  Occam’s razor: prefer the simplest hypothesis that fits the data
   – “One should not increase, beyond what is necessary, the number of entities required to explain anything”
   – Also known as the principle of parsimony
•  There are fewer short trees than long trees
•  A short tree is less likely to be a statistical coincidence


Overfitting

•  A decision tree T may overfit the training data
   – if there exists an alternative tree T′ such that T has a higher accuracy than T′ over the training examples, but T′ has a higher accuracy than T over the entire distribution of data

•  Why overfitting?
   – Noise in the data
   – Bias in the training data


The Evaluation Issues

•  The accuracy of a classifier can be evaluated using a test data set
   – The test set is a part of the available labeled data set
•  But how can we evaluate the accuracy of a classification method?
   – A classification method can generate many classifiers
•  What if the available labeled data set is too small?


Holdout Method

•  Partition the available labeled data set into two disjoint subsets: the training set and the test set
   – 50-50
   – 2/3 for training and 1/3 for testing
•  Build a classifier using the training set
•  Evaluate the accuracy using the test set


Limitations of Holdout Method

•  Fewer labeled examples for training
•  The classifier highly depends on the composition of the training and test sets
   – The smaller the training set, the larger the variance
•  If the test set is too small, the evaluation is not reliable
•  The training and test sets are not independent


Cross-Validation

•  Each record is used the same number of times for training and exactly once for testing
•  K-fold cross-validation
   – Partition the data into k equal-sized subsets
   – In each round, use one subset as the test set, and use the remaining subsets together as the training set
   – Repeat k times
   – The total error is the sum of the errors in the k rounds
•  Leave-one-out: k = n
   – Utilizes as much data as possible for training
   – Computationally expensive
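A minimal scikit-learn sketch of k-fold cross-validation; the decision-tree classifier and the iris data are just example choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# 10-fold cross-validation: each record is tested exactly once
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())  # average accuracy over the 10 rounds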


Accuracy Can Be Misleading …

•  Consider a data set of 99% of the negative class and 1% of the positive class
•  A classifier that predicts everything negative has an accuracy of 99%, though it does not work for the positive class at all!
•  Imbalanced class distributions are common in many applications
   – Medical applications, fraud detection, …


Performance Evaluation Matrix

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL  Class=Yes    a (TP)      b (FN)
CLASS   Class=No     c (FP)      d (TN)

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$

Confusion matrix (contingency table, error matrix): used for imbalanced class distributions


Performance Evaluation Matrix

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL  Class=Yes    a (TP)      b (FN)
CLASS   Class=No     c (FP)      d (TN)

True positive rate (TPR, sensitivity) = TP / (TP + FN)
True negative rate (TNR, specificity) = TN / (TN + FP)
False positive rate (FPR) = FP / (TN + FP)
False negative rate (FNR) = FN / (TP + FN)


Recall and Precision

•  Target class is more important than the other classes

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL  Class=Yes    a (TP)      b (FN)
CLASS   Class=No     c (FP)      d (TN)

Precision p = TP / (TP + FP)
Recall r = TP / (TP + FN)


Fallout

•  Type I errors – false positives: a negative object is classified as positive
   – Fallout: the type I error rate, FP / (FP + TN)
•  Type II errors – false negatives: a positive object is classified as negative
   – Captured by recall


Fβ Measure

•  How can we summarize precision and recall into one metric?
   – Using the harmonic mean of the two

•  F-measure (F)

$$F = \frac{2rp}{r + p} = \frac{2TP}{2TP + FP + FN}$$

•  Fβ measure

$$F_\beta = \frac{(\beta^2 + 1)rp}{r + \beta^2 p} = \frac{(\beta^2 + 1)TP}{(\beta^2 + 1)TP + \beta^2 FN + FP}$$

   – β = 0: Fβ is the precision
   – β = ∞: Fβ is the recall
   – 0 < β < ∞: Fβ is a tradeoff between the precision and the recall
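A minimal sketch computing these quantities from confusion-matrix counts in plain Python (the toy counts in the example are made up for illustration):

def metrics(tp, fn, fp, tn, beta=1.0):
    # precision, recall, and F_beta from confusion-matrix counts
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f_beta = (beta**2 + 1) * r * p / (r + beta**2 * p)
    return p, r, f_beta

# e.g., a 99% accurate classifier that finds only half of a rare positive class
print(metrics(tp=5, fn=5, fp=5, tn=985))  # (0.5, 0.5, 0.5)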


Weighted Accuracy

•  A more general metric

$$\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$$

Measure     w1       w2   w3   w4
Recall      1        1    0    0
Precision   1        0    1    0
Fβ          β² + 1   β²   1    0
Accuracy    1        1    1    1


ROC Curve

•  Receiver Operating Characteristic (ROC)
   – Consider a 1-dimensional data set containing 2 classes; any point located at x > t is classified as positive

ROC Curve

(TPR, FPR):
•  (0,0): declare everything to be the negative class
•  (1,1): declare everything to be the positive class
•  (1,0): ideal
•  Diagonal line: random guessing
   – Below the diagonal line: prediction is opposite of the true class

Figure from [Tan, Steinbach, Kumar]
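A minimal sketch that traces the ROC curve by sweeping the threshold t over classifier scores, using scikit-learn's roc_curve (the toy labels and scores are made up for illustration):

import numpy as np
from sklearn.metrics import auc, roc_curve

# toy 1-d data set: higher score = more likely positive
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.55, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one point per threshold t
print(auc(fpr, tpr))  # area under the ROC curve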


Comparing Two Classifiers

Figure from [Tan, Steinbach, Kumar]



Cost-Sensitive Learning

•  In some applications, misclassifying some classes may be disastrous
   – Tumor detection, fraud detection

•  Using a cost matrix

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL  Class=Yes    -1          100
CLASS   Class=No     1           0

Sampling for Imbalanced Classes

•  Consider a data set containing 100 positive examples and 1,000 negative examples
•  Undersampling: use a random sample of 100 negative examples and all positive examples
   – Some useful negative examples may be lost
   – Run undersampling multiple times, and use the ensemble of multiple base classifiers
   – Focused undersampling: remove negative samples that are not useful for classification, e.g., those far away from the decision boundary


Oversampling

•  Replicate the positive examples until the training set has an equal number of positive and negative examples
•  For noisy data, may cause overfitting


Errors in Classification

•  Bias: the difference between the real class boundary and the decision boundary of a classification model
•  Variance: variability in the training data set
•  Intrinsic noise in the target class: the target class can be non-deterministic – instances with the same attribute values can have different class labels


One or More?

•  What if a medical doctor is not sure about a case?
   – Joint diagnosis: using a group of doctors carrying different expertise
   – Wisdom from the crowd is often more accurate

•  All eager learning methods make predictions using a single classifier induced from training data
   – A single classifier may have low confidence in some cases

•  Ensemble methods: construct a set of base classifiers and take a vote on predictions in classification

Ensemble Classifiers

•  Step 1: Create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D
•  Step 2: Build multiple classifiers C1, C2, …, Ct-1, Ct
•  Step 3: Combine classifiers: C*(x) = Vote(C1(x), …, Ct(x))

Figure from [Tan, Steinbach, Kumar]


Why May Ensemble Method Work?

•  Suppose there are two classes and each base classifier has an error rate of 35%
•  What if we use 25 base classifiers?
   – If all base classifiers are identical, the ensemble error rate is still 35%
   – If the base classifiers are independent, the ensemble makes a wrong prediction only if more than half of the base classifiers are wrong:

$$\sum_{i=13}^{25} \binom{25}{i} 0.35^i \, 0.65^{25-i} = 0.06$$
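The sum is easy to check in plain Python:

from math import comb

eps = 0.35  # error rate of each independent base classifier
ensemble_error = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i)
                     for i in range(13, 26))
print(round(ensemble_error, 2))  # 0.06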


Ensemble Error Rate

Figure from [Tan, Steinbach, Kumar]



Ensemble Classifiers – When?

•  The base classifiers should be independent of each other
•  Each base classifier should do better than a classifier that performs random guessing


How to Construct Ensemble?

•  Manipulating the training set: derive multiple training sets and build a base classifier on each
•  Manipulating the input features: use only a subset of features in a base classifier
•  Manipulating the class labels: if there are many classes, in a classifier, randomly divide the classes into two subsets A and B; for a test case, if a base classifier predicts its class as A, all classes in A receive a vote
•  Manipulating the learning algorithm, e.g., using different network configurations in an ANN


Bootstrap

•  Given an original training set T, derive a training set T′ by repeatedly and uniformly sampling with replacement
•  If T has n tuples, each tuple has a probability p = 1 - (1 - 1/n)^n of being selected in T′
   – When n → ∞, p → 1 - 1/e ≈ 0.632
•  Use the tuples not in T′ as the test set


Bootstrap

•  Use a bootstrap sample as the training set, and use the tuples not in the training set as the test set
•  .632 bootstrap: compute the overall accuracy by combining the accuracies of each bootstrap sample with the accuracy computed from a classifier using the whole data set as the training set

$$acc_{bootstrap} = \frac{1}{k} \sum_{i=1}^{k} \left( 0.632 \times acc_i + 0.368 \times acc_{all} \right)$$
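A minimal sketch of the .632 bootstrap; the decision-tree classifier is an example choice, and acc_all is the accuracy of a classifier trained and tested on the whole data set, as described above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_632(X, y, k=30, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(y)
    # accuracy of a classifier using the whole data set as the training set
    acc_all = DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)
    accs = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)        # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # tuples not selected = test set
        m = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        accs.append(0.632 * m.score(X[oob], y[oob]) + 0.368 * acc_all)
    return np.mean(accs)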

Bagging

•  Run bootstrap k times to obtain k base classifiers
•  A test instance is assigned to the class that receives the highest number of votes
•  Strength: reduces the variance of base classifiers – good for unstable base classifiers
   – Unstable classifiers: sensitive to minor perturbations in the training set, e.g., decision trees, associative classifiers, and ANNs
•  For stable classifiers (e.g., linear discriminant analysis and kNN classifiers), bagging may even degrade the performance since the training sets are smaller
•  Less overfitting on noisy data

Boosting

•  Assign a weight to each training example
   – Initially, each example is assigned a weight 1/n
•  Weights can be used in one of the following ways
   – Weights as a sampling distribution to draw a set of bootstrap samples from the original training set
   – Weights used by a base classifier to learn a model biased towards heavier examples
•  Adaptively change the weights at the end of each boosting round
   – The weight of an example correctly classified decreases
   – The weight of an example incorrectly classified increases
•  Each round generates a base classifier


Critical Design Choices in Boosting

•  How are the weights of the training examples updated at the end of each boosting round?
•  How are the predictions made by the base classifiers combined?


AdaBoost

•  Each base classifier carries an importance score related to its error rate
   – Error rate of base classifier Ci:

$$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, I\big(C_i(x_j) \neq y_j\big)$$

   – wj: weight of example j; I(p) = 1 if p is true
   – Importance score:

$$\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)$$


How Does Importance Score Work?



Weight Adjustment in AdaBoost

   – If any intermediate round generates an error rate of more than 50%, the weights are reverted back to 1/n

•  Weight update, where Zj is the normalization factor:

$$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} e^{-\alpha_j} & \text{if } C_j(x_i) = y_i \\ e^{\alpha_j} & \text{if } C_j(x_i) \neq y_i \end{cases}$$

•  The ensemble error rate is bounded:

$$e_{ensemble} \leq \prod_i \sqrt{\epsilon_i (1 - \epsilon_i)}$$

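A minimal NumPy sketch of one boosting round under these formulas; labels in {-1, +1} and example weights normalized to sum to 1 are assumptions of the sketch:

import numpy as np

def adaboost_round(w, y_true, y_pred):
    # w: current example weights (sum to 1); y_true, y_pred: arrays in {-1, +1}
    # Assumes 0 < eps < 0.5 so that alpha is well defined and positive.
    wrong = (y_pred != y_true)
    eps = np.sum(w * wrong)                        # weighted error rate
    alpha = 0.5 * np.log((1 - eps) / eps)          # importance score
    w_new = w * np.exp(np.where(wrong, alpha, -alpha))
    return alpha, w_new / w_new.sum()              # normalization plays the Z_j role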


Intuition – Bayesian Classification

•  There are more hockey fans in Canada than in the US
   – Which country is Tom, a hockey fan, from?
   – Predicting Canada has a better chance of being right
•  Prior probability P(Canadian) = 5%: reflects the background knowledge that 5% of the total population is Canadian
•  P(hockey fan | Canadian) = 30%: the probability that a Canadian is also a hockey fan
•  Posterior probability P(Canadian | hockey fan): the probability that a hockey fan is from Canada


Bayes Theorem

•  Find the maximum a posteriori (MAP) hypothesis

$$P(h|D) = \frac{P(D|h)P(h)}{P(D)}$$

$$h_{MAP} \equiv \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)P(h)$$

   – Requires background knowledge
   – Computational cost


Naïve Bayes Classifier

•  Assumption: attributes are independent
•  Given a tuple (a1, a2, …, an), predict its class as

$$C = \arg\max_{C_i} P(C_i) P(a_1, a_2, \ldots, a_n \mid C_i) = \arg\max_{C_i} P(C_i) \prod_j P(a_j \mid C_i)$$

   – $\arg\max_x f(x)$: the value of x that maximizes f(x)
•  Example: $\arg\max_{x \in \{-1, 2, 3\}} x^2 = 3$


Example: Training Dataset

Data sample X = (Outlook = sunny, Temp = mild, Humid = high, Wind = weak)
Will she play tennis? Yes

(Training dataset: the PlayTennis table above)

P(Yes|X) = P(X|Yes) P(Yes) = 0.014
P(No|X) = P(X|No) P(No) = 0.007

Probability of Infrequent Values

•  (outlook = Sunny, temp = high, humid = low, wind = weak)?

•  P(humid = low) = 0

   – Since no training example has Humid = low, the estimated P(humid = low | class) is 0, and the whole naïve Bayes product becomes 0

(Training dataset: the PlayTennis table above)

Smoothing

•  Suppose an attribute has n different values: a1, …, an
•  Assume a small enough value ε > 0
•  Let Pi be the frequency of ai:
   Pi = (# tuples having ai) / (total # of tuples)
•  Estimate

$$P(a_i) = \epsilon + (1 - n\epsilon) P_i$$
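A minimal sketch of this estimate in plain Python; note that every value, including unseen ones, receives probability at least ε, and the estimates still sum to 1:

from collections import Counter

def smoothed_probs(values, domain, eps=1e-3):
    # P(a_i) = eps + (1 - n*eps) * P_i, over the n values in `domain`
    counts = Counter(values)
    total, n = len(values), len(domain)
    return {a: eps + (1 - n * eps) * counts[a] / total for a in domain}

humid = ['High'] * 7 + ['Normal'] * 7      # no 'Low' observed
print(smoothed_probs(humid, ['High', 'Normal', 'Low']))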

Characteristics of Naïve Bayes

•  Robust to isolated noise points
   – Such points are averaged out in probability computation
•  Insensitive to missing values
•  Robust to irrelevant attributes
   – Distributions on such attributes are almost uniform
•  Correlated attributes degrade the performance

Bayes Error Rate

•  The error rate of the ideal naïve Bayes classifier

$$Err = \int_0^{x} P(\text{Crocodile} \mid X)\, dX + \int_{x}^{1} P(\text{Alligator} \mid X)\, dX$$


Pros and Cons

•  Pros
   – Easy to implement
   – Good results obtained in many cases
•  Cons
   – A (too) strong assumption: independent attributes
•  How to handle dependent/correlated attributes?
   – Bayesian belief networks


Associative Classification

•  Mine possible association rules in the form of condset → c
   – Condset: a set of attribute-value pairs
   – c: class label
•  Build classifier
   – Organize rules according to decreasing precedence based on confidence and support
•  Classification
   – Use the first matching rule to classify an unknown case


Associative Classification Methods

•  CBA (Classification By Association: Liu, Hsu & Ma, KDD’98)
   – Mine possible association rules in the form of
      •  Cond-set (a set of attribute-value pairs) → class label
   – Build classifier: organize rules according to decreasing precedence based on confidence and then support
•  CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
   – Classification: statistical analysis on multiple rules


Instance-based Methods

•  Instance-based learning
   – Store training examples and delay the processing until a new instance must be classified (“lazy evaluation”)
•  Typical approaches
   – K-nearest neighbor approach
      •  Instances represented as points in a Euclidean space
   – Locally weighted regression
      •  Construct a local approximation
   – Case-based reasoning
      •  Use symbolic representations and knowledge-based inference


The K-Nearest Neighbor Method

•  Instances are points in an n-D space
•  The k-nearest neighbors (KNN) in the Euclidean distance
   – Return the most common value among the k training examples nearest to the query point xq
•  Discrete-/real-valued target functions

(Figure: a query point xq surrounded by positive and negative training examples)


KNN Methods

•  For continuous-valued target functions, return the mean value of the k nearest neighbors
•  Distance-weighted nearest neighbor algorithm
   – Give greater weights to closer neighbors, e.g.,

$$w \equiv \frac{1}{d(x_q, x_i)^2}$$

•  Robust to noisy data by averaging the k nearest neighbors
•  Curse of dimensionality
   – Distance could be dominated by irrelevant attributes
   – Remedies: stretch the axes or eliminate the least relevant attributes
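A minimal NumPy sketch of distance-weighted KNN classification with the weight above; k and the tie-breaking details are assumptions of the sketch:

import numpy as np
from collections import defaultdict

def knn_predict(X, y, xq, k=3):
    # distance-weighted k-nearest-neighbor vote, w = 1 / d(xq, xi)^2
    d = np.linalg.norm(X - xq, axis=1)
    nearest = np.argsort(d)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y[i]] += 1.0 / (d[i] ** 2 + 1e-12)  # guard against d = 0
    return max(votes, key=votes.get)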


Lazy vs. Eager Learning

•  Efficiency: lazy learning uses less training time but more predicting time
•  Accuracy
   – Lazy methods effectively use a richer hypothesis space
   – Eager methods must commit to a single hypothesis that covers the entire instance space

Outlier Detection


Motivation: Fraud Detection


http://i.imgur.com/ckkoAOp.gif


Techniques: Fraud Detection

•  Features
•  Dissimilarity
•  Groups and noise

http://i.stack.imgur.com/tRDGU.png



Outlier Analysis

•  “One person’s noise is another person’s signal”
•  Outliers: the objects considerably dissimilar from the remainder of the data
   – Examples: credit card fraud, Michael Jordan, intrusions, etc.
   – Applications: credit card fraud detection, telecom fraud detection, intrusion detection, customer segmentation, medical analysis, etc.

Outliers and Noise

•  Different from noise
   – Noise is random error or variance in a measured variable
•  Outliers are interesting: an outlier violates the mechanism that generates the normal data
•  Outlier detection vs. novelty detection
   – At an early stage, novelties may be regarded as outliers
   – But they are later merged into the model

Types of Outliers

•  Three kinds: global, contextual and collective outliers
   – A data set may have multiple types of outliers
   – One object may belong to more than one type of outlier
•  Global outlier (or point anomaly)
   – An object that significantly deviates from the rest of the data set
   – Challenge: find an appropriate measurement of deviation

Contextual Outliers

•  An outlier object deviates significantly based on a selected context
   – Ex.: Is 10°C in Vancouver an outlier? (depends on summer or winter)
•  Attributes of data objects should be divided into two groups
   – Contextual attributes: define the context, e.g., time & location
   – Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
•  A generalization of local outliers – objects whose density significantly deviates from their local area
•  Challenge: how to define or formulate a meaningful context?

Collective Outliers

•  A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers
   – Application example: intrusion detection, when a number of computers keep sending denial-of-service packets to each other
•  Detection of collective outliers
   – Consider not only the behavior of individual objects, but also that of groups of objects
   – Need background knowledge on the relationship among data objects, such as a distance or similarity measure on objects

Outlier Detection: Challenges

•  Modeling normal objects and outliers properly
   – Hard to enumerate all possible normal behaviors in an application
   – The border between normal and outlier objects is often a gray area
•  Application-specific outlier detection
   – The choice of distance measure among objects and the model of relationship among objects are often application-dependent
   – Example: in clinical data, a small deviation could be an outlier, while in marketing analysis, larger fluctuations are required

Outlier Detection: Challenges

•  Handling noise in outlier detection
   – Noise may distort the normal objects and blur the distinction between normal objects and outliers
   – Noise may help hide outliers and reduce the effectiveness of outlier detection
•  Understandability
   – Understand why these are outliers: justification of the detection
   – Specify the degree of an outlier: the unlikelihood of the object being generated by a normal mechanism

Outlier Detection Methods

•  Whether user-labeled examples of outliers can be obtained
   – Supervised, semi-supervised, and unsupervised methods
•  Assumptions about normal data and outliers
   – Statistical, proximity-based, and clustering-based methods

Supervised Methods

•  Model outlier detection as a classification problem
   – Samples examined by domain experts are used for training & testing
•  Methods for learning a classifier for outlier detection effectively:
   – Model normal objects & report those not matching the model as outliers, or
   – Model outliers and treat those not matching the model as normal
•  Challenges
   – Imbalanced classes, i.e., outliers are rare: boost the outlier class and make up some artificial outliers
   – Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not mislabeling normal objects as outliers)

Unsupervised Methods

•  Assume the normal objects are somewhat “clustered” into multiple groups, each having some distinct features
•  An outlier is expected to be far away from any group of normal objects
•  Weakness: cannot detect collective outliers effectively
   – Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area
•  Many clustering methods can be adapted for unsupervised methods
   – Find clusters first, then outliers: objects not belonging to any cluster

Unsupervised Methods: Challenges

•  In some intrusion or virus detection, normal activities are diverse
   – Unsupervised methods may have a high false positive rate but still miss many real outliers
   – Supervised methods can be more effective, e.g., identify attacks on some key resources
•  Challenges
   – Hard to distinguish noise from outliers
   – Costly since clustering comes first: but there are far fewer outliers than normal objects
•  Newer methods: tackle outliers directly

Semi-Supervised Methods

•  In many applications, the number of labeled examples is often small
   – Labels could be on outliers only, normal objects only, or both
•  If some labeled normal objects are available
   – Use the labeled examples and the proximate unlabeled objects to train a model for normal objects
   – Those not fitting the model of normal objects are detected as outliers
•  If only some labeled outliers are available, a small number of labeled outliers may not cover the possible outliers well
   – To improve the quality of outlier detection, one can get help from models for normal objects learned from unsupervised methods

Pros and Cons

•  The effectiveness of statistical methods highly depends on whether the assumed statistical model holds for the real data
•  There are rich alternatives among statistical models
–  Parametric vs. non-parametric


Proximity-based Methods

•  An object is an outlier if its nearest neighbors are far away, i.e., the proximity of the object deviates significantly from the proximity of most of the other objects in the same data set


Pros and Cons

•  The effectiveness of proximity-based methods relies heavily on the proximity measure
•  In some applications, proximity or distance measures cannot be obtained easily
•  Often have difficulty identifying a group of outliers that stay close to each other
•  Two major types of proximity-based outlier detection methods
–  Distance-based vs. density-based


Clustering-based Methods

•  Normal data belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any cluster


Challenges

•  Since there are many clustering methods, there are many clustering-based outlier detection methods as well
•  Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale up well for large data sets


Statistical Outlier Analysis

•  Assumption: the objects in a data set are generated by a (stochastic) process (a generative model)
•  Learn a generative model fitting the given data set, and then identify the objects in low-probability regions of the model as outliers
•  Two categories: parametric versus non-parametric


Example

•  Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model
–  The data not following the model are outliers


Parametric Methods

•  Assumption: the normal data are generated by a parametric distribution with parameter θ
•  The probability density function of the parametric distribution, f(x | θ), gives the probability (density) that object x is generated by the distribution
•  The smaller this value, the more likely x is an outlier


Univariate Outliers Based on Normal Distribution

•  Taking derivatives of the log-likelihood with respect to μ and σ² and setting them to zero, we derive the following maximum likelihood estimates:


\[
\ln L(\mu, \sigma^2) = \sum_{i=1}^{n} \ln f(x_i \mid \mu, \sigma^2)
= -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2
\]

\[
\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2
\]
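These estimates turn directly into a detector: compute μ̂ and σ̂ from the data and flag objects outside μ̂ ± 3σ̂. A minimal NumPy sketch on synthetic data:

```python
# Univariate parametric detection under a normal assumption:
# maximum likelihood estimates followed by the 3-sigma rule.
import numpy as np

rng = np.random.default_rng(2)
x = np.append(rng.normal(29.0, 0.5, size=50), 24.0)   # one injected low value

mu_hat = x.mean()        # MLE of mu
sigma_hat = x.std()      # MLE of sigma (np.std divides by n by default)

z = (x - mu_hat) / sigma_hat
print("outliers:", x[np.abs(z) > 3])   # mu +/- 3 sigma covers ~99.7% of the data
```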


Example

•  Daily average temperatures: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
•  Since n = 10, μ̂ = 28.61 and σ̂ = √2.29 = 1.51
•  Then (24 − 28.61)/1.51 = −3.04 < −3, so 24 is an outlier, since μ ± 3σ contains 99.7% of the data


Grubbs' Test

•  Maximum normed residual test
•  For each object x in a data set, compute its z-score; x is an outlier if

\[
z \ge \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}
\]

where \(t_{\alpha/(2N),\,N-2}\) is the value taken by a t-distribution with N − 2 degrees of freedom at a significance level of α/(2N), and N is the number of objects in the data set.
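A minimal sketch of the test, assuming SciPy for the t-distribution quantile. It flags every value whose z-score exceeds the critical value; the classical procedure removes one extreme at a time and retests, which this sketch simplifies:

```python
# Grubbs' (maximum normed residual) test: compare each z-score against
# a critical value built from a t-distribution quantile.
import numpy as np
from scipy import stats

def grubbs_critical(n, alpha=0.05):
    t = stats.t.ppf(alpha / (2 * n), n - 2)   # t-value at significance alpha/(2N)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

x = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
z = np.abs(x - x.mean()) / x.std(ddof=1)      # z-scores with the sample std
g = grubbs_critical(len(x))
print("critical value:", round(g, 3))
print("flagged:", x[z > g])                   # flags the 24.0 reading
```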


Non-parametric Methods

•  Do not assume an a priori statistical model; instead, determine the model from the input data
–  Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance
•  Examples: histogram and kernel density estimation


Histogram

•  Example: a transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000 (see the sketch below)

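A histogram yields a simple non-parametric outlier score: the rarer an object's bin, the more outlying the object. A minimal sketch with synthetic transaction amounts:

```python
# Histogram-based scoring: objects falling into low-frequency bins
# receive high outlier scores.
import numpy as np

rng = np.random.default_rng(3)
amounts = np.append(rng.gamma(2.0, 300.0, size=5000), 7500.0)  # one huge transaction

counts, edges = np.histogram(amounts, bins=20)
bins = np.clip(np.digitize(amounts, edges) - 1, 0, len(counts) - 1)
freq = counts[bins] / len(amounts)    # fraction of the data in each object's bin

print("most outlying amount:", amounts[np.argmin(freq)])
```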

Challenges

•  Hard to choose an appropriate bin size for the histogram
–  Too small a bin size → normal objects fall into empty or rare bins: false positives
–  Too large a bin size → outliers fall into some frequent bins: false negatives


Proximity-based Outlier Detection

•  Objects far away from the others are outliers
•  The proximity of an outlier deviates significantly from that of most of the others in the data set
•  Distance-based outlier detection: an object o is an outlier if its neighborhood does not contain enough other points
•  Density-based outlier detection: an object o is an outlier if its density is relatively much lower than that of its neighbors


Depth-based Methods

•  Organize data objects in layers of various depths
–  The shallow layers are more likely to contain outliers
•  Examples: peeling, depth contours
•  Complexity O(N^⌈k/2⌉) for k-dimensional data sets
–  Unacceptable for k > 2


Depth-based Outliers: Example


Distance-based Outliers

•  A DB(p, D)-outlier is an object O in a data set T such that at least a fraction p of the objects in T lie at a distance greater than D from O (a nested-loop sketch follows below)
•  The larger D, the more outlying
•  The larger p, the more outlying
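A direct nested-loop sketch of the definition; it is quadratic in the number of objects, so practical implementations use index structures or pruning:

```python
# Nested-loop DB(p, D)-outlier detection: O is an outlier if at least
# a fraction p of the objects lie farther than D from O.
import numpy as np

def db_outliers(X, p=0.95, D=3.0):
    X = np.asarray(X, dtype=float)
    flags = []
    for o in X:
        dist = np.linalg.norm(X - o, axis=1)
        flags.append(np.mean(dist > D) >= p)   # fraction of objects beyond D
    return X[np.array(flags)]

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])
print(db_outliers(X))                          # reports the isolated point
```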


Density-based Local Outlier

Both o1 and o2 are outliers; distance-based methods can detect o1, but not o2.


Intuition

•  Compare outliers with their local neighborhoods instead of the global data distribution
•  The density around an outlier object is significantly different from the density around its neighbors
•  Use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier (see the LOF sketch below)

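The relative-density idea is what the Local Outlier Factor (LOF) computes. A minimal usage sketch with scikit-learn:

```python
# Local Outlier Factor: scores each object by its density relative to
# the densities of its k nearest neighbors; -1 marks an outlier.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.2, size=(100, 2)),   # dense cluster
               rng.normal(5, 1.5, size=(100, 2)),   # sparse cluster
               [[0.0, 1.5]]])                       # outlying only locally

labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
print(X[labels == -1])                              # -1 = outlier, 1 = inlier
```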

Classification-based Outlier Detection

•  Train a classification model that can distinguish "normal" data from outliers
•  A brute-force approach: consider a training set that contains some samples labeled "normal" and others labeled "outlier"
–  A training set in practice is typically heavily biased: the number of "normal" samples likely far exceeds the number of outlier samples
–  Cannot detect unseen anomalies


One-Class Model

•  A classifier is built to describe only the normal class
•  Learn the decision boundary of the normal class using classification methods such as SVM (see the sketch below)
•  Any samples that do not belong to the normal class (i.e., fall outside the decision boundary) are declared outliers
•  Advantage: can detect new outliers that do not appear close to any outlier objects in the training set
•  Extension: normal objects may belong to multiple classes

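A minimal sketch of the one-class idea, assuming scikit-learn's OneClassSVM: train on normal data only, then flag anything outside the learned boundary:

```python
# One-class model: learn the boundary of the normal class alone;
# samples falling outside that boundary are declared outliers.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X_normal = rng.normal(0, 1, size=(300, 2))   # training data: normal class only

ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_normal)

X_new = np.array([[0.2, -0.1],               # near the normal class
                  [6.0, 6.0]])               # a new, unseen kind of outlier
print(ocsvm.predict(X_new))                  # 1 = normal, -1 = outlier
```

Because only the normal boundary is learned, the point at (6, 6) is caught even though nothing like it appeared during training, which is exactly the advantage noted above.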

Semi-Supervised Learning Methods

•  Combine classification-based and clustering-based methods (see the sketch below)
•  Method
–  Use a clustering-based approach to find a large cluster C and a small cluster C1
–  Since some objects in C carry the label "normal", treat all objects in C as normal
–  Use the one-class model of this cluster to identify normal objects in outlier detection
–  Since some objects in cluster C1 carry the label "outlier", declare all objects in C1 as outliers
–  Any object that does not fall into the model for C is considered an outlier as well

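A minimal sketch of this combination under the stated assumptions: cluster the data, treat the large cluster that holds the "normal" labels as all normal, fit a one-class model to it, and declare the small cluster plus everything outside the model as outliers. Cluster counts and thresholds are illustrative:

```python
# Semi-supervised detection: propagate "normal" through the large
# cluster C, model C with a one-class SVM, flag C1 and non-fitting objects.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),     # large cluster C
               rng.normal(8, 0.3, size=(10, 2))])   # small cluster C1

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
large = np.argmax(np.bincount(labels))              # C: assumed to carry "normal" labels

model_C = OneClassSVM(nu=0.05).fit(X[labels == large])
is_outlier = model_C.predict(X) == -1               # objects not fitting C's model
is_outlier |= labels != large                       # all of C1 declared outliers
print(int(is_outlier.sum()), "objects flagged")
```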

Pros and Cons

•  Pros: outlier detection is fast
•  Cons: quality heavily depends on the availability and quality of the training set
–  It is often difficult to obtain representative and high-quality training data


Contextual Outliers

•  An outlier object deviates significantly with respect to a selected context
–  Example: is 10 °C in Vancouver an outlier? (It depends on whether it is summer or winter.)
•  Attributes of data objects should be divided into two groups
–  Contextual attributes: define the context, e.g., time and location
–  Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
•  A generalization of local outliers, whose density significantly deviates from that of their local area
•  Challenge: how to define or formulate a meaningful context?


Detection of Contextual Outliers

•  If the contexts can be clearly identified, transform the problem into conventional outlier detection
–  Identify the context of the object using the contextual attributes
–  Calculate the outlier score for the object within its context using a conventional outlier detection method


Example

•  Detect outlier customers in the context of customer groups
–  Contextual attributes: age group, postal code
–  Behavioral attributes: the number of transactions per year, annual total transaction amount
•  Method (see the sketch below)
–  Locate customer c's context
–  Compare c with the other customers in the same group
–  Use a conventional outlier detection method

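A minimal pandas sketch of this recipe: group customers by a contextual attribute, then apply a conventional z-score detector to a behavioral attribute within each group. The column names and the threshold are hypothetical:

```python
# Contextual outliers via explicit contexts: per-group z-scores of a
# behavioral attribute, with groups defined by a contextual attribute.
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
df = pd.DataFrame({
    "age_group": rng.choice(["18-30", "31-50", "51+"], size=300),  # contextual
    "annual_amount": rng.normal(5000, 800, size=300),              # behavioral
})
df.loc[0, "annual_amount"] = 30000     # an injected contextual outlier

grouped = df.groupby("age_group")["annual_amount"]
z = (df["annual_amount"] - grouped.transform("mean")) / grouped.transform("std")
print(df[z.abs() > 3])                 # conventional 3-sigma rule, per context
```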

Modeling Normal Behavior

•  Model the "normal" behavior with respect to contexts
–  Use a training data set to train a model that predicts the expected behavioral attribute values with respect to the contextual attribute values
–  An object is a contextual outlier if its behavioral attribute values significantly deviate from the values predicted by the model
•  Use a prediction model to link the contexts and the behavior (see the sketch below)
–  Avoids explicit identification of specific contexts
–  Some possible methods: regression, Markov models, and finite state automata

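A minimal sketch of the prediction-model approach: regress the behavioral attribute on the contextual attributes and flag large residuals. A linear model stands in here for whatever predictor suits the application:

```python
# Contextual outliers via a prediction model: learn the expected
# behavior from contextual attributes; large residuals are outliers.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)
context = rng.uniform(0, 12, size=(300, 1))          # contextual attribute
behavior = 20 + 2 * context[:, 0] + rng.normal(0, 1, 300)
behavior[0] += 15                                    # injected contextual outlier

model = LinearRegression().fit(context, behavior)
residual = behavior - model.predict(context)
print(np.flatnonzero(np.abs(residual) > 3 * residual.std()))  # flagged indices
```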

Collective Outliers

•  Objects that, as a group, deviate significantly from the entire data set
•  Examine the structure of the data set, i.e., the relationships among multiple data objects
–  The structures are often not explicitly defined and have to be discovered as part of the outlier detection process


Detecting High-Dimensional Outliers

•  Interpretability of outliers
–  Which subspaces manifest the outliers, or an assessment of the "outlying-ness" of the objects
•  Data sparsity: data in high-dimensional spaces are often sparse
–  The distance between objects becomes heavily dominated by noise as the dimensionality increases
•  Data subspaces
–  Local behavior and patterns of data
•  Scalability with respect to dimensionality
–  The number of subspaces increases exponentially


Angle-based Outliers
