19
Investigation of sub-patterns discovery and its applications By: Xun Lu Supervisor: Jiuyong Li

Investigation of sub-patterns discovery and its applications By: Xun Lu Supervisor: Jiuyong Li

Embed Size (px)

Citation preview

Investigation of sub-patterns discovery and its applications

By: Xun LuSupervisor: Jiuyong Li

Content

• Overview• Brief Introduction• Basic Definitions• STUCCO Algorithm• MORE Algorithm

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

3

Overview of this research study

• To examine and differentiate various kinds of contrast patterns

• The scope of this research is in an attempt to understand the principles and algorithms involved in sub-patterns, i.e. contrast sets discovery

• This thesis, ultimately, is trying to adopt the techniques applied in STUCCO to improve the efficiency of MORE algorithm.

Content

• Overview• Brief Introduction• Basic Definitions• STUCCO Algorithm• MORE Algorithm

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

5

What is Contrast data mining?

• Contrast – “To compare or appraise in respect to differences” (Merriam Webster Dictionary)

• Contrast data mining – The mining of patterns and models contrasting two or more classes/conditions.

6

Why Contrast data mining?

“Sometimes it's good to contrast what you like with something else. It makes you appreciate it even more”

Darby Conley, Get Fuzzy, 2001

Content

• Overview• Brief Introduction• Basic Definitions• STUCCO Algorithm• MORE Algorithm

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

Some definitions for STUCCOContrast set: a conjunction of attribute-value pairs

defined on groups with no attribute occurring more than once.

Support of a cset: the ratio of the record number containing cset to the number of all records in the data set. supp(cset) ≈ prob(cset)

Group: cset with the same prefix are placed in one group

Upper bound: the support of an itemset consisting of the head of the group and one item

Lower bound: the support of an itemset consisting all the items the group

Some definitions for MORE

Contingency table Relative Risk

Present and Absent can be treated as class labels (head/prefix) whereas Smoking and Non-Smoking can be seen as the rest of elements of a contrast-set

Risk Disease Status

Present Absent

Smoking a b

Non-Smoking c d

bc

ad

dcc

baa

p

pRR

osednon

osed

)(

)(

exp

exp

Content

• Overview• Brief Introduction• Basic Definitions• STUCCO Algorithm• MORE Algorithm

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

11

STUCCO

• Search Testing for Understandable Consistent Contrast

• Developed by Bay and Pazzani. It aims to efficiently mine all the contrast sets which are significant and large, without predefined support thresholds

• It defines the support by finding the maximum difference between upper bound and lower bound within a group.

][][max,

max jLiUjiij

STUCCO pruning strategies

• Effective size pruning• this equation ensures effect size pruning by

pruning the cset with the upper bound below

• Statistical significance pruning• Chi-square • Alternative techniques: leverage/lift/relative risk/odds ratio

• Interest based pruning• Contrast sets are not interesting when they represent no

new information• E.g. marital_status=husband Λ sex=male

][][max,

max jLiUjiij

2

Interest based pruning cont’

STUCCO prunes the cset that do not satisfy either one of following conditions• (1)

• (2)

• is normally set to a very small number,say δ/2• If A and B are itemsets where A B, we also prune ⊂

the following:• If A is infrequent, prune A and B• A={1,4,6} B={1,3,4,6}, supp(B)must be less than supp(A).

)|'()|( iii GTruecsetPGTruecsetP

siii

GcsetsGcsets |),'(),(|max

s

Filtering Algorithm

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

Content

• Overview• Brief Introduction• Basic Definitions• STUCCO Algorithm• MORE Algorithm

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

MORE Algorithm

• Mining Optimal Risk pattErn sets• Input: data set, minimum support and the minimum relative

risk threshold. Output: optimal risk pattern set

MORE cont’

• Advantage: it makes use of the anti-monotone property to efficiently prune the search space• anti-monotone: if (supp(Px|¬a))=supp(P|¬a)), then pattern

PX and all its super patterns do not occur in the optimal risk pattern set

• Deficiencies: • MORE requires a predefined minimum support;• The Relative Ratio results fail to show statistical error and

residuals (details next slide);• Needs to apply more techniques from STUCCO to

determine superfluous patterns.

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

Statistical error and residuals

• Given this Risk pattern result generated by MORE:

RR=2.00. But this value is calculated from sample mean, which may not represent the truth of unobservable population mean.

• Hence, we need an acceptable range value, say [1.84, 2.47], instead of a singe value for RR.

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

Questions?

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.