DECEMBER 8-9, 2016

BSSML16 L4. Association Discovery and Topic Modeling


Page 1: BSSML16 L4. Association Discovery and Topic Modeling

DECEMBER 8-9, 2016

Page 2: BSSML16 L4. Association Discovery and Topic Modeling

BigML, Inc 2

Poul Petersen CIO, BigML, Inc.

Association Discovery: Finding Interesting Correlations

Page 3: BSSML16 L4. Association Discovery and Topic Modeling

Association Discovery

• Algorithm: “Magnum Opus” from Geoff Webb

• Unsupervised Learning: Works with unlabelled data, like clustering and anomaly detection.

• Learning Task: Find “interesting” relations between variables.

Page 4: BSSML16 L4. Association Discovery and Topic Modeling

Unsupervised Learning

date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51

Clustering: similar instances

Anomaly Detection: unusual instances

Page 5: BSSML16 L4. Association Discovery and Topic Modeling

Association Rules

date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51

Rules:

Antecedent                          Consequent
{zip = 46140}                       {amount < 100}
{customer = Bob, account = 3421}    {class = gas}

Page 6: BSSML16 L4. Association Discovery and Topic Modeling


Use Cases

• Market Basket Analysis

• Web usage patterns

• Intrusion detection

• Fraud detection

• Bioinformatics

• Medical risk factors

Page 7: BSSML16 L4. Association Discovery and Topic Modeling

Magnum Opus

• Select measure of interest: Leverage, Lift, etc.

• System finds the top-k associations on that measure within constraints

• Must be a statistically significant interaction between antecedent and consequent

• Every item in the antecedent must increase the strength of association
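The "top-k on a chosen measure" idea can be sketched in a few lines. This is not the Magnum Opus algorithm: it skips the statistical significance test and the check that every antecedent item adds strength, and it only ranks rules that are already given. The rule names and counts are hypothetical, and `top_k_rules` is an illustrative helper, not a BigML function. Lift and leverage follow the definitions on the Association Metrics slides.

```python
from fractions import Fraction

def leverage(n, n_a, n_c, n_ac):
    # observed support minus the support expected under independence
    return Fraction(n_ac, n) - Fraction(n_a, n) * Fraction(n_c, n)

def lift(n, n_a, n_c, n_ac):
    # observed support divided by the support expected under independence
    return Fraction(n_ac, n) / (Fraction(n_a, n) * Fraction(n_c, n))

MEASURES = {"leverage": leverage, "lift": lift}

def top_k_rules(candidates, n, measure="leverage", k=2):
    """Rank candidate rules by the chosen measure and keep the best k."""
    score = MEASURES[measure]
    ranked = sorted(
        candidates,
        key=lambda c: score(n, c["n_a"], c["n_c"], c["n_ac"]),
        reverse=True,
    )
    return [c["rule"] for c in ranked[:k]]

# Hypothetical counts out of n = 100 transactions:
# n_a = rows matching the antecedent, n_c = rows matching the consequent,
# n_ac = rows matching both.
CANDIDATES = [
    {"rule": "bread -> butter", "n_a": 40, "n_c": 30, "n_ac": 20},
    {"rule": "caviar -> champagne", "n_a": 2, "n_c": 4, "n_ac": 2},
    {"rule": "milk -> eggs", "n_a": 50, "n_c": 40, "n_ac": 20},
]

print(top_k_rules(CANDIDATES, 100, "leverage"))  # ['bread -> butter', 'caviar -> champagne']
print(top_k_rules(CANDIDATES, 100, "lift"))      # ['caviar -> champagne', 'bread -> butter']
```

The two calls return different orderings, which is why the measure of interest is chosen up front: leverage favours rules that affect many instances, while lift favours strong but possibly rare associations.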

Page 8: BSSML16 L4. Association Discovery and Topic Modeling

Association Metrics

[Figure: sets A and C drawn inside the set of all Instances]

Coverage

Percentage of instances which match antecedent “A”

Page 9: BSSML16 L4. Association Discovery and Topic Modeling

Association Metrics

[Figure: sets A and C drawn inside the set of all Instances]

Support

Percentage of instances which match antecedent “A” and consequent “C”

Page 10: BSSML16 L4. Association Discovery and Topic Modeling

Association Metrics

[Figure: sets A and C drawn inside the set of all Instances, with Coverage and Support marked]

Confidence

Percentage of instances matching the antecedent which also match the consequent.
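As a worked example of coverage, support and confidence, the sketch below evaluates the rule {customer = Bob, account = 3421} → {class = gas} on the seven sample transactions from the Association Rules slide. Only the customer, account and class columns are kept; the function and variable names are illustrative assumptions.

```python
from fractions import Fraction

# (customer, account, class) columns of the seven sample transactions
ROWS = [
    ("Bob", 3421, "clothes"),
    ("Bob", 3421, "food"),
    ("Alice", 2456, "food"),
    ("Sally", 6788, "gas"),
    ("Bob", 3421, "tech"),
    ("Bob", 3421, "gas"),
    ("Sally", 6788, "food"),
]

def antecedent(row):
    # A: customer = Bob and account = 3421
    return row[0] == "Bob" and row[1] == 3421

def consequent(row):
    # C: class = gas
    return row[2] == "gas"

def rule_metrics(rows, matches_a, matches_c):
    """Return (coverage, support, confidence) as exact fractions."""
    n = len(rows)
    n_a = sum(1 for r in rows if matches_a(r))
    n_ac = sum(1 for r in rows if matches_a(r) and matches_c(r))
    coverage = Fraction(n_a, n)      # share of rows matching A
    support = Fraction(n_ac, n)      # share of rows matching A and C
    confidence = Fraction(n_ac, n_a) if n_a else Fraction(0)
    return coverage, support, confidence

coverage, support, confidence = rule_metrics(ROWS, antecedent, consequent)
print(coverage, support, confidence)  # 4/7 1/7 1/4
```

Confidence is simply support divided by coverage: (1/7) / (4/7) = 1/4.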

Page 11: BSSML16 L4. Association Discovery and Topic Modeling

Association Metrics

Confidence

0%: A never implies C

Between 0% and 100%: A sometimes implies C

100%: A always implies C

Page 12: BSSML16 L4. Association Discovery and Topic Modeling

Association Metrics

[Figure: observed overlap of A and C compared with the overlap expected if A and C were independent]

Lift

Ratio of observed support to support if A and C were statistically independent.

Lift = Support / [ p(A) * p(C) ] = Confidence / p(C)
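A small sketch of the lift calculation, using the counts for the rule {customer = Bob, account = 3421} → {class = gas} in the sample transactions: 7 rows, 4 matching A, 2 matching C, 1 matching both. The function names are illustrative. Both forms of the formula are computed to show that they agree.

```python
from fractions import Fraction

def lift(n, n_a, n_c, n_ac):
    """Lift as Support / [ p(A) * p(C) ]."""
    support = Fraction(n_ac, n)
    p_a = Fraction(n_a, n)
    p_c = Fraction(n_c, n)
    return support / (p_a * p_c)

def lift_from_confidence(n, n_a, n_c, n_ac):
    """Lift as Confidence / p(C); equivalent to the form above."""
    confidence = Fraction(n_ac, n_a)
    return confidence / Fraction(n_c, n)

print(lift(7, 4, 2, 1))                  # 7/8
print(lift_from_confidence(7, 4, 2, 1))  # 7/8
```

A lift of 7/8 is below 1, so in this tiny sample the antecedent makes the consequent slightly less likely than it is overall.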

Page 13: BSSML16 L4. Association Discovery and Topic Modeling

Association Metrics

Lift < 1: Negative Correlation

Lift = 1: No Association

Lift > 1: Positive Correlation

Page 14: BSSML16 L4. Association Discovery and Topic Modeling

Association Metrics

[Figure: observed overlap of A and C compared with the overlap expected if A and C were independent]

Leverage

Difference of observed support and support if A and C were statistically independent.

Leverage = Support - [ p(A) * p(C) ]
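The same counts can be plugged into the leverage formula. The sketch below evaluates both rules from the Association Rules slide on the seven sample transactions; the function name is an illustrative assumption.

```python
from fractions import Fraction

def leverage(n, n_a, n_c, n_ac):
    """Leverage as Support - [ p(A) * p(C) ]."""
    support = Fraction(n_ac, n)
    return support - Fraction(n_a, n) * Fraction(n_c, n)

# {customer = Bob, account = 3421} -> {class = gas}:
# 7 rows, 4 match A, 2 match C, 1 matches both
print(leverage(7, 4, 2, 1))  # -1/49

# {zip = 46140} -> {amount < 100}:
# 7 rows, 3 match A, 3 match C, 1 matches both
print(leverage(7, 3, 3, 1))  # -2/49
```

Unlike lift, leverage is an absolute difference in support, so a rule that covers few instances can never score a large leverage even when its lift is high.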

Page 15: BSSML16 L4. Association Discovery and Topic Modeling

Association Metrics

Leverage < 0: Negative Correlation

Leverage = 0: No Association

Leverage > 0: Positive Correlation

-1…

Page 16: BSSML16 L4. Association Discovery and Topic Modeling

Use Cases

GOAL: Discover “interesting” rules about what store items are typically purchased together.

• Dataset of 9,834 grocery cart transactions

• Each row is a list of all items in a cart at checkout

Page 17: BSSML16 L4. Association Discovery and Topic Modeling


Association Discovery Demo #1

Page 18: BSSML16 L4. Association Discovery and Topic Modeling


Use Cases

GOAL: Find general rules that indicate diabetes.

• Dataset of diagnostic measurements of 768 patients.

• Each patient labelled True/False for diabetes.

Page 19: BSSML16 L4. Association Discovery and Topic Modeling


Association Discovery Demo #2

Page 20: BSSML16 L4. Association Discovery and Topic Modeling

Medical Risks

Decision Tree

If plasma glucose > 155 and bmi > 29.32 and diabetes pedigree > 0.32 and insulin <= 629 and age <= 44 then diabetes = TRUE

Association Rule

If plasma glucose > 146 then diabetes = TRUE

Page 21: BSSML16 L4. Association Discovery and Topic Modeling


Poul Petersen CIO, BigML, Inc.

Topic Modeling: Discovering Meaning in Text

Page 22: BSSML16 L4. Association Discovery and Topic Modeling

Unsupervised Learning

[Figure: data table with Instances as rows and Features as columns]

• Learn from instances

• Each instance has features

• There is no label

Clustering: Find similar instances

Anomaly Detection: Find unusual instances

Association Discovery: Find feature rules

Page 23: BSSML16 L4. Association Discovery and Topic Modeling

Topic Model

• Unsupervised algorithm

• Learns only from text fields

• Finds hidden topics that model the text

Questions:

• How is this different from the Text Analysis that BigML already offers?

• What does it output and how do we use it?

• Unsupervised… model?

Page 24: BSSML16 L4. Association Discovery and Topic Modeling

Text Analysis

Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.

great: appears 4 times

Bag of Words
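A minimal bag-of-words sketch for the quote above. The count of 4 for "great" only works out if "greatness" is reduced to "great", so the code includes a deliberately crude suffix stripper as a stand-in for a real stemmer; that helper and the tokenizing regular expression are assumptions, not a description of BigML's text analysis.

```python
import re
from collections import Counter

QUOTE = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")

def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: drops a trailing "ness".
    if token.endswith("ness") and len(token) > 4:
        return token[:-4]
    return token

def bag_of_words(text):
    """Lowercase, tokenize, stem, and count the tokens of a text."""
    tokens = (t.strip("'") for t in re.findall(r"[a-z']+", text.lower()))
    return Counter(crude_stem(t) for t in tokens if t)

bag = bag_of_words(QUOTE)
print(bag["great"], bag["afraid"], bag["some"])  # 4 1 3
```

Word order is discarded entirely: the document is reduced to one count per token, which is what makes it usable as a row of numeric features.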

Page 25: BSSML16 L4. Association Discovery and Topic Modeling

Text Analysis

Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.

…  great  afraid  born  achieve  …  …
…  4      1       1     1        …  …
…  …      …       …     …        …  …

Model

The token “great” occurs more than 3 times

The token “afraid” occurs no more than once

Page 26: BSSML16 L4. Association Discovery and Topic Modeling


Topic Model Demo #1

Page 27: BSSML16 L4. Association Discovery and Topic Modeling

TA vs TM

Text Analysis

• Creates thousands of hidden token counts

• Token counts are independently uninteresting

• No semantic importance

• No measure of co-occurrence

Topic Model

• Creates tens of topics that model the text

• Topics are independently interesting

• Semantic meaning extracted

• Support for bigrams

Page 28: BSSML16 L4. Association Discovery and Topic Modeling

Generative Modeling

• Decision trees are discriminative models
    • Aggressively model the classification boundary
    • Parsimonious: Don’t consider anything you don’t have to

• Topic Models are generative models
    • Come up with a theory of how the data is generated
    • Tweak the theory to fit your data

Page 29: BSSML16 L4. Association Discovery and Topic Modeling

Generating Documents

Machine vocabulary: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…

Generated documents:

shoe asteroid flashlight pizza…

plate giraffe purple jump…

Be not afraid of greatness: some are born great, some achieve greatness…

• "Machine" that generates a random word with equal probability with each pull.

• Pull random number of times to generate a document.

• All documents can be generated, but most are nonsense.

word        probability
shoe        ϵ
asteroid    ϵ
flashlight  ϵ
pizza       ϵ
…           ϵ
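The random word "machine" can be sketched as a uniform sampler. The vocabulary below is the short word list shown on the slide, and the cap of 8 pulls is an arbitrary assumption made to keep the output small.

```python
import random

# Words shown on the slide; a real vocabulary would be far larger.
VOCABULARY = [
    "cat", "shoe", "zebra", "ball", "tree", "jump", "pen", "asteroid",
    "cable", "box", "step", "cabinet", "yellow", "plate", "flashlight",
]

def generate_document(rng, max_pulls=8):
    """Pull the machine a random number of times; every word is equally likely."""
    pulls = rng.randint(1, max_pulls)
    return [rng.choice(VOCABULARY) for _ in range(pulls)]

rng = random.Random(0)  # seeded only so that repeated runs give the same documents
for _ in range(3):
    print(" ".join(generate_document(rng)))
```

Any word sequence has a nonzero chance of coming out, including a Shakespeare quote, but nearly every draw is nonsense because no word is favoured over another.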

Page 30: BSSML16 L4. Association Discovery and Topic Modeling

Topic Model

Intuition:

• Written documents have meaning - one way to describe meaning is to assign a topic.

• For our random machine, the topic can be thought of as increasing the probability of certain words.

Topic: travel

Generated document: airplane passport pizza …

word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ

Topic: space

Generated document: mars quasar lightyear soda

word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
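A topic can be sketched as a reweighting of the same machine. The listed probabilities are the ones on the slide; the value used for ϵ, the ten-word vocabulary, and the implicit renormalisation over that vocabulary are assumptions made to keep the example small.

```python
import random

EPSILON = 1e-6  # assumed stand-in for the slide's "ϵ"

VOCABULARY = [
    "travel", "airplane", "mars", "mantle", "space",
    "cat", "shoe", "zebra", "passport", "quasar",
]

# Term probabilities listed on the slide; every other word gets EPSILON.
TOPICS = {
    "travel": {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003},
    "space": {"space": 0.3894, "mars": 0.1343, "mantle": 0.0005},
}

def generate_document(topic, length, rng):
    """Draw words with the topic's weights (random.choices renormalises them)."""
    weights = [TOPICS[topic].get(word, EPSILON) for word in VOCABULARY]
    return rng.choices(VOCABULARY, weights=weights, k=length)

rng = random.Random(42)
print(" ".join(generate_document("travel", 6, rng)))
print(" ".join(generate_document("space", 6, rng)))
```

The machine is unchanged; only the per-word weights differ, which is enough to make "mars" common under the space topic and very rare under the travel topic.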

Page 31: BSSML16 L4. Association Discovery and Topic Modeling

Topic Model

Topic: "1"

word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ

Topic: "2"

word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ

…

Topic: "k"

word       probability
shoe       12.12%
coffee     3.39%
telephone  13.43%
paper      4.11%
…          ϵ

Generated documents:

airplane passport pizza …

plate giraffe purple jump…

• Each text field in a row is concatenated into a document

• The documents are analyzed to generate "k" related topics that can model the documents

• Each topic is represented by a distribution of term probabilities
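The data shapes in these bullets can be sketched as follows: a row's text fields joined into one document, and a topic held as a mapping from term to probability. The rows, field names and helper functions are hypothetical, and the step that actually learns the k topics is not shown.

```python
def row_to_document(row, text_fields):
    """Join the non-empty text fields of one row into a single document string."""
    return " ".join(str(row[f]) for f in text_fields if row.get(f))

def top_terms(topic, n=3):
    """Return the n most probable terms of a topic."""
    return sorted(topic, key=topic.get, reverse=True)[:n]

# Hypothetical rows with two text fields each
rows = [
    {"id": 1, "title": "Trip report", "body": "airplane passport pizza"},
    {"id": 2, "title": "Night sky", "body": "mars quasar lightyear"},
]
documents = [row_to_document(r, ["title", "body"]) for r in rows]

# One topic as a distribution of term probabilities (values from the slide)
topic_1 = {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003}

print(documents[0])           # Trip report airplane passport pizza
print(top_terms(topic_1, 2))  # ['travel', 'airplane']
```

Listing the most probable terms is the usual way to read a topic, since the topics themselves come out unnamed ("1" … "k").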

Page 32: BSSML16 L4. Association Discovery and Topic Modeling


Topic Model Demo #2

Page 33: BSSML16 L4. Association Discovery and Topic Modeling


Use Cases

• As a preprocessor for other techniques

• Bootstrapping categories for classification

• Recommendation

• Discovery in large, heterogeneous text datasets

Page 34: BSSML16 L4. Association Discovery and Topic Modeling

Topic Distribution

Intuition:

• Any given document is likely a mixture of the modeled topics…

• This can be represented as a distribution of topic probabilities

Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?

Topic: travel (11%)

word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ

Topic: space (89%)

word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
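One way to see how a document becomes a mixture of topics is to hold the topics fixed and estimate the mixture weights by expectation-maximization. This is a toy sketch under stated assumptions: only the few term probabilities shown on the slide are known, ϵ is replaced by an arbitrary small constant, and words outside that tiny vocabulary are ignored. It will not reproduce the slide's 11% / 89% split, which comes from a full model, and it is not presented as BigML's inference method.

```python
import re

EPSILON = 1e-6  # assumed stand-in for the slide's "ϵ"

TOPICS = {
    "travel": {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003},
    "space": {"space": 0.3894, "mars": 0.1343, "mantle": 0.0005},
}

def topic_distribution(text, topics=TOPICS, iterations=50):
    """Estimate topic mixture weights for one document with fixed topics."""
    vocabulary = {w for probs in topics.values() for w in probs}
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w in vocabulary]
    weights = {name: 1.0 / len(topics) for name in topics}
    if not words:
        return weights  # nothing to learn from; keep the uniform prior
    for _ in range(iterations):
        totals = {name: 0.0 for name in topics}
        for word in words:
            # E-step: how responsible is each topic for this word?
            joint = {n: weights[n] * topics[n].get(word, EPSILON) for n in topics}
            norm = sum(joint.values())
            for name in topics:
                totals[name] += joint[name] / norm
        # M-step: new weights are the average responsibilities
        weights = {name: totals[name] / len(words) for name in topics}
    return weights

DOCUMENT = ("Will 2020 be the year that humans will embrace space exploration "
            "and finally travel to Mars?")
for name, weight in topic_distribution(DOCUMENT).items():
    print(f"{name}: {weight:.0%}")
```

With only "space", "travel" and "mars" recognised, the space topic ends up with the larger share, matching the direction (though not the size) of the split on the slide.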

Page 35: BSSML16 L4. Association Discovery and Topic Modeling


Topic Model Demo #3

Page 36: BSSML16 L4. Association Discovery and Topic Modeling

Batch Topic Distribution

[Figure: Clustering applied to Unlabelled Data gives a Batch Centroid, which adds a Centroid Label column]

[Figure: a Topic Model applied to Text Fields gives a Batch Topic Distribution, which adds topic 1 prob, topic 3 prob, topic k prob columns to the Unlabelled Data]

Page 37: BSSML16 L4. Association Discovery and Topic Modeling


Topic Model Demo #4

Page 38: BSSML16 L4. Association Discovery and Topic Modeling

Tips

• Setting k
    • Much like k-means, the best value is data specific
    • Too few will agglomerate unrelated topics, too many will partition highly related topics
    • I tend to find the latter more annoying than the former

• Tuning the Model
    • Remove common, useless terms
    • Set term limit higher, use bigrams
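The two tuning tips can be sketched as a preprocessing step: drop stop words, then add bigrams alongside the single terms. The stop-word list and the `terms` helper are illustrative assumptions; in practice these are settings of the topic model rather than hand-written code.

```python
import re

# Assumed, deliberately tiny stop-word list
STOPWORDS = {"the", "of", "and", "to", "a", "be", "that", "will"}

def terms(text, stopwords=STOPWORDS, use_bigrams=True):
    """Tokenize, drop stop words, and optionally append adjacent-pair bigrams."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stopwords]
    if not use_bigrams:
        return tokens
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(terms("The exploration of space"))
# ['exploration', 'space', 'exploration space']
```

Note that the bigrams here are formed after stop words are removed, so "exploration space" pairs two words that were not adjacent in the original text; forming bigrams first is an equally valid choice.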
