Information Theory, Classification & Decision Trees
Ling 572: Advanced Statistical Methods in NLP
January 5, 2012
Information Theory

Entropy
An information-theoretic measure
Measures the information in a model
Conceptually, a lower bound on the number of bits needed to encode
Entropy H(X), where X is a random variable with probability function p:
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
For example, a fair coin has H(X) = 1 bit; a completely predictable outcome has H(X) = 0.

Cross-Entropy
Comparing models
The actual distribution p is unknown; we use a simplified model m to estimate it
A model m that matches p more closely has lower cross-entropy
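In symbols, the standard definition for a true distribution p and model m is:

H(p, m) = -\sum_{x} p(x) \log_2 m(x)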

Relative Entropy
Commonly known as Kullback-Leibler (KL) divergence
Expresses the difference between two probability distributions
Not a proper distance metric: it is asymmetric, i.e. KL(p||q) != KL(q||p)
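Written out, the standard definition is:

KL(p||q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}

Note that KL(p||q) = H(p, q) - H(p), linking it to the cross-entropy above.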

Joint & Conditional Entropy
Joint entropy: the entropy of the pair (X, Y)
Conditional entropy: the entropy remaining in Y once X is known
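The standard definitions in this notation are:

H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log_2 p(x, y)
H(Y|X) = -\sum_{x}\sum_{y} p(x, y) \log_2 p(y|x) = H(X, Y) - H(X)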

Perplexity and Entropy
Given that H(L, P) = -\frac{1}{N} \log_2 P(w_1 \ldots w_N),
consider the perplexity equation:
PP(W) = P(w_1 \ldots w_N)^{-1/N} = 2^{-\frac{1}{N} \log_2 P(w_1 \ldots w_N)} = 2^{H(L, P)}
where H is the entropy of the language L.

Mutual Information
A measure of the information shared between two distributions
Symmetric: I(X;Y) = I(Y;X)
I(X;Y) = KL(p(x,y) || p(x)p(y))
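Expanding the KL form gives the usual sum and relates mutual information back to entropy:

I(X;Y) = \sum_{x}\sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)} = H(X) - H(X|Y) = H(Y) - H(Y|X)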
Decision Trees

Classification Task
Task: C is a finite set of labels (aka categories, classes); given x, determine its category y in C
Instance: (x, y), where x is the thing to be labeled/classified and y is its label/class
Data: a set of instances; labeled data: y is known; unlabeled data: y is unknown
Training data, test data

Two Stages
Training: Learner: training data -> classifier
Classifier: f(x) = y, where x is the input and y is in C
Testing: Decoder: test data + classifier -> classification output
Also: preprocessing, postprocessing, evaluation

Roadmap
Decision Trees:
Sunburn example
Decision tree basics
From trees to rules
Key questions: Training procedure? Decoding procedure? Overfitting? Different feature types?
Analysis: Pros & Cons

Sunburn Example

Name   Hair    Height   Weight   Lotion  Result
Sarah  Blonde  Average  Light    No      Burn
Dana   Blonde  Tall     Average  Yes     None
Alex   Brown   Short    Average  Yes     None
Annie  Blonde  Short    Average  No      Burn
Emily  Red     Average  Heavy    No      Burn
Pete   Brown   Tall     Heavy    No      None
John   Brown   Average  Heavy    No      None
Katie  Blonde  Short    Light    Yes     None

Learning about Sunburn
Goal: train on labeled examples; predict Burn/None for new instances
Solution?? Exact match: same features, same output
Problem: 2 * 3^3 = 54 feature combinations; could be much worse
Same label as 'most similar'
Problem: What counts as close? Which features matter?
Many instances match on two features but differ in result

Learning about Sunburn
Better solution: decision tree
Training: divide examples into subsets based on feature tests; the sets of samples at the leaves define the classification
Prediction: route a NEW instance through the tree to a leaf based on its feature tests; assign the same value as the samples at that leaf

Sunburn Decision Tree

Hair Color = Blonde -> Lotion Used?
  Lotion Used = No  -> Sarah: Burn, Annie: Burn
  Lotion Used = Yes -> Katie: None, Dana: None
Hair Color = Red   -> Emily: Burn
Hair Color = Brown -> Alex: None, John: None, Pete: None

Decision Tree Structure
Internal nodes: each node is a test; generally tests a single feature, e.g. Hair == ?; theoretically could test multiple features
Branches: each branch corresponds to an outcome of the test, e.g. Hair == Red; Hair != Blonde
Leaves: each leaf corresponds to a decision
Discrete class: classification/decision tree; real value: regression tree

From Trees to Rules
Tree: branches from root to leaves = tests => classifications
Tests = antecedents (if-conditions); leaf labels = consequents
All decision trees can be converted to rules; not all rule sets can be expressed as trees

From ID Trees to Rules

(Same tree as above: Hair Color at the root, Lotion Used under Blonde.)

(if (equal haircolor blonde) (equal lotionused yes) (then None))
(if (equal haircolor blonde) (equal lotionused no) (then Burn))
(if (equal haircolor red) (then Burn))
(if (equal haircolor brown) (then None))

Which Tree?
Many possible decision trees exist for any problem
How can we select among them?
What would be the 'best' tree? Smallest? Shallowest? Most accurate on unseen data?

Simplicity
Occam's Razor: the simplest explanation that covers the data is best
Occam's Razor for decision trees: the smallest tree consistent with the samples will be the best predictor for new data
Problem: finding all trees and then the smallest is expensive!
Solution: greedily build a small tree

Building Trees: Basic Algorithm
Goal: build a small tree such that all samples at the leaves have the same class
Greedy solution: at each node, pick a test using the 'best' feature
Split into subsets based on the outcomes of the feature test
Repeat the process until the stopping criterion is met, i.e. until each leaf's samples have the same class (sketched below)
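A minimal Python sketch of this greedy loop, assuming instances are (feature_dict, label) pairs and a best_feature(instances, features) scoring helper (e.g. one that maximizes the information gain defined later); the names are illustrative, not from the slides:

    from collections import Counter

    def build_tree(instances, features, best_feature):
        """Greedily grow a decision tree from (feature_dict, label) pairs."""
        labels = [y for _, y in instances]
        # Stopping criterion: all samples at this leaf share one class
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]  # leaf: majority label

        feat = best_feature(instances, features)          # pick the 'best' test
        remaining = [f for f in features if f != feat]

        # Split into subsets based on the outcomes of the feature test
        subsets = {}
        for x, y in instances:
            subsets.setdefault(x[feat], []).append((x, y))

        # Recurse on each branch
        return {feat: {value: build_tree(subset, remaining, best_feature)
                       for value, subset in subsets.items()}}

    def classify(tree, x):
        """Route an instance through the tree to a leaf label."""
        while isinstance(tree, dict):
            feat = next(iter(tree))
            tree = tree[feat][x[feat]]
        return tree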

Key Questions
Splitting: how do we select the 'best' feature?
Stopping: when do we stop splitting, to avoid overfitting?
Features: how do we split on different types of features? Binary? Discrete? Continuous?

Building Decision Trees: I
Goal: build a small tree such that all samples at the leaves have the same class
Greedy solution: at each node, pick the test whose branches are closest to having a single class
Split into subsets where most instances are in a uniform class

Picking a Test

Candidate splits on the full data set:

Hair Color: Blonde -> Sarah:B, Dana:N, Annie:B, Katie:N | Red -> Emily:B | Brown -> Alex:N, Pete:N, John:N
Height: Short -> Alex:N, Annie:B, Katie:N | Average -> Sarah:B, Emily:B, John:N | Tall -> Dana:N, Pete:N
Weight: Light -> Sarah:B, Katie:N | Average -> Dana:N, Alex:N, Annie:B | Heavy -> Emily:B, Pete:N, John:N
Lotion: No -> Sarah:B, Annie:B, Emily:B, Pete:N, John:N | Yes -> Dana:N, Alex:N, Katie:N

Picking a Test

Candidate splits on the Blonde subset (Sarah, Dana, Annie, Katie):

Height: Short -> Annie:B, Katie:N | Average -> Sarah:B | Tall -> Dana:N
Weight: Light -> Sarah:B, Katie:N | Average -> Dana:N, Annie:B | Heavy -> (none)
Lotion: No -> Sarah:B, Annie:B | Yes -> Dana:N, Katie:N

Measuring Disorder
Problem: in general, tests on large data sets don't yield homogeneous subsets
Solution: a general information-theoretic measure of disorder
Desired features: a homogeneous set has least disorder = 0; an even split has most disorder = 1

Measuring Entropy
If we split m objects into 2 bins of size m1 and m2, what is the entropy?

-\frac{m_1}{m} \log_2 \frac{m_1}{m} - \frac{m_2}{m} \log_2 \frac{m_2}{m} = -\sum_{i} \frac{m_i}{m} \log_2 \frac{m_i}{m}

[Plot: disorder (entropy) as a function of m1/m, ranging from 0 at m1/m = 0, up to 1 at m1/m = 0.5, and back to 0 at m1/m = 1.]

Measuring Disorder: Entropy
p_i = m_i / m, the probability of being in bin i
Entropy (disorder) of a split: -\sum_{i} p_i \log_2 p_i
with \sum_{i} p_i = 1 and 0 <= p_i <= 1; assume 0 \log_2 0 = 0

p1    p2    Entropy
1/2   1/2   -1/2 log2(1/2) - 1/2 log2(1/2) = 1/2 + 1/2 = 1
1/4   3/4   -1/4 log2(1/4) - 3/4 log2(3/4) = 0.5 + 0.311 = 0.811
1     0     -1 log2(1) - 0 log2(0) = 0 - 0 = 0
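A small Python helper, not from the slides, that computes this split entropy and reproduces the table values:

    from math import log2

    def entropy(counts):
        """Entropy (in bits) of a split given bin sizes; 0*log2(0) is taken as 0."""
        m = sum(counts)
        return -sum((c / m) * log2(c / m) for c in counts if c > 0)

    print(entropy([1, 1]))  # 1.0
    print(entropy([1, 3]))  # 0.811...
    print(entropy([4, 0]))  # 0.0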

Information Gain
InfoGain(Y|X): how many bits can we save if we know X?
InfoGain(Y|X) = H(Y) - H(Y|X)
(equivalently written InfoGain(Y, X))

Information Gain
InfoGain(S, A): the expected reduction in entropy of sample set S due to splitting on attribute A
Select the A with maximum InfoGain, i.e. the one resulting in the lowest average entropy over the branches

Computing Average Entropy
Given |S| instances split by attribute A into branches S_1, S_2, ...:

AvgEntropy(S, A) = \sum_{i} \frac{|S_i|}{|S|} H(S_i)

where |S_i| / |S| is the fraction of samples down branch i and H(S_i) is the disorder of the class distribution on branch i (e.g. Branch 1 with class counts S_{1a}, S_{1b}; Branch 2 with S_{2a}, S_{2b}).
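A minimal sketch of InfoGain(S, A) under these definitions, again assuming instances are (feature_dict, label) pairs (illustrative names, not from the slides):

    from collections import Counter
    from math import log2

    def class_entropy(instances):
        """Entropy of the class distribution over (features, label) instances."""
        counts = Counter(label for _, label in instances)
        m = sum(counts.values())
        return -sum((c / m) * log2(c / m) for c in counts.values())

    def info_gain(instances, attribute):
        """InfoGain(S, A) = H(S) - sum_i |S_i|/|S| * H(S_i)."""
        branches = {}
        for x, y in instances:
            branches.setdefault(x[attribute], []).append((x, y))
        avg = sum(len(b) / len(instances) * class_entropy(b)
                  for b in branches.values())
        return class_entropy(instances) - avg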

Entropy in Sunburn Example
S = [3B, 5N], so H(S) = 0.954

Hair color: 0.954 - (4/8 * (-2/4 log2(2/4) - 2/4 log2(2/4)) + 1/8 * 0 + 3/8 * 0) = 0.954 - 0.5 = 0.454
Height: 0.954 - 0.69 = 0.264
Weight: 0.954 - 0.94 = 0.014
Lotion: 0.954 - 0.61 = 0.344
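A quick standalone check of the hair-color figure (a sketch; the printed value matches the slide):

    from math import log2

    def H(counts):
        m = sum(counts)
        return -sum((c / m) * log2(c / m) for c in counts if c)

    h_s = H([3, 5])                                    # 0.954
    avg = 4/8 * H([2, 2]) + 1/8 * H([1]) + 3/8 * H([3])
    print(round(h_s - avg, 3))                         # 0.454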

Entropy in Sunburn Example (Blonde subset)
S = [2B, 2N], so H(S) = 1

Height: 1 - (2/4 * (-1/2 log2(1/2) - 1/2 log2(1/2)) + 1/4 * 0 + 1/4 * 0) = 1 - 0.5 = 0.5
Weight: 1 - (2/4 * (-1/2 log2(1/2) - 1/2 log2(1/2)) + 2/4 * (-1/2 log2(1/2) - 1/2 log2(1/2))) = 1 - 1 = 0
Lotion: 1 - 0 = 1

Building Decision Trees with Information Gain
Until there are no inhomogeneous leaves:
Select an inhomogeneous leaf node
Replace that leaf node by a test node, creating the subsets that yield the highest information gain
Effectively creates a set of rectangular regions: repeatedly draws lines along different axes

Alternate Measures
Issue with information gain: it favors features with more values
Option: Gain Ratio
S_a: the elements of S with value A = a
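The standard (Quinlan) definition, using S_a as above:

GainRatio(S, A) = \frac{InfoGain(S, A)}{SplitInfo(S, A)}, \quad SplitInfo(S, A) = -\sum_{a \in Values(A)} \frac{|S_a|}{|S|} \log_2 \frac{|S_a|}{|S|}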

Overfitting
Overfitting: the model fits the training data TOO well, fitting noise and irrelevant details
Why is this bad? It harms generalization: the model fits the training data well but fits new data badly
For a model m, consider training_error(m) and D_error(m), where D is all data
If m overfits, there is another model m' with training_error(m) < training_error(m') but D_error(m) > D_error(m')

Avoiding Overfitting
Strategies to avoid overfitting:
Early stopping: stop when InfoGain < threshold; stop when the number of instances < threshold; stop when tree depth > threshold
Post-pruning: grow the full tree, then remove branches
Which is better? Unclear; both are used. For some applications, post-pruning is better

Post-Pruning
Divide the data into:
Training set: used to build the original tree
Validation set: used to perform pruning
Build the decision tree on the training data
Until pruning no longer improves validation-set performance: compute the performance of pruning each node (and its children); greedily remove nodes whose removal does not reduce validation-set performance (sketched below)
Yields a smaller tree with the best validation performance
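A minimal sketch of this reduced-error pruning loop, assuming hypothetical helpers accuracy(tree, data), prunable_nodes(tree), and prune(tree, node) that are not defined on the slides:

    def post_prune(tree, validation):
        """Greedily prune while validation accuracy does not drop.

        accuracy, prunable_nodes, and prune are assumed helper functions
        (hypothetical, not part of the slides).
        """
        improved = True
        while improved:
            improved = False
            base = accuracy(tree, validation)
            # Try replacing each internal node with a leaf (its majority label)
            for node in prunable_nodes(tree):
                candidate = prune(tree, node)
                if accuracy(candidate, validation) >= base:
                    tree, improved = candidate, True
                    break
        return tree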

Performance Measures
Compute accuracy on a validation set, or via k-fold cross-validation
Weighted classification error cost: weight some types of errors more heavily
Minimum description length: favor good accuracy on compact models; MDL = error(tree) + model_size(tree)

Rule Post-Pruning
Convert the tree to rules
Prune each rule independently
Sort the final rule set
Probably the most widely used method (in toolkits)

Modeling Features
Different types of features need different tests
Binary: test branches on true/false
Discrete: one branch for each discrete value
Continuous? Need to discretize; enumerating all values is not possible or desirable
Pick a value x, with branches: value < x; value >= x
How can we pick split points?

Picking Splits
Need useful, sufficient split points
What's a good strategy?
Approach: sort all values of the feature in the training data; identify adjacent instances with different classes
Candidate split points lie between those instances; select the candidate with the highest information gain (sketched below)
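A small Python sketch of candidate split-point selection for one continuous feature, assuming values is a list of (feature_value, label) pairs; the names and the numeric example are illustrative, not from the slides:

    def candidate_splits(values):
        """Midpoints between adjacent training values whose classes differ."""
        ordered = sorted(values)                      # sort by feature value
        splits = []
        for (v1, y1), (v2, y2) in zip(ordered, ordered[1:]):
            if y1 != y2 and v1 != v2:                 # class change between neighbors
                splits.append((v1 + v2) / 2)          # candidate threshold
        return splits

    # Hypothetical numeric weights labeled Burn/None
    print(candidate_splits([(110, 'B'), (120, 'N'), (150, 'N'), (180, 'B')]))
    # -> [115.0, 165.0]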

Features in Decision Trees: Pros
Feature selection: tests features that yield low disorder, i.e. it selects the features that are important and ignores irrelevant ones
Feature type handling: discrete type: one branch per value; continuous type: branch on >= value
Absent features: distribute uniformly

Features in Decision Trees: Cons
Features are assumed independent; to capture a group effect, it must be modeled explicitly, e.g. by creating a new feature AorB
Feature tests are conjunctive

Decision Trees
Train: build the tree by forming subsets of least disorder
Predict: traverse the tree based on feature tests; assign the label of the samples at the leaf node
Pros: robust to irrelevant features and some noise, fast prediction, perspicuous rule reading
Cons: poor handling of feature combinations and dependencies; building the optimal tree is intractable