Upload
antony-allen
View
230
Download
0
Tags:
Embed Size (px)
Citation preview
Data Mining – Engineering Input and Output
Chapter 7
Performance Estimation and Parameter Tuning
• If you compare approaches using 10-fold cross validation, and vary parameters to see which comes out best …– You are performing human-assisted machine
learning and essentially peeking at the test data as part of your learniing
– Performance by best approach probably over-estimates performance on a completely new, never before seen set of data
Engineering Input
• Attribute selection
• Attribute discretization
• The test during learning cannot be done on training data or test data
• Data cleansing
• Creation of new “synthetic” attributes – E.g. combination of two or more attributes
Attribute Selection
• Some attributes are irrelevant – adding irrelevant attributes “distract” or “confuse” machine learning schemes– Divide and conquer approaches end up at some point dealing
with a small number of instances, where coincidences with irrelevant attributes may seem significant
– Same with instance based approaches
– Naïve Bayes is immune since it looks at all instances and all attributes and assumes independence – perfect for an irrelevant attribute – since it IS independent
Attribute Selection
• Some attributes are redundant with other attributes – Leads Naïve Bayes and linear regression astray
Attribute Selection
• Removing irrelevant or redundant attributes – Increases performance (not necessarily dramatically)– Speeds learning– Likely results in a simpler model (e.g. smaller
decision tree; fewer or shorter rules)
Attribute Selection• Best approach – select relevant attributes manually,
using human knowledge and experience (assuming that you are not RESEARCHING automatic attribute selection)
• Much research has addressed attribute selection
• WEKA has several approaches supported
• We will discuss several approaches
Filter vs Wrapper
• Two fundamentally different approaches
• Filter – based on analysis of data independent of any learning algorithm to be used – data set is filtered in advance
• Wrapper – evaluate which attributes should be used using the machine learning algorithm that will be used – the learning method is wrapped inside the selection procedure
Attribute Selection Filters• Use a (presumably different) machine learning algorithm
– E.g. use a decision tree learner; any attribute used at any part of the tree will be kept in the data set to be learned from
• Choose subset of attributes sufficient to divide all instances– This represents a bias toward consistency of training data– This may not be true (may be noise)– May result in overfitting
• Can examine all instances in an instance-based manner, comparing “near-misses” to each other and “near-hits”– An attribute with a different value in near hits may be irrelevant– An attribute with a different value in near misses may be important– Tallies of each for each attribute are summed– Highest scoring attributes are selected– Problem: doesn’t deal with redundant attributes – will both be in or
both be out
Searching the Attribute Space• Most filter approaches involve searching the space
of attributes for the subset that is most likely to predict the class best
• See next slide – shows the space of possible attribute subsets for the weather data
Figure 7.1 Attribute space for the weather dataset.
Searching the Attribute Space• Any of the bubbles could represent the best subset of attributes
• The number of possible subsets is exponential in the number of attributes
• Cannot do brute-force search except on VERY simple problems
• Common to search space starting either from top or from the bottom– Systematic – changes by moving on an arc
– Forward Selection – move down adding one attribute to subset
– Backward Elimination – move up, removing one attribute from subset
• Search proceeds “greedily” – always going in the same direction, never back-tracking
• Subsets are evaluated some way, search proceeds until no improvement can be found (this may be a “local maximum”)– Evaluation may be via correlation with attribute to be predicted or other
methods
Sketch of Algorithm
• BestSoFar = Current = Starting Point (top or bottom)• BestEval = Evaluation = Eval(current)• Repeat
– newMax = 0 ;
– loop through possible moves (over arcs)• Follow arc to set current
• Eval(current)
• Update newMax if appropriate, along with newBest
– Update BestSoFar and BestEval if newMax > BestEval
• Until no improvement during inner loop
Forward Selection vs Backward Elimination
• Forward selection tends to produce smaller subsets– Why? - Evaluation is only an estimate of value, a single
optimistic evaluation can lead to premature stopping (forward – a bit too small; backward – a bit too large a subset)
– Good if concerned about understandability – learning will produce a simpler concept description
• Backward elimination tends to produce greater performance in learner
Improvements• May introduce bias toward small subsets
– E.g. during forward selection, move down required to provide substantial better evaluation instead of just any improvement
• Bi-directional search – at time of move, consider moves in either direction
• Best First Search – keeps list of all subsets evaluated so far, sorted in order by eval, moves forward at each stage from highest rated node that hasn’t been already been looked forward from. At stopping time (if no stopping time/criteria, will be exhaustive search), highest rated is chosen
• Beam search – similar to best first, but only keeps N subsets at each stage (N is a fixed “beam-width”)
• Genetic algorithms – natural selection – random mutations of a current list of candidate subsets are evaluated, best kept
Wrappers
• May use forward election or backward elimination• Evaluation step is via performance of the learning algorithm
(on validation dataset)(preferably measured using 10-fold cross-validation)
• Has been successful in some cases, (not in others)• Very costly; run 10-fold cross validation, many times• Selective Naïve Bayes – forward selection, evaluated by
performance on training data (doubly naïve, but has been successful in practice)
7.2 Attribute Discretization
• Some algorithms cannot handle numeric attributes• Some algorithms may perform better without numeric
attributes• Some algorithms may be more efficient without
numeric attributes
• One approach was discussed when discussing OneR• NOTE – it may be useful to go part way – to ordered
categories
Discretization while Preserving Order, Compatible with Algorithms that only handle Nominal Data
Value 1 3 5 8 9 10 11 13 15 16 19 20 23 24 26
category 1-5
1-5
1-5
8-11
8-11
8-11 8-11 13-16 13-16
13-16
19-20
19-20
23-26
23-26
23-26
<=5 Y Y Y N N N N N N N N N N N N
<=11 Y Y Y Y Y Y Y N N N N N N N N
<=16 Y Y Y Y Y Y Y Y Y Y N N N N N
<=20 Y Y Y Y Y Y Y Y Y Y Y Y N N N
•Can get split between any categories with say rules such as
•IF ‘<=11’ = ‘Y’ Then …
•This technique can be used after any discretizing method has created categories
Supervised vs Unsupervised
• OneR’s discretization is supervised – class is considered
• Unsupervised – only values for the attribute being discretized are considered.
• Supervised can be beneficial because divisions may later help the learning method – by providing an attribute that helps to divide classes
Unsupervised
• Equal-Interval Binning - Divide range on the attribute into N subranges (based on how many categories (or bins) are desired)
• Equal-Frequency Binning – divide range into N subranges such that each subrange has the same number of instances in it
Equal Interval vs Equal Frequency
Value 1 3 5 8 9 10 11 13 15 16 19 20 23 24 26
Equal interval
1 1 1 2 2 2 2 3 3 3 4 4 5 5 5
Equal Frequency
1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
• Equal interval bins are 1-6, 6-11, 11-16, 16-21, 21-26
•Book favors equal frequency; I favor equal interval
• Equal interval may have unbalanced distribution into bins
• But I argue that this retains the distribution of actual data, losing less info
Equal Interval vs Equal Frequency
• in EITHER approach, cutoffs may be arbitrary
•E.g. between 55.6 and 57.5 (both Yes by the way) in BOTH approaches (book says equal interval may be arbitrary, BOTH may be)
Val 30.1 32.5 37.5 40.5 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4 85.3 90.4 95.4
Equal interval
-inf-43.1
-inf-43.1
-inf-43.1
-inf-43.1
-inf-43.1
43.1-56.2
56.2-69.2
56.2-69.2
56.2-69.2
69.2-82.3
69.2-82.3
82.3-inf
82.3-inf
82.3-inf
82.3-inf
Equal frequency
-inf-39.0
-inf-39.0
-inf-39.0
39.0-56.55
39.0-56.55
39.0-56.55
56.55-73.45
56.55-73.45
56.55-73.45
73.45-84.35
73.45-84.35
73.45-84.35
84.35-inf
84.35-inf
84.35-inf
Equal Interval vs Equal Frequency – Extreme Case
Value 1 2 3 4 5 6 8 11 15 20 26 33 41 50 60
Equal interval
1 1 1 1 1 1 1 1 2 2 3 3 4 5 5
Equal Frequency
1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
• equal interval bins are 1-12,12-24,24-36,36-48,48-60
•equal frequency makes fine distinctions among data that is close together (e.g. 1-6) and ignores big differences among data that is far apart (e.g. 41-60)
• Is 11 more like 5 than it is like 33 ? (I think in general it is)
Not in Book - Other Possibilities for Unsupervised
• Clustering
• Gap finding
K-Means Clustering (Fancy Binning)
• Designed for “Smoothing” data, rather than discretization• Method:
– Sort Values
– Divide distinct values into number of Bins desired
– Compute total distance from bin means
– While can improve distance• loop through values
– if value closer to “neighbor bin mean” than own, move it
• compute new distance
• If nominal values needed, convert to categories
Example k-means• E.g. Humidity (sorted) - 15 values ==> 5 bins
Val 30.1 32.5 37.5 40.5 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4 85.3 90.4 95.4
Mean 33.37 33.37 33.37 46.23 46.23 46.23 64.43 64.43 64.43 80.37 80.37 80.37 90.37 90.37 90.37
Diff 3.27 0.87 4.13 5.73 3.63 9.37 6.93 2.97 3.97 1.87 1.17 3.03 5.07 0.03 5.03
•Total Error 57.066673
•Consider moves: Move 55.6 to right; Move 85.3 to left
Example k-means (con)• New Clusters
Val 30.1 32.5 37.5 40.5 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4 85.3 90.4 95.4
Mean 33.37 33.37 33.37 41.55 41.55 62.23 62.23 62.23 62.23 81.60 81.60 81.60 81.60 92.9 92.9
Diff 3.27 0.87 4.13 1.05 1.05 6.63 4.72 5.18 6.18 3.10 2.40 1.80 3.70 2.5 2.5
•Total Error 49.066673
•Consider moves: Move 37.5 to right
Example k-means (con)
• New Clusters
Val 30.1 32.5 37.5 40.5 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4 85.3 90.4 95.4
Mean 31.3 31.3 40.2 40.2 40.2 62.23 62.23 62.23 62.23 81.60 81.60 81.60 81.60 92.9 92.9
Diff 1.20 1.20 2.70 0.30 2.40 6.63 4.72 5.18 6.18 3.10 2.40 1.80 3.70 2.5 2.5
Total Error 46.500008 •No Improvement Possible - use new means
Discretized by K-means
•
Val 30.1 32.5 37.5 40.5 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4 85.3 90.4 95.4
category
-inf-35.0
-inf-35.0
35.0-49.1
35.0-49.1
35.0-49.1
49.1-73.45
49.1-73.45
49.1-73.45
49.1-73.45
73.45-87.85
73.45-87.85
73.45-87.85
73.45-87.85
87.85-inf
87.85-inf
Gap Finding • If finding N bins, find N-1 biggest gapsVal 30.
132.5 37.5 40.5 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4 85.3 90.4 95.4
Gap after
2.4 5.0 3.0 2.1 13.0 1.9 9.9 1.0 10.1 0.7 4.2 1.9 5.1 5.0
Gap Rank
1 3 2 4
Val 30.1 32.5 37.5 40.5 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4 85.3 90.4 95.4
Gap after
2.4 5.0 3.0 2.1 13.0 1.9 9.9 1.0 10.1 0.7 4.2 1.9 5.1 5.0
Gap Rank
1 3 2 4
Category
-inf-49.1
-inf-49.1
-inf-49.1
-inf-49.1
-inf-49.1
49.1-62.45
49.1-62.45
62.45-73.45
62.45-73.45
73.45- 87.85
73.45- 87.85
73.45- 87.85
73.45- 87.85
87.85-inf
87.85-inf
• As with the OneR scheme, we might want to limit smallest size bin (to at least > 1 instances)
Supervised• OneR’s method
• Entropy-based discretization
• (Not In Book) Class Entropy Binning
OneR’s Method with B=3
Val 30.1 32.5 37.5 40.5 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4 85.3 90.4 95.4
Class
Y Y Y Y N Y Y N Y N N Y N N N
Category
-inf-41.55
-inf-41.55
-inf-41.55
-inf-41.55
41.55-73.45
41.55-73.45
41.55-73.45
41.55-73.45
41.55-73.45
73.45-inf
73.45-inf
73.45-inf
73.45-inf
73.45-inf
73.45-inf
• Don’t get 5 categories due to requirement that a category have at least 3 of the majority class
Entropy-Based Discretization
• Consider each possible dividing place, calculate entropy for each
• Find smallest entropy, put divider halfway between values on each side of split
• Unless stop condition reached, recursively call on top range
• Unless stop condition reached, recursively call on lower range
Entropy-Based Discretization Example• Possible Dividing Places Shown with orange lines below
Val 30.1 32.5 37.5 40.5 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4 85.3 90.4 95.4
Class
Y Y Y Y N Y Y N Y N N Y N N N
• Entropy Calculations shown in entropy discretization spreadsheet
First Divisiony n total info value
all 8 7 15 0.996792 y n total total inst entropy<41.55 4 0 4 0 > 4 7 11 0.94566 15 0.693484<49.1 4 1 5 0.721928 > 4 6 10 0.970951 15 0.887943<62.45 6 1 7 0.591673 > 2 6 8 0.811278 15 0.708796<67.9 6 2 8 0.811278 > 2 5 7 0.863121 15 0.835471<73.45 7 2 9 0.764205 > 1 5 6 0.650022 15 0.718532<81.3 7 4 11 0.94566 > 1 3 4 0.811278 15 0.909825<84.35 8 4 12 0.918296 > 0 3 3 0 15 0.734637
•Best is < 41.55 after first 4 Yes instances
•Looking ahead, lower range is pure, so stopping condition will surely be met, upper range still needs to be divided
Example Continued• Possible Dividing Places Shown with orange lines below
Val 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4 85.3 90.4 95.4
Class
N Y Y N Y N N Y N N N
• Entropy Calculations shown in entropy discretization spreadsheet
Second Division
y n total info value>41.55 4 7 11 0.94566 y n total total inst entropy<49.1 0 1 1 0 > 4 6 10 0.970951 11 0.882682<62.45 2 1 3 0.918296 > 2 6 8 0.811278 11 0.840465<67.9 2 2 4 1 > 2 5 7 0.863121 11 0.912895<73.45 3 2 5 0.970951 > 1 5 6 0.650022 11 0.795899<81.3 3 4 7 0.985228 > 1 3 4 0.811278 11 0.921974<84.35 4 4 8 1 > 0 3 3 0 11 0.727273
•Best is > 84.35 before last 3 No instances
•Upper range is pure, so stopping condition will surely be met, lower range still needs to be divided
Example Continued• Possible Dividing Places Shown with orange lines below
Val 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4
Class
N Y Y N Y N N Y
• Entropy Calculations shown in entropy discretization spreadsheet
Third Divisiony n total info value
<84.35 (all) 4 4 8 1 y n total total inst entropy<49.1 0 1 1 0 > 4 3 7 0.985228 8 0.862075<62.45 2 1 3 0.918296 > 2 3 5 0.970951 8 0.951205<67.9 2 2 4 1 > 2 2 4 1 8 1<73.45 3 2 5 0.970951 > 1 2 3 0.918296 8 0.951205<81.3 3 4 7 0.985228 > 1 0 1 0 8 0.862075
•Not getting good divisions here
•Best is tie between splitting off first No or Last Yes let’s arbitrarily take the upper Yes (the gain is so low here, we might even skip taking this split and stop here – see discussion of minimum description length principle)
•Upper range is pure, so stopping condition will surely be met, lower range still needs to be divided
Example Continued• Possible Dividing Places Shown with orange lines below
Val 42.6 55.6 57.5 67.4 68.4 78.5 79.2
Class
N Y Y N Y N N
• Entropy Calculations shown in entropy discretization spreadsheet
Fourth Divisiony n total info value
41.55-81.3 3 4 7 0.985228 y n total total inst entropy<49.1 0 1 1 0 > 3 3 6 1 7 0.857143<62.45 2 1 3 0.918296 > 1 3 4 0.811278 7 0.857143<67.9 2 2 4 1 > 1 2 3 0.918296 7 0.964984<73.45 3 2 5 0.970951 > 0 2 2 0 7 0.693536
•Best is > 73.45 before last 2 No instances
•Upper range is pure, so stopping condition will surely be met, lower range still needs to be divided
•Well, if our stopping condition includes how many bins we have, and we are looking for 5, we would stop here
Let’s Say Stopping Condition Hit with 5 Categories
Val 30.1 32.5 37.5 40.5 42.6 55.6 57.5 67.4 68.4 78.5 79.2 83.4 85.3 90.4 95.4
Class
Y Y Y Y N Y Y N Y N N Y N N N
Category
-inf-41.55
-inf-41.55
-inf-41.55
-inf-41.55
41.55-73.45
41.55-73.45
41.55-73.45
41.55-73.45
41.55-73.45
73.45-81.3
73.45-81.3
81.3-84.35
84.35-inf
84.35-inf
84.35-inf
Combining Multiple Models
• When making decisions, it can be valuable to take into account more than one opinion
• In data mining, can combine the predictions of multiple models
• Generally improves performance • Common methods: bagging, boosting, stacking,
error-correcting codes • Negative - Makes “resulting” “model” harder
for people to understand
Bagging and Boosting• Take votes of learned models
• (For numeric prediction, take average)
• Bagging – equal votes / averages
• Boosting – weighted votes / averages – weighted by model’s performance
• Another significant difference – – Bagging involves learning separate models (could even
be parallel)– Boosting involves iterative generation of models
Bagging• Several training datasets are chosen at random
– Datasets generated using bootstrap method (Section 5.4)• Sample with replacement
• Training is carried out on each, producing models• Test instances are predicted by using all models
generated and having them vote for their prediction
• Bagging produces a combined model that often outperforms a single model built from the original training data
Figure 7.6 Algorithm for bagging.
model generationLet n be the number of instances in the training data.For each of t iterations: Sample n instances with replacement from training data. Apply the learning algorithm to the sample. Store the resulting model.
classificationFor each of the t models: Predict class of instance using model.Return class that has been predicted most often.
• if doing numeric prediction, average predictions, instead of voting
Bagging Critique
• Beneficial if– Learning algorithm IS NOT stable – differences in
data will lead to different models (OneR, Linear regression might not be good candidates)
– Models learned have pretty good performance – combining advice from a number of models each of which is wrong most of the time will lead us to be wrong!
– Ideally if the different models do well on different parts of the dataset
Boosting
• AdaBoost.M1 is the algorithm described– Assumes classification task– Assumes learner can handle weighted instances– Error = sum of weights of misclassified instances /
total weights of all instances
• By weighting instances, learner is led to focus on instances with high weights – greater incentive to get them right
Figure 7.7 Algorithm for boosting.model generationAssign equal weight to each training instance.For each of t iterations: Apply learning algorithm to weighted dataset and store resulting model. Compute error e of model on weighted dataset and store error. If e equal to zero, or e greater or equal to 0.5: Terminate model generation. For each instance in dataset: If instance classified correctly by model: Multiply weight of instance by e / (1 - e). Normalize weight of all instances.
classificationAssign weight of zero to all classes.For each of the t (or less) models: Add -log(e / (1 - e)) to weight of class predicted by model.Return class with highest weight.
What if the learning algorithm doesn’t handle weighted instances?
• Get same effect by re-sampling – instances that are incorrectly predicted are chosen with higher probability – likely in training dataset more than once
• Disadvantage – some low weight instances will not be sampled and will lose any influence
• On other hand, if error gets up to .5, with resampling, can toss that model and try again after generating another sample
Boosting Critique
• “If boosting does succeed in reducing the error on fresh data, it often does so in a spectacular way”
• Further boosting after zero error is reached can continue increeasing performance on unseen data
• Powerful combined classifiers can be built from very simple ones (even as simple as OneR)
• Boosting often significantly better than bagging, but in some practical situations, boosting hurts – does worse than single classifier on unseen data – may overfit data
Stacking
• Less used than bagging and boosting– No generally agreed upon method
• Unlike bagging and boosting combines different types of models instead of same types
• Stacking tries to learn which classifiers are reliable models – using another learning algorithm – the meta learner – to discover how to best combine the output of the original (“base”) learners
• “Linear Models” have been found to be good meta-learners in this scheme
• Details will be skipped (p258-260)
Error-Correcting Output Codes• SKIP
End Chapter 7