Overfitting & Transformation-Based Learning

CS 371: Spring 2012
Machine Learning

Machines can learn from examples.

Learning modifies the agent's decision mechanisms to improve performance.

Given training data, machines analyze the data and learn rules which generalize to new examples.

Learning can be sub-symbolic (the rule may be a mathematical function), or it can be symbolic (the rules are in a representation similar to the one used for hand-coded rules).

In general, machine learning approaches allow for more tuning to the needs of a corpus, and can be reused across corpora.
Training data example

Inductive learning. Empirical error function:

E(h) = Σ_x distance[h(x; θ), f(x)]

Empirical learning = finding h(x), or h(x; θ), that minimizes E(h).

Note an implicit assumption: for any set of attribute values there is a unique target value f(x). This in effect assumes a no-noise mapping from inputs to targets, which is often not true in practice (e.g., in medicine).
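As a concrete illustration (mine, not the slide's), the empirical error under a 0/1 distance can be written directly from the formula; the threshold hypothesis and data below are made up:

```python
def empirical_error(h, theta, data):
    """E(h): sum over examples of distance[h(x; theta), f(x)],
    here with 0/1 distance (1 if the prediction differs from the target)."""
    return sum(1 for x, fx in data if h(x, theta) != fx)

# Hypothetical threshold hypothesis: h(x; theta) = 1 if x >= theta else 0
h = lambda x, theta: 1 if x >= theta else 0
data = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]   # (x, f(x)) pairs
print(empirical_error(h, 0.5, data))              # 0: this theta fits the data
```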
Learning Boolean Functions

Given examples of the function, can we learn the function?

2^(2^d) different Boolean functions can be defined on d attributes (there are 2^d possible input combinations, and each can map to 0 or 1). This is the size of our hypothesis space.

Observations:

Huge hypothesis spaces → directly searching over all functions is impossible. Given small data (n pairs), our learning problem may be underconstrained.

Ockham's razor: if multiple candidate functions all explain the data equally well, pick the simplest explanation (the least complex function).

Constrain our search to classes of Boolean functions, e.g., decision trees.
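To make the size of this hypothesis space concrete (an illustration of my own), the sketch below enumerates every Boolean function on d = 2 attributes by treating a function as one assignment of outputs to the 2^d truth-table rows:

```python
from itertools import product

d = 2
rows = list(product([0, 1], repeat=d))                # 2^d = 4 input combinations
functions = list(product([0, 1], repeat=len(rows)))   # one output bit per row

print(len(functions))          # 2^(2^d) = 16 Boolean functions on d = 2 attributes
for outputs in functions[:3]:  # show a few of the truth tables
    print(dict(zip(rows, outputs)))
```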
Decision Tree Learning

Constrain h(·) to be a decision tree.
Pseudocode for Decision tree learning
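A minimal sketch of the standard recursive (ID3-style) decision-tree learner this slide refers to, assuming examples are (attribute-dict, label) pairs; the impurity measure here is a simple majority-error count, a stand-in for the information-gain measure discussed on the next slides:

```python
from collections import Counter

def majority(examples):
    """Most common label among the examples."""
    return Counter(y for _, y in examples).most_common(1)[0][0]

def split_errors(examples, attr):
    """Impurity of splitting on attr: majority-class errors summed over branches."""
    branches = {}
    for x, y in examples:
        branches.setdefault(x[attr], []).append((x, y))
    return sum(len(b) - Counter(y for _, y in b).most_common(1)[0][1]
               for b in branches.values())

def learn_tree(examples, attributes, default=None):
    """Recursive decision-tree learner over (attribute-dict, label) pairs."""
    if not examples:
        return default                   # no data: fall back to parent majority
    if len({y for _, y in examples}) == 1:
        return examples[0][1]            # pure node: return the single label
    if not attributes:
        return majority(examples)        # attributes exhausted: majority label
    best = min(attributes, key=lambda a: split_errors(examples, a))
    tree = {best: {}}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        tree[best][v] = learn_tree(subset, [a for a in attributes if a != best],
                                   majority(examples))
    return tree
```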
Major issues

Q1: Choosing the best attribute: what quality measure to use?

Q2: Handling training data with missing attribute values.

Q3: Handling training data with noise and irrelevant attributes: determining when to stop splitting, to avoid overfitting.
Major issues

Q1: Choosing the best attribute: different quality measures, e.g., information gain or gain ratio (see the sketch below).

Q2: Handling training data with missing attribute values: blank value, most common value, or fractional counts.

Q3: Handling training data with noise and irrelevant attributes: determining when to stop splitting: ????
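To make Q1 concrete, here is a small, self-contained computation of information gain and gain ratio for one candidate attribute; the toy data and function names are my own illustration, not the slides':

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of labels, in bits."""
    counts = Counter(labels).values()
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts)

def info_gain_and_ratio(examples, attribute):
    """examples: list of (attribute-dict, label) pairs."""
    labels = [y for _, y in examples]
    n = len(examples)
    by_value = {}
    for x, y in examples:
        by_value.setdefault(x[attribute], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    gain = entropy(labels) - remainder
    # Split information penalizes many-valued attributes (gain ratio = gain / split info)
    split_info = entropy([x[attribute] for x, _ in examples])
    return gain, (gain / split_info if split_info else 0.0)

# Hypothetical toy data: does 'outlook' predict 'play'?
data = [({"outlook": "sunny"}, "no"), ({"outlook": "sunny"}, "no"),
        ({"outlook": "overcast"}, "yes"), ({"outlook": "rain"}, "yes"),
        ({"outlook": "rain"}, "no")]
print(info_gain_and_ratio(data, "outlook"))  # gain ~0.571, ratio ~0.375
```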
Assessing Performance

Training data performance is typically optimistic, e.g., the error rate on the training data. Reasons:

- the classifier may not have enough data to fully learn the concept (but on the training data we don't know this)
- for noisy data, the classifier may overfit the training data

In practice we want to assess performance out of sample: how well will the classifier do on new, unseen data? This is the true test of what we have learned (just like a classroom).

With large data sets we can partition our data into 2 subsets, train and test:

- build a model on the training data
- assess performance on the test data (as in the sketch below)
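A minimal sketch of this train/test partition in Python (the 80/20 ratio and the train(...) function are illustrative assumptions, not from the slides):

```python
import random

def holdout_split(data, train_fraction=0.8, seed=0):
    """Shuffle the data and partition it into train and test subsets."""
    data = data[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

def accuracy(classifier, examples):
    """Fraction of (x, y) examples the classifier labels correctly."""
    return sum(classifier(x) == y for x, y in examples) / len(examples)

# Hypothetical usage with any train(...) -> classifier function:
# train_set, test_set = holdout_split(data)
# model = train(train_set)
# print("train acc:", accuracy(model, train_set))  # typically optimistic
# print("test acc:", accuracy(model, test_set))    # out-of-sample estimate
```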
Example of Test Performance

Restaurant problem:

- simulate 100 data sets of different sizes
- train on this data, and assess performance on an independent test set
- learning curve = plot accuracy as a function of training set size
- typical diminishing-returns effect
Example

[Figures: five example slides, images only.]
How Overfitting affects Prediction

[Figure: predictive error vs. model complexity. The curves show the error on the training data and the error on the test data, with the ideal range for model complexity marked between the underfitting region and the overfitting region.]
Training and Validation Data

[Figure: the full data set is split into training data and validation data.]

Idea: train each model on the training data, and then test each model's accuracy on the validation data.
The v-fold Cross-Validation Method

Why just choose one particular 90/10 split of the data? In principle we could do this multiple times.

v-fold cross-validation (e.g., v = 10):

- randomly partition our full data set into v disjoint subsets (each roughly of size n/v, where n = total number of training data points)
- for i = 1 to v (with v = 10): train on 90% of the data, and let Acc(i) = accuracy on the other 10%
- Cross-Validation-Accuracy = (1/v) Σ_i Acc(i)
- choose the method with the highest cross-validation accuracy

Common values for v are 5 and 10. Can also do leave-one-out, where v = n. (A code sketch follows below.)
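A minimal sketch of the v-fold procedure, assuming hypothetical train(...) and accuracy(...) helpers like those in the holdout sketch above; the data should be shuffled before the folds are formed:

```python
def cross_validation_accuracy(data, train, accuracy, v=10):
    """v-fold CV: partition into v disjoint folds, train on v-1 folds,
    validate on the held-out fold, and average the v accuracies."""
    folds = [data[i::v] for i in range(v)]   # v disjoint subsets, each ~n/v
    scores = []
    for i in range(v):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        scores.append(accuracy(model, held_out))
    return sum(scores) / v                   # (1/v) * sum_i Acc(i)
```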
Disjoint Validation Data Sets

[Figure: the full data set with successive disjoint validation partitions; the 1st partition holds out one block of the data for validation, the 2nd partition holds out a different, non-overlapping block, and so on.]
More on Cross-Validation

Notes:

- cross-validation generates an approximate estimate of how well the learned model will do on unseen data
- by averaging over different partitions it is more robust than just a single train/validate partition of the data
- v-fold cross-validation is a generalization: partition the data into v disjoint validation subsets of size n/v, then train, validate, and average over the v partitions (e.g., v = 10 is commonly used)
- v-fold cross-validation is approximately v times more computationally expensive than just fitting a model to all of the data
Let's look at another symbolic learner.
Problem Domain: POS Tagging

What is text tagging? Some sort of markup, enabling understanding of language. Tags can be word tags:

He will race/VERB the car.
He will not race/VERB the truck.
When will the race/NOUN end?
Why do we care?

Sometimes, meaning changes a lot. Transcribed speech lacks clear punctuation:

I called, John and Mary are there.
I called John and Mary are there.

(I called John) and (Mary are there.) ??
I called ((John and Mary) are there.)

We can tell which reading is meant, but can a computer? Here, it needs to know about verb forms and collocations.

Tagging can be important!

Quick! Wrap the bandage on the table around her leg!

Imagine a robotic medical assistant with this one . . .
Where is this used?

Any natural language task!

Translators: word-by-word translation does not always work; sentences need re-arranging.

It can help with OCR or voice transcription:

I need to writer. I'm a good write her.

"to writer"?? "a good write"?

I need to write her. I'm a good writer.
Some terms

Corpus: a big body of text, annotated (expert-tagged) or not.

Dictionary: a list of known words, and all possible parts of speech.

Lexical/Morphological vs. Contextual: is it a word property (spelling) or its surroundings (neighboring parts of speech)?

Semantics vs. Syntax: meaning (definition) vs. structure (phrases, parsing).

Tokenizer: separates text into words or other sized blocks (idioms, phrases, . . . ).

Disambiguator: an extra pass to reduce possible tags to a single one.
Some problems we face

Classification challenges:

- Large number of classes: English POS tagsets vary from 48 to 195 tags.
- Often ambiguous, varying with use/context. POS example: "There must be a way to go there; I know a person from there see that guy there?" (pron., adv., n.)
- Varying number of relevant features: spelling, position, surrounding words, paragraph position, article topic . . .
TBL: A Symbolic Learning Method

A method called error-driven Transformation-Based Learning (TBL), the Brill algorithm, can be used for symbolic learning. The rules (actually, a sequence of rules) are learned from an annotated corpus.

It performs about as accurately as other statistical approaches, and can treat context better than HMMs (as we'll see):

- rules which use the next (or previous) POS, where HMMs just use P(T_i | T_{i-1}) or P(T_i | T_{i-2}, T_{i-1})
- rules which use the previous (next) word, where HMMs just use P(W_i | T_i)
What does it do?

Transformation-Based Error-Driven Learning:

First, a dictionary tags every word with its most common POS. So, "run" is tagged as a verb in both "The run lasted 30 minutes" and "We run 3 miles every day".

Unknown capitalized words are assumed to be proper nouns, and remaining unknown words are assigned the most common tag for their three-letter ending ("blahblahous" is probably an adjective). A sketch of this initial annotator follows below.

Finally, the tags are updated by a set of patches, with the form "Change tag a to b if:"

- the word is in context C (e.g., the pattern of surrounding tags), or
- the word, or one in a region R, has lexical property P (e.g., capitalization).
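A minimal sketch of that initial-state annotator; the dictionary and suffix table below are illustrative stand-ins:

```python
def initial_tag(word, most_common_tag, suffix_tags):
    """Assign the most common POS from the dictionary; for unknown words,
    guess proper noun if capitalized, else the most common tag for the
    word's three-letter ending."""
    if word.lower() in most_common_tag:
        return most_common_tag[word.lower()]
    if word[0].isupper():
        return "NNP"                          # unknown + capitalized -> proper noun
    return suffix_tags.get(word[-3:], "NN")   # fall back on the 3-letter suffix

# Hypothetical dictionary and suffix statistics:
most_common_tag = {"the": "DT", "run": "NN", "we": "PRP"}
suffix_tags = {"ous": "JJ"}
print([initial_tag(w, most_common_tag, suffix_tags)
       for w in ["The", "run", "blahblahous"]])   # ['DT', 'NN', 'JJ']
```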
Rule Templates

Brill's method learns transformations which fit different templates.

Template: Change tag X to tag Y when the previous word is W.
Transformation: NN → VB when previous word = "to".

Template: Change tag X to tag Y when the previous tag is Z. Example:

The can rusted.
The (determiner) can (modal verb) rusted (verb) . (.)
Transformation: Modal → Noun when previous tag = DET
The (determiner) can (noun) rusted (verb) . (.)

Template: Change tag X to tag Y when the previous 1st, 2nd, or 3rd word is W.
Transformation: VBP → VB when one of the previous 3 words = "has".

The learning process is guided by a small number of templates (e.g., 26) to learn specific rules from the corpus. Note how these rules roughly match linguistic intuition. (A sketch of applying one such transformation follows below.)
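A minimal sketch of applying one instantiated transformation of the first template, using a made-up (word, tag) list representation:

```python
def apply_rule(tagged, from_tag, to_tag, prev_word):
    """Change from_tag to to_tag wherever the previous word matches prev_word."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][0] == prev_word:
            out[i] = (word, to_tag)
    return out

sentence = [("expected", "VBN"), ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
print(apply_rule(sentence, "NN", "VB", "to"))
# [('expected', 'VBN'), ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
```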
Brill Algorithm (Overview)

Assume you are given a training corpus G (for gold standard). First, create a tag-free version V of it, then do steps 1-4:

1. Initial-state annotator: label every word token in V with the most likely tag for that word type from G.
2. Consider every possible transformational rule: select the one that leads to the most improvement in V, using G to measure the error.
3. Retag V based on this rule.
4. Go back to 2, until there is no significant improvement in accuracy over the previous iteration.

Notes: as the algorithm proceeds, each successive rule covers fewer examples, but potentially more accurately. Some later rules may change tags changed by earlier rules. (A sketch of this loop follows below.)
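A minimal sketch of this greedy loop, simplified to a single rule template (change from_tag to to_tag when the previous word is W); all names are illustrative, and a real implementation would enumerate candidates from all templates:

```python
def apply_template(words, tags, rule):
    """Apply rule (from_tag, to_tag, prev_word): retag every matching position."""
    from_tag, to_tag, prev_word = rule
    return [to_tag if i > 0 and t == from_tag and words[i - 1] == prev_word else t
            for i, t in enumerate(tags)]

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def tbl_train(words, tags, gold, max_rules=50, min_gain=1):
    """Greedy TBL: at each pass pick the candidate rule with the largest net
    error reduction over the whole corpus, apply it, and record it."""
    learned = []
    for _ in range(max_rules):
        best_rule, best_gain = None, 0
        for i in range(1, len(words)):        # propose rules at error positions
            if tags[i] == gold[i]:
                continue
            rule = (tags[i], gold[i], words[i - 1])
            # Net gain counts both the errors fixed and the errors introduced
            gain = errors(tags, gold) - errors(apply_template(words, tags, rule), gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None or best_gain < min_gain:
            break                             # stopping criterion reached
        tags = apply_template(words, tags, best_rule)
        learned.append(best_rule)
    return learned
```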
Error-driven method

How does one learn the rules? The TBL method is error-driven. The rule which is learned on a given iteration is the one which reduces the error rate of the corpus the most, e.g.:

Rule 1 fixes 50 errors but introduces 25 more: the net decrease is 25.
Rule 2 fixes 45 errors but introduces 15 more: the net decrease is 30.
Choose rule 2 in this case.

We set a stopping criterion, or threshold: once we stop reducing the error rate by a big enough margin, learning is stopped.
Example of Error Reduction

[Figure: example of error reduction, from Eric Brill (1995), Computational Linguistics, 21(4), p. 7.]
Rule ordering

One rule is learned with every pass through the corpus, and the final output is the resulting sequence of rules. Unlike HMMs, such a representation allows a linguist to look through the rules and make sense of them.

Thus, the rules are learned iteratively and must be applied in the same iterative fashion. At one stage, it may make sense to change NN to VB after "to"; but at a later stage, it may make sense to change VB back to NN in the same context, e.g., if the current word is "school".
Example of Learned Rule Sequence

1. NN → VB PREVTAG TO
   to/TO race/NN → VB
2. VBP → VB PREV1OR2OR3TAG MD
   might/MD vanish/VBP → VB
3. NN → VB PREV1OR2TAG MD
   might/MD not/RB reply/NN → VB
4. VB → NN PREV1OR2TAG DT
   the/DT great/JJ feast/VB → NN
5. VBD → VBN PREV1OR2OR3TAG VBZ
   He/PP was/VBZ killed/VBD → VBN by/IN Chapman/NNP
Insights on TBL

TBL takes a long time to train, but is relatively fast at tagging once the rules are learned.

The rules in the sequence may be decomposed into non-interacting subsets, i.e., to focus only on VB tagging, one need only look at the rules which affect it.

In cases where the data is sparse, the initial guess needs to be weak enough to allow for learning.

Rules become increasingly specific as you go down the sequence. However, the more specific rules generally don't overfit, because they cover just a few cases.
Relation between DT and TBL
DT and TBL

A DT is a subset of TBL. For example, a one-test tree ("if X then A else B") is the transformation list:

1. Label with S
2. If X, then S → A
3. S → B
DT is a proper subset of TBL

There exists a problem that can be solved by TBL but not by a DT, for a fixed set of primitive queries.

Example: given a sequence of characters, classify each char based on its position: if pos % 4 == 0 then yes, else no. The input attributes available are the previous two characters (and, for TBL, their current labels).
Transformation list (applied left to right, so earlier changes are visible to later ones):

1. Label with S:
   A/S A/S A/S A/S A/S A/S A/S
2. If there is no previous character, then S → F:
   A/F A/S A/S A/S A/S A/S A/S
3. If the char two to the left is labeled with F, then S → F:
   A/F A/S A/F A/S A/F A/S A/F
4. If the char two to the left is labeled with F, then F → S:
   A/F A/S A/S A/S A/F A/S A/S
5. F → yes, S → no

The chars left labeled F are exactly those at positions 0 and 4, i.e., pos % 4 == 0. (A sketch of this run appears below.)
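A minimal sketch simulating this transformation list; each rule is applied left to right, so relabelings made earlier in the same pass are visible to later positions:

```python
def run_tbl(n):
    """Label positions 0..n-1; F ends up exactly where pos % 4 == 0."""
    labels = ["S"] * n                        # rule 1: label everything S
    if n > 0:
        labels[0] = "F"                       # rule 2: no previous char -> F
    for i in range(2, n):                     # rule 3: two-left labeled F: S -> F
        if labels[i - 2] == "F" and labels[i] == "S":
            labels[i] = "F"
    for i in range(2, n):                     # rule 4: two-left labeled F: F -> S
        if labels[i - 2] == "F" and labels[i] == "F":
            labels[i] = "S"
    return ["yes" if l == "F" else "no" for l in labels]

print(run_tbl(7))  # ['yes', 'no', 'no', 'no', 'yes', 'no', 'no']
```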
DT and TBL

TBL is more powerful than DT. The extra power of TBL comes from:

- transformations are applied in sequence
- results of previous transformations are visible to following transformations
Brill Algorithm (More Detailed)

1. Label every word token with its most likely tag (based on lexical generation probabilities).
2. List the positions of tagging errors and their counts, by comparing with the truth (T).
3. For each error position, consider each instantiation I of X, Y, and Z in the rule template. If Y = T, increment improvements[I]; else increment errors[I].
4. Pick the I which results in the greatest error reduction, and add it to the output. E.g., "VB → NN PREV1OR2TAG DT" improves on 98 errors but produces 18 new errors, for a net decrease of 80 errors.
5. Apply that I to the corpus.
6. Go to 2, unless the stopping criterion is reached.

Worked example. Most likely tag:

P(NN | race) = .98
P(VB | race) = .02

Is/VBZ expected/VBN to/TO race/NN tomorrow/NN

Rule template: change a word from tag X to tag Y when the previous tag is Z. Rule instantiation for the example above:

NN → VB PREV1OR2TAG TO

Applying this rule yields:

Is/VBZ expected/VBN to/TO race/VB tomorrow/NN
Handling Unknown Words

The Brill method can also be used to learn how to tag unknown words. Instead of using surrounding words and tags, use affix information, capitalization, etc.:

- guess NNP if capitalized, NN otherwise
- or use the tag most common for words ending in the same last 3 letters
- etc.

TBL has also been applied to some parsing tasks.

[Figure: example learned rule sequence for unknown words.]