Overfitting and TBL


  • Slide 1/46

    Overfitting & Transformation-Based Learning

    CS 371: Spring 2012

  • Slide 2/46

    Machine Learning

    Machines can learn from examples

    Learning modifies the agent's decision mechanisms to improve performance

    Given training data, machines analyze the data and learn rules which generalize to new examples

    Can be sub-symbolic (the rule may be a mathematical function) or symbolic (the rules are in a representation similar to that used for hand-coded rules)

    In general, machine learning approaches allow for more tuning to the needs of a corpus, and can be reused across corpora

  • Slide 3/46

    Training data example

    Inductive learning

    Empirical error function:

    E(h) = Σ_x distance[h(x; θ), f(x)]

    Empirical learning = finding h(x), or h(x; θ), that minimizes E(h)

    Note an implicit assumption: For any set of attribute values there is a unique target value

    This in effect assumes a no-noise mapping from inputs to targets

    This is often not true in practice (e.g., in medicine).
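    To make the error function concrete, here is a minimal Python sketch (the threshold hypothesis h, the parameter theta, and the toy data are hypothetical; the distance used is 0/1 loss):

        def h(x, theta):
            # A simple threshold hypothesis: predict 1 if x >= theta, else 0.
            return 1 if x >= theta else 0

        def empirical_error(hypothesis, theta, data):
            # data is a list of (x, target) pairs; distance is 0/1 loss.
            return sum(1 for x, target in data if hypothesis(x, theta) != target)

        training_data = [(0.2, 0), (0.4, 0), (0.5, 1), (0.6, 1), (0.9, 1)]
        print(empirical_error(h, 0.55, training_data))  # 1 example is misclassified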

  • Slide 4/46

    Learning Boolean Functions

    Given examples of the function, can we learn the function?

    2^(2^d) different Boolean functions can be defined on d attributes

    This is the size of our hypothesis space

    Observations:

    Huge hypothesis space → directly searching over all functions is impossible. Given small data (n pairs), our learning problem may be underconstrained

    Ockham's razor: if multiple candidate functions all explain the data equally well, pick the simplest explanation (the least complex function)

    Constrain our search to classes of Boolean functions, e.g.,

    decision trees
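    A quick sanity check on the hypothesis-space size (a small sketch; the numbers follow directly from the 2^(2^d) formula above):

        # Number of distinct Boolean functions definable on d Boolean attributes.
        for d in range(1, 7):
            print(d, 2 ** (2 ** d))
        # d = 6 already gives 2**64, about 1.8e19 functions: far too many to search directly.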

  • Slide 5/46

    Decision Tree Learning

    Constrain h(..) to be a decision tree

  • Slide 6/46

    Pseudocode for Decision tree learning
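    The pseudocode on the original slide is an image; below is a minimal Python sketch of the standard recursive algorithm, assuming a choose_attribute quality measure (e.g., information gain) is supplied. Names such as majority_value are placeholders, not the slide's exact wording.

        from collections import Counter

        def majority_value(examples):
            # Most common target label among the examples.
            return Counter(label for _, label in examples).most_common(1)[0][0]

        def dtl(examples, attributes, default, choose_attribute):
            # examples: list of (attribute_dict, label); attributes: list of attribute names.
            if not examples:
                return default
            labels = {label for _, label in examples}
            if len(labels) == 1:              # all examples agree on the label
                return labels.pop()
            if not attributes:                # no attributes left: take a majority vote
                return majority_value(examples)
            best = choose_attribute(attributes, examples)
            tree = {best: {}}
            for v in {attrs[best] for attrs, _ in examples}:
                subset = [(a, l) for a, l in examples if a[best] == v]
                rest = [x for x in attributes if x != best]
                tree[best][v] = dtl(subset, rest, majority_value(examples), choose_attribute)
            return tree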

  • Slide 7/46

    Major issues

    Q1: Choosing best attribute: what quality measure to use?

    Q2: Handling training data with missing attribute values

    Q3: Handling training data with noise, irrelevant attributes

    - Determining when to stop splitting: avoid overfitting

  • Slide 8/46

    Major issues

    Q1: Choosing best attribute: different quality measures.

    Information gain, gain ratio

    Q2: Handling training data with missing attribute values: blank value, most common value, or fractional count

    Q3: Handling training data with noise, irrelevant attributes:

    - Determining when to stop splitting: ????
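    For Q1, a minimal sketch of the entropy-based information gain measure (assuming discrete attributes and examples given as (attribute-dict, label) pairs):

        import math
        from collections import Counter

        def entropy(examples):
            counts = Counter(label for _, label in examples)
            total = len(examples)
            return -sum((c / total) * math.log2(c / total) for c in counts.values())

        def information_gain(attribute, examples):
            # Entropy of the parent minus the weighted entropy of the splits.
            total = len(examples)
            remainder = 0.0
            for v in {attrs[attribute] for attrs, _ in examples}:
                subset = [(a, l) for a, l in examples if a[attribute] == v]
                remainder += len(subset) / total * entropy(subset)
            return entropy(examples) - remainder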

  • Slide 9/46

    Assessing Performance

    Training data performance is typically optimistic

    e.g., error rate on training data

    Reasons?

    - the classifier may not have enough data to fully learn the concept (but on training data we don't know this)

    - for noisy data, the classifier may overfit the training data

    In practice we want to assess performance out of sample

    how well will the classifier do on new unseen data? This is the

    true test of what we have learned (just like a classroom)

    With large data sets we can partition our data into 2 subsets, train and test

    - build a model on the training data

    - assess performance on the test data
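    A minimal sketch of that protocol; fit and evaluate are placeholders for whatever learner and accuracy measure are in use:

        import random

        def train_test_split(data, test_fraction=0.2, seed=0):
            # Randomly partition the data into a training set and a held-out test set.
            rng = random.Random(seed)
            shuffled = data[:]
            rng.shuffle(shuffled)
            n_test = int(len(shuffled) * test_fraction)
            return shuffled[n_test:], shuffled[:n_test]   # (train, test)

        # train, test = train_test_split(all_examples)   # hypothetical data set
        # model = fit(train)                              # build a model on the training data
        # accuracy = evaluate(model, test)                # assess performance on the test data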

  • Slide 10/46

    Example of Test Performance

    Restaurant problem

    - simulate 100 data sets of different sizes

    - train on this data, and assess performance on an independent test set

    - learning curve = plotting accuracy as a function of training set size

    - typical diminishing returns effect

  • Slides 11-15/46

    Example

    [These slides contain only figures (no extractable text); they are not reproduced in this transcription.]

  • Slides 16-18/46

    How Overfitting affects Prediction

    [Figure, built up across three slides: predictive error plotted against model complexity. Error on the training data keeps decreasing as complexity grows, while error on the test data first falls and then rises again. The ideal range for model complexity lies between the underfitting region and the overfitting region.]

  • Slide 19/46

    Training and Validation Data

    [Figure: the full data set is split into training data and validation data.]

    Idea: train each model on the training data and then test each model's accuracy on the validation data

  • Slide 20/46

    The v-fold Cross-Validation Method

    Why just choose one particular 90/10 split of the data?

    In principle we could do this multiple times

    v-fold Cross-Validation (e.g., v=10)

    randomly partition our full data set into v disjoint subsets (each roughly of size n/v, where n = total number of training data points)

    for i = 1:10 (here v = 10)

    train on 90% of data,

    Acc(i) = accuracy on other 10%

    end

    Cross-Validation Accuracy = (1/v) Σ_i Acc(i)

    choose the method with the highest cross-validation accuracy

    common values for v are 5 and 10

    Can also do leave-one-out where v = n
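    A minimal sketch of the v-fold procedure (fit and accuracy are placeholders for the learner and metric being compared):

        import random

        def cross_validation_accuracy(data, fit, accuracy, v=10, seed=0):
            # Average held-out accuracy over v disjoint folds of the data.
            rng = random.Random(seed)
            shuffled = data[:]
            rng.shuffle(shuffled)
            folds = [shuffled[i::v] for i in range(v)]      # v disjoint subsets, each ~n/v
            scores = []
            for i in range(v):
                validation = folds[i]
                training = [x for j, fold in enumerate(folds) if j != i for x in fold]
                model = fit(training)
                scores.append(accuracy(model, validation))
            return sum(scores) / v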

  • Slides 21-22/46

    Disjoint Validation Data Sets

    [Figure, built up across two slides: the full data set with a different disjoint validation subset held out on each fold (1st partition, 2nd partition, ...); the rest is used as training data.]

  • Slide 23/46

    More on Cross-Validation

    Notes

    cross-validation generates an approximate estimate of how well the learned model will do on unseen data

    by averaging over different partitions it is more robust than just a single train/validate partition of the data

    v-fold cross-validation is a generalization

    partition data into disjoint validation subsets of size n/v

    train, validate, and average over the v partitions

    e.g., v=10 is commonly used

    v-fold cross-validation is approximately v times more computationally expensive than just fitting a model to all of the data

  • Slide 24/46

    Let's look at another symbolic learner

  • Slide 25/46

    Problem Domain: POS Tagging

    What is text tagging?

    Some sort of markup, enabling understanding of language.

    Can be word tags:

    He will race/VERB the car.

    He will not race/VERB the truck.

    When will the race/NOUN end?

  • Slide 26/46

    Why do we care?

    Sometimes, meaning changes a lot. Transcribed speech lacks clear punctuation:

    I called, John and Mary are there.

    I called John and Mary are there.

    (I called John) and (Mary are there.) ??

    I called ((John and Mary) are there.)

    We can tell, but can a computer?

    Here, one needs to know about verb forms and collections

    Can be important!

    Quick! Wrap the bandage on the table around her leg!

    Imagine a robotic medical assistant with this one . . .

  • Slide 27/46

    Where is this used?

    Any natural language task!

    Translators: word-by-word translation does not always work; sentences need re-arranging.

    It can help with OCR or voice transcription

    I need to writer. I'm a good write her.

    to writer?? a good write?

    I need to write her. I'm a good writer.

  • Slide 28/46

    Some terms

    Corpus

    Big body of text, annotated (expert-tagged) or not

    Dictionary

    List of known words, and all possible parts of speech

    Lexical/Morphological vs. Contextual

    Is it a word property (spelling) or surroundings (neighboring parts of speech)?

    Semantics vs. Syntax

    Meaning (definition) vs. Structure (phrases, parsing)

    Tokenizer

    Separates text into words or other-sized blocks (idioms, phrases, ...)

    Disambiguator

    Extra pass to reduce possible tags to a single one.

  • Slide 29/46

    Some problems we face

    Classification challenges:

    Large number of classes: English POS: varying tagsets, 48 to 195 tags

    Often ambiguous, varying with use/context

    POS: There must be a way to go there; I know a person from there; see that guy there?

    (pron., adv., n.)

    Varying number of relevant features

    Spelling, position, surrounding words, paragraph

    position, article topic . . .

  • Slide 30/46

    TBL: A Symbolic Learning Method

    A method called error-driven Transformation-Based Learning

    (TBL) (the Brill algorithm) can be used for symbolic learning

    The rules (actually, a sequence of rules) are learned from an annotated corpus

    Performs about as accurately as other statistical approaches

    Can have better treatment of context compared to HMMs (as we'll see)

    rules which use the next (or previous) POS

    HMMs just use P(Ti | Ti-1) or P(Ti | Ti-2, Ti-1)

    rules which use the previous (next) word

    HMMs just use P(Wi|Ti)

  • Slide 31/46

    What does it do?

    Transformation-Based Error-Driven Learning:

    First, a dictionary tags every word with its most common POS. So, run is tagged as a verb in both:

    The run lasted 30 minutes and We run 3 miles every day

    Unknown capitalized words are assumed to be proper nouns, and remaining unknown words are assigned the most common tag for their three-letter ending.

    blahblahous is probably an adjective.

    Finally, the tags are updated by a set of patches, with the form: Change tag a to b if:

    The word is in context C (e.g., the pattern of surrounding tags)

    The word, or one in a region R, has lexical property P (e.g., capitalization)

  • Slide 32/46

    Rule Templates

    Brill's method learns transformations which fit different templates

    Template: Change tag X to tag Y when previous word is W

    Transformation: NN → VB when previous word = to

    Change tag X to tag Y when previous tag is Z

    Ex:

    The can rusted.

    The (determiner) can (modal verb) rusted (verb) . (.)

    Transformation: Modal → Noun when previous tag = DET

    The (determiner) can (noun) rusted (verb) . (.)

    Change tag X to tag Y when previous 1st, 2nd, or 3rd word is W

    Transformation: VBP → VB when one of the previous 3 words = has

    The learning process is guided by a small number of templates (e.g., 26) to learn specific rules from the corpus

    Note how these rules sort of match linguistic intuition
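    As an illustration, here is a minimal sketch of applying one transformation of the form "change tag X to tag Y when the previous word is W" to a tagged sentence (the tags and sentence are illustrative, not Brill's exact data structures):

        def apply_prev_word_rule(tagged, x, y, w):
            # tagged is a list of (word, tag) pairs; returns a new list with the rule applied.
            out = list(tagged)
            for i in range(1, len(out)):
                word, tag = out[i]
                if tag == x and out[i - 1][0].lower() == w:
                    out[i] = (word, y)
            return out

        sentence = [("We", "PRP"), ("want", "VB"), ("to", "TO"), ("race", "NN")]
        print(apply_prev_word_rule(sentence, "NN", "VB", "to"))
        # [('We', 'PRP'), ('want', 'VB'), ('to', 'TO'), ('race', 'VB')]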

  • Slide 33/46

    Brill Algorithm (Overview)

    Assume you are given a training corpus G (for gold standard)

    First, create a tag-free version V of it, then do steps 1-4

    1. Initial-state annotator: label every word token in V with the most likely tag for that word type from G.

    2. Consider every possible transformational rule: select the one that leads to the most improvement in V, using G to measure the error

    3. Retag V based on this rule

    4. Go back to 2, until there is no significant improvement in accuracy over the previous iteration

    Notes: As the algorithm proceeds, each successive rule covers fewer examples, but potentially more accurately

    Some later rules may change tags changed by earlier rules
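    A compact sketch of that loop in Python (greedy and error-driven; candidate_rules and apply_rule are placeholders that would enumerate and apply instantiations of the templates):

        def tbl_learn(gold, initial_tags, candidate_rules, apply_rule, min_gain=1):
            # gold: list of (word, correct_tag); initial_tags: output of the initial-state annotator.
            tags = list(initial_tags)
            learned = []
            while True:
                best_rule, best_gain = None, 0
                for rule in candidate_rules(tags, gold):
                    new_tags = apply_rule(tags, rule)
                    # net gain = errors fixed minus errors introduced
                    gain = (sum(t == g for t, (_, g) in zip(new_tags, gold))
                            - sum(t == g for t, (_, g) in zip(tags, gold)))
                    if gain > best_gain:
                        best_rule, best_gain = rule, gain
                if best_rule is None or best_gain < min_gain:
                    break                      # stopping criterion: no big enough improvement
                tags = apply_rule(tags, best_rule)
                learned.append(best_rule)      # rules must later be applied in this same order
            return learned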

  • Slide 34/46

    Error-driven method

    How does one learn the rules? The TBL method is error-driven

    The rule which is learned on a given iteration is the one which reduces the error rate of the corpus the most, e.g.:

    Rule 1 fixes 50 errors but introduces 25 more, so the net decrease is 25

    Rule 2 fixes 45 errors but introduces 15 more, so the net decrease is 30

    Choose rule 2 in this case

    We set a stopping criterion, or threshold: once we stop reducing the error rate by a big enough margin, learning is stopped

  • Slide 35/46

    Example of Error Reduction

    [Figure from the cited paper; not reproduced in this transcription.]

    From Eric Brill (1995), Computational Linguistics, 21(4), p. 7

  • Slide 36/46

    Rule ordering

    One rule is learned with every pass through the corpus.

    The set of final rules is the final output

    Unlike HMMs, such a representation allows a linguist to look through the rules and make more sense of them

    Thus, the rules are learned iteratively and must be applied in an iterative fashion.

    At one stage, it may make sense to change NN to VB after to

    But at a later stage, it may make sense to change VB back to NN in the same context, e.g., if the current word is school

  • Slide 37/46

    Example of Learned Rule Sequence

    1. NN → VB PREVTAG TO

    to/TO race/NN → VB

    2. VBP → VB PREV1OR2OR3TAG MD

    might/MD vanish/VBP → VB

    3. NN → VB PREV1OR2TAG MD

    might/MD not/RB reply/NN → VB

    4. VB → NN PREV1OR2TAG DT

    the/DT great/JJ feast/VB → NN

    5. VBD → VBN PREV1OR2OR3TAG VBZ

    He/PP was/VBZ killed/VBD → VBN by/IN Chapman/NNP

  • Slide 38/46

    Insights on TBL

    TBL takes a long time to train, but is relatively fast at tagging once the rules are learned

    The rules in the sequence may be decomposed into non-interacting subsets, i.e., only focus on VB tagging (you need only look at the rules which affect it)

    In cases where the data is sparse, the initial guess needs to be weak enough to allow for learning

    Rules become increasingly specific as you go down the sequence.

    However, the more specific rules generally don't overfit because they cover just a few cases

  • Slide 39/46

    Relation between DT and TBL

  • Slide 40/46

    DT and TBL

    DT is a subset of TBL

    1. Label with S

    2. If X then S → A

    3. S → B

  • Slide 41/46

    DT is a proper subset of TBL

    There exists a problem that can be solved by TBL but not by a DT, for a fixed set of primitive queries.

    Ex: Given a sequence of characters

    Classify a char based on its position

    If pos % 4 == 0 then yes else no

    Input attributes available: previous two chars

  • Slide 42/46

    Transformation list:

    Label with S: A/S A/S A/S A/S A/S A/S A/S

    If there is no previous character, then S → F:

    A/F A/S A/S A/S A/S A/S A/S

    If the char two to the left is labeled with F, then S → F:

    A/F A/S A/F A/S A/F A/S A/F

    If the char two to the left is labeled with F, then F → S:

    A/F A/S A/S A/S A/F A/S A/S

    F → yes
    S → no
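    A small sketch that simulates this transformation list, assuming each rule is applied left to right so that changes it has already made are visible at later positions (this is one reading of the slide's example, and it reproduces the labels shown above):

        def label_positions(n):
            labels = ["S"] * n                        # 1. label every character with S
            if n > 0:                                 # 2. if there is no previous character, S -> F
                labels[0] = "F"
            for i in range(2, n):                     # 3. if the char two to the left is F, S -> F
                if labels[i - 2] == "F" and labels[i] == "S":
                    labels[i] = "F"
            for i in range(2, n):                     # 4. if the char two to the left is F, F -> S
                if labels[i - 2] == "F" and labels[i] == "F":
                    labels[i] = "S"
            return ["yes" if label == "F" else "no" for label in labels]

        print(label_positions(7))
        # ['yes', 'no', 'no', 'no', 'yes', 'no', 'no']: exactly the positions where pos % 4 == 0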

  • Slide 43/46

    DT and TBL

    TBL is more powerful than DT

    Extra power of TBL comes from

    Transformations are applied in sequence

    Results of previous transformations are visible to following transformations.

  • Slide 44/46

    [No text extracted from this slide.]

  • Slide 45/46

    Brill Algorithm (More Detailed)

    1. Label every word token with its most likely tag (based on lexical generation probabilities).

    2. List the positions of tagging errors and their counts, by comparing with truth (T).

    3. For each error position, consider each instantiation I of X, Y, and Z in the rule template. If Y = T, increment improvements[I], else increment errors[I].

    4. Pick the I which results in the greatest error reduction, and add it to the output.

    Example: VB → NN PREV1OR2TAG DT improves on 98 errors, but produces 18 new errors, so a net decrease of 80 errors.

    5. Apply that I to the corpus.

    6. Go to 2, unless the stopping criterion is reached.

    Worked example:

    Most likely tag: P(NN|race) = .98, P(VB|race) = .02

    Is/VBZ expected/VBN to/TO race/NN tomorrow/NN

    Rule template: Change a word from tag X to tag Y when the previous tag is Z

    Rule instantiation for the above example: NN → VB PREV1OR2TAG TO

    Applying this rule yields: Is/VBZ expected/VBN to/TO race/VB tomorrow/NN

  • Slide 46/46

    Handling Unknown Words

    Can also use the Brill method to learn how to tag unknown words

    Instead of using surrounding words and tags, use affix info, capitalization, etc.

    Guess NNP if capitalized, NN otherwise.

    Or use the tag most common for words ending in the last 3 letters.

    etc.

    TBL has also been applied to some parsing tasks

    [Figure: example learned rule sequence for unknown words; not reproduced here.]