NLP: POS and N-gram Models


    Slide 0

    Outline

    Why part of speech tagging?

    Word classes

    Tag sets and problem definition

    N-Grams


    Slide 1

    POS

The process of assigning a part-of-speech or other lexical class marker to each word in a corpus (Jurafsky and Martin).

(Figure: the words "the mother kissed the child on the cheek" mapped to the tag set N, V, P, DET)


    Slide 2

    An Example

WORD      LEMMA    TAG
the       the      DET
mother    mother   NOUN
kissed    kiss     VPAST
the       the      DET
child     child    NOUN
on        on       PREP
the       the      DET
cheek     cheek    NOUN


    Slide 3

    Word Classes and Tag Sets

Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, ...

    Tag sets

Vary in number of tags: from about a dozen to over 200. The size of a tag set depends on the language, objectives, and purpose.

Open vs. closed classes

Open: nouns, verbs, adjectives, adverbs

Closed: determiners, pronouns, prepositions, conjunctions


    Slide 4

    Open Class Words

    Every known human language has nouns and verbs

Nouns: people, places, things

Classes of nouns: proper vs. common; count vs. mass

    Verbs: actions and processes

    Adjectives: properties, qualities

Adverbs: Unfortunately, John walked home extremely slowly yesterday

Numerals: one, two, three, third, ...


    Slide 5

    Closed Class Words

Differ more from language to language and over time than open class words

Examples:

prepositions: on, under, over, ...

particles: up, down, on, off, ...

determiners: a, an, the, ...

pronouns: she, who, I, ...

conjunctions: and, but, or, ...

auxiliary verbs: can, may, should, ...


    Slide 6

    Prepositions from CELEX


    Slide 7

    Pronouns in CELEX


    Slide 8

    Conjunctions


    Slide 9

    Auxiliaries


    Slide 10

    Word Classes: Tag set example

PRP: personal pronoun (I, you, he, she, it)

PRP$: possessive pronoun (my, your, his, her, its)


    Slide 11

Example of Penn Treebank Tagging of a Brown Corpus Sentence

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

    VB DT NN .

    Book that flight .

    VBZ DT NN VB NN ?

    Does that flight serve dinner ?


    Slide 12

    The Problem

Words often have more than one word class: this

This is a nice day = PRP

    This day is nice = DT

    You can go this far = RB


    Slide 13

    Part-of-Speech Tagging

Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)

    Stochastic Tagger: HMM-based

    Transformation-Based Tagger (Brill)


    Slide 14

    Word prediction

Guess the next word... I notice three guys standing on the ???

- by simply looking at the preceding words and keeping track of some fairly simple counts.

    We can formalize this task using what are called N-gram models.

    N-grams are token sequences of length N.

The above example contains the following 2-grams (bigrams): (I notice), (notice three), (three guys), (guys standing), (standing on), (on the)
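
A minimal sketch of bigram extraction in Python (illustrative, not from the original slides), assuming simple whitespace tokenization:

    # Sketch: extract n-grams from a token sequence (whitespace tokenization assumed).
    def ngrams(tokens, n=2):
        """Return all n-grams (as tuples) in a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "I notice three guys standing on the".split()
    print(ngrams(tokens, 2))
    # [('I', 'notice'), ('notice', 'three'), ('three', 'guys'),
    #  ('guys', 'standing'), ('standing', 'on'), ('on', 'the')]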


    Slide 15

    N-Gram Models

More formally, we can use knowledge of the counts of N-grams to assess the conditional probability of candidate words as the next word in a sequence.

Or, we can use them to assess the probability of an entire sequence of words.
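
A minimal sketch of both uses (illustrative, not from the slides): maximum-likelihood estimates from bigram counts, with a toy corpus standing in for real training data:

    from collections import Counter

    # Toy pre-tokenized corpus; a real model would be trained on a large corpus.
    corpus = "I notice three guys standing on the corner".split()

    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def p_next(word, prev):
        """P(word | prev) = Count(prev word) / Count(prev). No smoothing:
        an unseen context would raise ZeroDivisionError in this sketch."""
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    def p_sequence(tokens):
        """Approximate P(w1 ... wn) with the bigram chain rule
        (the start-of-sentence term is omitted in this sketch)."""
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= p_next(word, prev)
        return p

    print(p_next("guys", "three"))                    # 1.0 in this toy corpus
    print(p_sequence("three guys standing".split()))  # 1.0 in this toy corpus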


    Slide 16

    Counting

Simple counting lies at the core of any probabilistic approach. So let's first take a look at what we're counting.

He stepped out into the hall, was delighted to encounter a water brother.

15 tokens (counting punctuation)

Not always that simple:

I do uh main- mainly business data processing

Spoken language poses various challenges. Should we count uh and other fillers as tokens? What about the repetition of mainly?

The answers depend on the application. If we're focusing on something like ASR to support indexing for search, then uh isn't helpful (it's not likely to occur as a query). But filled pauses are very useful in dialog management, so we might want them there.


    Slide 17

    Counting: Types and Tokens

How about:

They picnicked by the pool, then lay back on the grass and looked at the stars.

18 tokens (again counting punctuation)

But we might also note that the is used 3 times, so there are only 16 unique types (as opposed to tokens).
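
A small sketch that reproduces these counts (illustrative, not from the slides), assuming a regex tokenizer that treats each punctuation mark as its own token:

    import re

    def tokenize(text):
        """Split text into word tokens and punctuation tokens."""
        return re.findall(r"\w+|[^\w\s]", text)

    s1 = "He stepped out into the hall, was delighted to encounter a water brother."
    s2 = "They picnicked by the pool, then lay back on the grass and looked at the stars."

    print(len(tokenize(s1)))        # 15 tokens
    tokens = tokenize(s2)
    print(len(tokens))              # 18 tokens
    print(len(set(tokens)))         # 16 types ("the" occurs 3 times)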


    Slide 18

    Counting: Corpora

So what happens when we look at large bodies of text instead of single utterances?

Brown et al. (1992) large corpus of English text:

583 million wordform tokens

293,181 wordform types

Google:

Crawl of 1,024,908,267,229 English tokens

13,588,391 wordform types

That seems like a lot of types... After all, even large dictionaries of English have only around 500k types. Why so many here?

    Numbers

    Misspellings

    Names

    Acronyms

etc.


    Slide 19

    Language Modeling

    Back to word prediction

We can model the word prediction task as the ability to assess the conditional probability of a word given the previous words in the sequence:

P(wn | w1, w2, ..., wn-1)

Calculating conditional probability

One way is to use the definition of conditional probabilities and look for counts. So to get

P(the | its water is so transparent that)

By definition that's

P(its water is so transparent that the) / P(its water is so transparent that)

    We can get each of those from counts in a large corpus.


    Slide 20

    Very Easy Estimate

How to estimate P(the | its water is so transparent that)?

P(the | its water is so transparent that) =

Count(its water is so transparent that the) / Count(its water is so transparent that)

According to Google those counts are 5 and 9, so 5/9. But 2 of those hits were to these slides... so maybe it's really 3/7.
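
A sketch of this "very easy estimate" as code (hypothetical, not from the slides), assuming the corpus is available as one long string and using naive substring counting (word-boundary handling is omitted):

    def p_next_word(corpus_text, history, word):
        """Estimate P(word | history) as
        Count(history + word) / Count(history), by raw substring counts."""
        hist_count = corpus_text.count(history)
        if hist_count == 0:
            return 0.0  # history never seen; a real model would back off or smooth
        return corpus_text.count(history + " " + word) / hist_count

    # p_next_word(web_text, "its water is so transparent that", "the")
    # With counts of 9 and 5 as above, this gives 5/9.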