CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 8 – POS tagset), Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 17th Jan, 2012

  • HMM: Three Problems

    [Figure: NLP Trinity. Problem axis: Morph Analysis, Part of Speech (POS) Tagging, Parsing, Semantics. Language axis: Hindi, Marathi, English, French. Algorithm axis: HMM, CRF, MEMM.]

    • Problem 1: Likelihood of a sequence
        • Forward Procedure
        • Backward Procedure
    • Problem 2: Best state sequence
        • Viterbi Algorithm
    • Problem 3: Re-estimation
        • Baum-Welch (Forward-Backward Algorithm)

  • Tagged Corpora

    ^_^ “_“ The_DT guys_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, ”_” said_VBD Mr._NNP Benton_NNP ._. $_$
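The word_TAG notation of the tagged corpus can be read back into (word, tag) pairs. This is a minimal sketch; the function name `parse_tagged` and the choice to split on the last underscore are assumptions made for illustration, not part of the lecture.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST underscore, so the tag is whatever
        # follows it; this also handles tokens such as '^_^' and '._.'.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


sample = "^_^ The_DT guys_NNS that_WDT make_VBP traditional_JJ hardware_NN ._. $_$"
pairs = parse_tagged(sample)
print(pairs)
```

Splitting on the last underscore rather than the first keeps the tag intact even if a word itself contains an underscore.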

  • For Hindi

    • Rama achhaa gaata hai. (hai is VAUX: auxiliary verb); Ram sings well
    • Rama achhaa ladakaa hai. (hai is VCOP: copula verb); Ram is a good boy

  • Example of difficulty in POS tagging

  • Tags

    [Figure: tag hierarchy.
     Content Word: Noun, Adjective, Verb.
     Function Word: Pronoun, Preposition, Conjunction, Interjection.
     Noun tags: Proper Noun: NNP (for NER); Common Noun: NN, NNS.
     Verb tags: VBP, VBD, VBG, VBN.]

  • Difficulty in POS Tagging

    Consider the following sentences:

    राम अच्छा गाता है_VAUX (auxiliary verb)
    Ram good sing is : Ram sings well

    GNPTAM (gender, number, person, tense, aspect, mood) for ‘गाता’ only: Male, Singular, ??, ??, ??, -
    GNPTAM for ‘गाता है’: Male, Singular, 2nd or 3rd, Present, Default, Declarative

    राम अच्छा लड़का है_VCOP (copular verb)
    Ram good boy is : Ram is a good boy

    In general, VAUX, VM (main verb) and VCOP cannot be separated easily.

  • Difficulty in POS Tagging

    To POS tag based on rules, one simple rule for है could be:

    • है preceded by a verb: VAUX
    • है preceded by a nominal: VCOP (facilitates co-reference, सामानाधिकरण)

    This is a ‘High Precision, Low Recall’ rule: when it says Yes, the answer is indeed Yes, but when it says No, the answer may not actually be No.
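The rule above can be sketched as a lookup on the tag of the word preceding hai. The tag sets `VERB_TAGS` and `NOMINAL_TAGS` and the function name `tag_hai` are hypothetical; in particular, counting adjectives among the nominals is an assumption made here for illustration.

```python
# Hypothetical coarse tag sets, chosen only for this illustration.
VERB_TAGS = {"VM"}                  # main verb
NOMINAL_TAGS = {"N", "NNP", "JJ"}   # assumption: adjectives count as nominals


def tag_hai(prev_tag):
    """Tag 'hai' from the tag of the word immediately before it.

    Returns 'VAUX', 'VCOP', or None when the rule does not fire.
    """
    if prev_tag in VERB_TAGS:
        return "VAUX"
    if prev_tag in NOMINAL_TAGS:
        return "VCOP"
    return None


# Rama achhaa gaata hai: the word before 'hai' is the main verb 'gaata'.
print(tag_hai("VM"))
# Rama achhaa ladakaa hai: the word before 'hai' is the noun 'ladakaa'.
print(tag_hai("N"))
```

Because the rule looks only at the immediately preceding word, any sentence where a particle or adjective comes between the main verb and hai will be tagged by the wrong branch or not at all.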

  • Exceptions to the previous rule

    • False negative for VAUX
        • Particle injection (particles: भी-bhi, तो-to, ही-hi, नहीं-nahi)
          राम गाता तो अच्छा है, पर ...
    • Consider the following sentences:
          राम अच्छा है_VCOP
          राम तो गाता अच्छा है_VAUX
      The POS tags of है vary here despite the preceding word being an adjective in both sentences.

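  • Evaluation of POS Tag Accuracy

    • Precision, Recall and F-Score

    [Figure: Venn diagram of two overlapping sets. Given G: what our system returns. Ideal I: the actual tags. The overlap G ∩ I is the agreement; the part of G outside I holds the false positives; the part of I outside G holds the false negatives.]

    • Precision P = |G ∩ I| / |G|; Recall R = |G ∩ I| / |I|
    • F-Score = 2PR / (P + R)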
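As a minimal sketch of these measures, the system output G and the ideal tags I can be held as sets of (position, tag) pairs; precision divides the agreement by |G| and recall divides it by |I|. The function name and the toy data are assumptions for illustration only.

```python
def precision_recall_f(system, ideal):
    """Compute (P, R, F) for two sets of (position, tag) pairs."""
    agreement = len(system & ideal)                        # |G ∩ I|
    p = agreement / len(system) if system else 0.0         # |G ∩ I| / |G|
    r = agreement / len(ideal) if ideal else 0.0           # |G ∩ I| / |I|
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0        # harmonic mean
    return p, r, f


# Toy data: a cautious rule tags only two of four tokens, both correctly.
ideal = {(0, "VAUX"), (1, "VCOP"), (2, "VAUX"), (3, "VCOP")}
system = {(0, "VAUX"), (1, "VCOP")}
p, r, f = precision_recall_f(system, ideal)
print(p, r, f)
```

The toy data mirrors a high-precision, low-recall rule: everything it returns is right, but it misses half of the ideal tags.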

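  • POS tag computation (1/2)

    Best tag sequence:

    $$T^{*} = \arg\max_{T} P(T \mid W) = \arg\max_{T} P(T)\,P(W \mid T) \quad \text{(by Bayes' theorem)}$$

    With $t_0$ = ^ (sentence start) and $t_{n+1}$ = . (sentence end):

    $$
    \begin{aligned}
    P(T) &= P(t_0\, t_1\, t_2 \ldots t_{n+1}) \\
         &= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1 t_0)\,P(t_3 \mid t_2 t_1 t_0) \cdots P(t_n \mid t_{n-1} t_{n-2} \ldots t_0)\,P(t_{n+1} \mid t_n t_{n-1} \ldots t_0) \\
         &= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})\,P(t_{n+1} \mid t_n) \\
         &= \prod_{i=0}^{n+1} P(t_i \mid t_{i-1}) \quad \text{(Bigram Assumption)}
    \end{aligned}
    $$

    Here the $i = 0$ factor $P(t_0 \mid t_{-1})$ is read as $P(t_0)$.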

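  • POS tag computation (2/2)

    $$
    P(W \mid T) = P(w_0 \mid t_0 \ldots t_{n+1})\,P(w_1 \mid w_0,\, t_0 \ldots t_{n+1})\,P(w_2 \mid w_1 w_0,\, t_0 \ldots t_{n+1}) \cdots P(w_n \mid w_0 \ldots w_{n-1},\, t_0 \ldots t_{n+1})\,P(w_{n+1} \mid w_0 \ldots w_n,\, t_0 \ldots t_{n+1})
    $$

    Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

    $$
    \begin{aligned}
    P(W \mid T) &= P(w_0 \mid t_0)\,P(w_1 \mid t_1) \cdots P(w_{n+1} \mid t_{n+1}) \\
                &= \prod_{i=0}^{n+1} P(w_i \mid t_i) \\
                &= \prod_{i=1}^{n+1} P(w_i \mid t_i) \quad \text{(Lexical Probability Assumption)}
    \end{aligned}
    $$

    The product can start at $i = 1$ because $w_0$ and $t_0$ are both the start marker ^, so $P(w_0 \mid t_0) = 1$.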
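Under the bigram and lexical probability assumptions, the score P(T)P(W|T) of one candidate tag sequence is a product of one transition factor and one lexical factor per word. The sketch below assumes the probabilities are held in two dictionaries; the function name and all numeric values are hypothetical, chosen only to show the arithmetic.

```python
def sequence_score(words, tags, transition, lexical, start="^"):
    """Return P(T) * P(W|T) for words and tags after the start marker.

    transition maps (previous_tag, tag) to P(tag | previous_tag);
    lexical maps (word, tag) to P(word | tag). Missing entries count as 0.
    """
    score = 1.0
    prev = start
    for word, tag in zip(words, tags):
        score *= transition.get((prev, tag), 0.0)  # bigram factor
        score *= lexical.get((word, tag), 0.0)     # lexical factor
        prev = tag
    return score


# Hypothetical probabilities, not taken from any corpus.
transition = {("^", "N"): 0.6, ("N", "V"): 0.4, ("V", "."): 0.5}
lexical = {("people", "N"): 0.01, ("jump", "V"): 0.02, (".", "."): 1.0}

score = sequence_score(["people", "jump", "."], ["N", "V", "."], transition, lexical)
print(score)
```

The start marker contributes no factor of its own, since both P(t0) and P(w0|t0) are 1 for ^.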

  • Example

    “People jump high”.

    People : Noun/Verb
    jump : Noun/Verb
    high : Noun/Adjective

    We can start with probabilities.

  • Trellis diagram

    [Figure: trellis for “People jump high”. Columns, left to right: ^ ; People {N, VM} ; jump {N, VM} ; high {N, JJ} ; $.]

    8 POS tag sequences are possible, given these valid tags for each word taken from the dictionary.
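The count of 8 follows from taking one tag per word in every possible combination (2 × 2 × 2). A minimal sketch, assuming the dictionary is stored as a mapping from each word to its list of valid tags (the variable names are illustrative):

```python
from itertools import product

# Valid tags per word, as read off the trellis columns.
candidate_tags = {
    "People": ["N", "VM"],
    "jump": ["N", "VM"],
    "high": ["N", "JJ"],
}
words = ["People", "jump", "high"]

# One path through the trellis = one tag chosen from each column,
# wrapped in the start (^) and end ($) markers.
sequences = [
    ("^",) + combo + ("$",)
    for combo in product(*(candidate_tags[w] for w in words))
]

for seq in sequences:
    print(" ".join(seq))
print(len(sequences))  # 8
```

Enumeration is only practical for short sentences, since the number of paths grows as the product of the column sizes.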

  • Calculation from actual data

    • Corpus
        ^ Ram got many NLP books. He found them all very interesting.
    • POS tagged
        ^ N V A N N . N V N A R A .

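  • Recording numbers

    Bigram counts: rows give the previous tag, columns the next tag. Each sentence is taken to start with ^, so ^ is followed by N twice and the first "." is followed by ^.

          ^    N    V    A    R    .
    ^     0    2    0    0    0    0
    N     0    1    2    1    0    1
    V     0    1    0    1    0    0
    A     0    1    0    0    1    1
    R     0    0    0    1    0    0
    .     1    0    0    0    0    0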

  • Probabilities

    Transition probabilities P(column tag | row tag): each bigram count divided by its row total.

          ^    N     V     A     R     .
    ^     0    1     0     0     0     0
    N     0    1/5   2/5   1/5   0     1/5
    V     0    1/2   0     1/2   0     0
    A     0    1/3   0     0     1/3   1/3
    R     0    0     0     1     0     0
    .     1    0     0     0     0     0
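The count and probability tables can be reproduced from the tagged corpus with a few lines of code. This is a sketch under one stated assumption: each sentence is written with its own leading ^, and the sentences are read as one continuous tag stream, which is what gives a "." followed by ^ entry. The function names are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# The two tagged sentences, each with its own start marker (assumption).
tagged_sentences = [
    ["^", "N", "V", "A", "N", "N", "."],
    ["^", "N", "V", "N", "A", "R", "A", "."],
]


def bigram_counts(sentences):
    """Count (previous tag -> next tag) pairs over the concatenated stream."""
    stream = [tag for sentence in sentences for tag in sentence]
    counts = defaultdict(Counter)
    for prev, nxt in zip(stream, stream[1:]):
        counts[prev][nxt] += 1
    return counts


def transition_probabilities(counts):
    """Divide each count by its row total to get P(next | previous)."""
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {nxt: Fraction(c, total) for nxt, c in row.items()}
    return probs


counts = bigram_counts(tagged_sentences)
probs = transition_probabilities(counts)
for prev in ["^", "N", "V", "A", "R", "."]:
    print(prev, {nxt: str(p) for nxt, p in probs[prev].items()})
```

`Fraction` keeps the values exact, so they can be compared directly with entries such as 2/5 and 1/3 in the table.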

  • Penn tagset (1/2)

  • Penn tagset (2/2)

  • Indian Language Tagset: Noun

  • Indian Language Tagset: Pronoun

  • Indian Language Tagset: Quantifier

  • Indian Language Tagset: Demonstrative

    No.   Category        Label   Annotation convention   Examples
    3     Demonstrative   DM      DM                      Vaha, jo, yaha
    3.1   Deictic         DMD     DM__DMD                 Vaha, yaha
    3.2   Relative        DMR     DM__DMR                 jo, jis
    3.3   Wh-word         DMQ     DM__DMQ                 kis, kaun
          Indefinite      DMI     DM__DMI                 KoI, kis

  • Indian Language Tagset: Verb, Adjective, Adverb

  • Indian Language Tagset: Postposition, conjunction

  • Indian Language Tagset: Particle

  • Indian Language Tagset: Residuals