



  • Slide 1
  • Albert Gatt Corpora and statistical methods
  • Slide 2
  • In this lecture: overview of the rules of probability (multiplication rule, subtraction rule); probability based on prior knowledge (conditional probability, Bayes' theorem).
  • Slide 3
  • Conditional probability and independence (Part 1)
  • Slide 4
  • Prior knowledge. Sometimes, an estimate of the probability of something is affected by what is already known; cf. the many linguistic examples in Jurafsky 2003. Example: part-of-speech tagging. Task: assign a label indicating the grammatical category to every word in a corpus of running text; one of the classic tasks in statistical NLP.
  • Slide 5
  • Part-of-speech tagging example. Statistical POS taggers are first trained on data that has been previously annotated; this yields a language model. Language models vary based on the n-gram window. Unigrams: probabilities based on individual tokens (a lexicon); e.g. for the input the_DET tall_ADJ man_NN, the model represents the probability that the word man is a noun (NB: it could also be a verb). Bigrams: probabilities across a span of 2 words; for the same input, the model represents the probability that a DET is followed by an adjective, that an adjective is followed by a noun, etc. Trigrams, quadrigrams etc. are also possible.
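As a rough illustration of the counts behind such a model, here is a minimal sketch; the tagged sample and the word_TAG format are assumptions made up for this example, not the lecture's actual training data.

```python
from collections import Counter

# Tiny tagged sample in a hypothetical "word_TAG" format (illustrative only).
tagged = "the_DET tall_ADJ man_NN saw_VV the_DET old_ADJ woman_NN".split()
pairs = [tok.rsplit("_", 1) for tok in tagged]   # [('the', 'DET'), ('tall', 'ADJ'), ...]
tags = [tag for _, tag in pairs]

# Unigram model: counts of (word, tag) pairs -- a simple lexicon.
unigrams = Counter((w, t) for w, t in pairs)

# Bigram model over tags: counts of adjacent (tag_i, tag_i+1) pairs.
tag_bigrams = Counter(zip(tags, tags[1:]))

print(unigrams[("the", "DET")])      # 2
print(tag_bigrams[("DET", "ADJ")])   # 2
```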
  • Slide 6
  • POS tagging continued. Suppose we've trained a tagger on annotated data. It has: a lexicon of unigrams, P(the=DET), P(man=NN), etc.; a bigram model, P(DET is followed by ADJ), etc. Assume we've trained it on a large input sample. We now feed it a new phrase: the audacious alien. Our tagger knows that the word the is a DET, but it's never seen the other words. It can: make a wild guess (not very useful!), or estimate the probability that the is followed by an ADJ, and that an ADJ is followed by a NOUN.
  • Slide 7
  • Prior knowledge revisited. Given that I know that the is a DET, what's the probability that the following word audacious is an ADJ? This is very different from asking what's the probability that audacious is an ADJ out of context. We have prior knowledge that DET has occurred, and this can significantly change the estimate of the probability that audacious is an ADJ. We therefore distinguish: prior probability, a naïve estimate based on long-run frequency; posterior probability, a probability estimate based on prior knowledge.
  • Slide 8
  • Conditional probability. In our example, we were estimating: P(ADJ|DET), the probability of ADJ given DET; P(NN|DET), the probability of NN given DET; etc. In general, the conditional probability P(A|B) is the probability that A occurs, given that we know that B has occurred.
  • Slide 9
  • Example continued. If I've just seen a DET, what's the probability that my next word is an ADJ? We need to take into account: occurrences of ADJ in our training data, e.g. VV+ADJ (was beautiful), PP+ADJ (with great concern), DET+ADJ, etc.; and occurrences of DET in our training corpus, e.g. DET+N (the man), DET+V (the loving husband), DET+ADJ (the tall man).
  • Slide 10
  • Venn diagram representation of the bigram training data. [Venn diagram: one set contains cases where w is a DET NOT followed by ADJ (the+man, the+woman, a+road); the other contains cases where w is an ADJ NOT preceded by DET (is+tall, in+terrible, were+nice); the intersection contains cases where a DET is followed by an ADJ (the+tall, a+simple, an+excellent).]
  • Slide 11
  • Estimation of conditional probability. Intuition: P(A|B) is the ratio of the chances that both A and B happen to the chances of B happening alone, i.e. P(A|B) = P(A & B) / P(B). For our example: P(ADJ|DET) = P(DET+ADJ) / P(DET).
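A minimal sketch of this estimate from tag-bigram counts; the counts below are invented for illustration and are not from the lecture's corpus.

```python
from collections import Counter

# Hypothetical tag-bigram counts (invented numbers, for illustration only).
tag_bigrams = Counter({("DET", "ADJ"): 120, ("DET", "NN"): 300,
                       ("VV", "ADJ"): 40, ("PP", "ADJ"): 25})
total = sum(tag_bigrams.values())

# Relative frequencies: P(DET followed by ADJ) and P(DET followed by anything).
p_det_adj = tag_bigrams[("DET", "ADJ")] / total
p_det = sum(c for (t1, _), c in tag_bigrams.items() if t1 == "DET") / total

# Conditional probability: P(ADJ | DET) = P(DET+ADJ) / P(DET).
p_adj_given_det = p_det_adj / p_det
print(round(p_adj_given_det, 3))     # 120 / 420 ≈ 0.286
```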
  • Slide 12
  • Another example. If we throw a die, what's the probability that the number we get is even, given that the number we get is larger than 4? This works out as the probability of getting the number 6: P(even|>4) = P(even & >4) / P(>4) = (1/6) / (2/6) = 0.5. Note the difference from the simple, prior probability: using only frequency, P(6) = 1/6.
  • Slide 13
  • Mind the fallacies! When we speak of prior and posterior, we don't necessarily mean prior in time (e.g. the die example). Monte Carlo fallacy: if 20 turns of the roulette wheel have fallen on black, what are the chances that the next turn will fall on red? In reality, prior experience here makes no difference at all: every turn of the wheel is independent of every other.
  • Slide 14
  • The multiplication rule
  • Slide 15
  • Multiplying probabilities. Often, we're interested in turning the conditional probability estimate around: suppose we know P(A|B) or P(B|A), and we want to calculate P(A AND B). For both A and B to occur, they must occur in some sequence (first A occurs, then B).
  • Slide 16
  • Estimating P(A AND B), the probability that both A and B occur: P(A AND B) = P(A) × P(B|A), i.e. the probability of A happening overall, multiplied by the probability of B happening given that A has happened.
  • Slide 17
  • Multiplication rule: example 1. We have a standard deck of 52 cards. What's the probability of pulling out two aces in a row? (NB: a standard deck has 4 aces.) Let A1 stand for an ace on the first pick and A2 for an ace on the second pick; we're interested in P(A1 AND A2).
  • Slide 18
  • Example 1 continued. P(A1 AND A2) = P(A1) × P(A2|A1). P(A1) = 4/52 (since there are 4 aces in a 52-card pack). If we do pick an ace on the first pick, we diminish the odds of picking a second ace (there are now 3 aces left in a 51-card pack), so P(A2|A1) = 3/51. Overall: P(A1 AND A2) = (4/52) × (3/51) ≈ 0.0045.
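A quick sanity check of that figure, by direct computation and by a small simulation; the simulation is an illustrative sketch, not part of the lecture.

```python
import random

# Direct computation via the multiplication rule.
p_direct = (4 / 52) * (3 / 51)
print(round(p_direct, 4))            # 0.0045

# Monte Carlo check: draw two cards without replacement, many times.
deck = ["A"] * 4 + ["x"] * 48        # 4 aces, 48 other cards
trials = 200_000
hits = sum(random.sample(deck, 2) == ["A", "A"] for _ in range(trials))
print(hits / trials)                 # should hover around 0.0045
```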
  • Slide 19
  • Example 2. We randomly pick two words, w1 and w2, out of a tagged corpus. What are the chances that both words are adjectives? Let ADJ be the set of all adjectives in the corpus (tokens, not types), so |ADJ| is the total number of adjective tokens. A1 = the event of picking out an ADJ on the first try; A2 = the event of picking out an ADJ on the second try. P(A1 AND A2) is estimated in the same way as in the previous example: in the event of A1, the chances of A2 are diminished, and the multiplication rule takes this into account.
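The same sampling-without-replacement logic, sketched with assumed numbers; the corpus size and adjective count are invented purely for illustration.

```python
# Hypothetical corpus: N word tokens in total, of which |ADJ| are adjective tokens.
n_tokens = 100_000
n_adj = 6_000

# P(A1 AND A2) = P(A1) * P(A2 | A1), sampling without replacement.
p_a1 = n_adj / n_tokens
p_a2_given_a1 = (n_adj - 1) / (n_tokens - 1)
print(round(p_a1 * p_a2_given_a1, 5))   # ≈ 0.0036
```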
  • Slide 20
  • Some observations. In these examples, the two events are not independent of each other: the occurrence of one affects the likelihood of the other, e.g. drawing an ace first diminishes the likelihood of drawing a second ace. This is sampling without replacement. If we put the ace back into the pack after we've drawn it, then we have sampling with replacement; in that case, the probability of one event doesn't affect the probability of the other.
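A small sketch of the contrast, reusing the ace example; treating the two draws as independent once the card is put back is the standard consequence of sampling with replacement.

```python
# Two aces in a row, sampling WITHOUT replacement (events not independent).
p_without = (4 / 52) * (3 / 51)      # ≈ 0.0045

# Two aces in a row, sampling WITH replacement (events independent).
p_with = (4 / 52) * (4 / 52)         # ≈ 0.0059

print(round(p_without, 4), round(p_with, 4))
```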
  • Slide 21
  • Extending the multiplication rule. The logic of the A AND B rule is: both conditions, A and B, have to be met; A is met a fraction of the time; B is met a fraction of the times that A is met. This can be extended indefinitely, e.g. the chances of drawing 4 straight aces from a pack: P(A1 & A2 & A3 & A4) = P(A1) × P(A2|A1) × P(A3|A1 & A2) × P(A4|A1 & A2 & A3).
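A minimal sketch that just evaluates this chain for k aces in a row; wrapping it in a function of k is an illustrative generalisation, not something stated on the slide.

```python
from math import prod

def straight_aces(k, aces=4, deck=52):
    # Chain rule: one ace and one card leave the deck after each successful draw.
    return prod((aces - i) / (deck - i) for i in range(k))

print(straight_aces(2))   # ≈ 0.0045  (matches example 1)
print(straight_aces(4))   # ≈ 3.7e-06 (= 4*3*2*1 / (52*51*50*49))
```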
  • Slide 22
  • The subtraction rule
  • Slide 23
  • Extending the addition rule. It's easy to extend the multiplication rule; extending the addition rule isn't so easy, because we need to correct for double-counting events.
  • Slide 24
  • Example: P(A OR B OR C). [Venn diagram of three overlapping sets A, B and C.] Once we've discounted the 2-way intersections of A and B, etc., we need to count the 3-way intersection back in!
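For completeness, the standard three-set identity the diagram illustrates is:

P(A OR B OR C) = P(A) + P(B) + P(C) - P(A & B) - P(A & C) - P(B & C) + P(A & B & C)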
  • Slide 25
  • Subtraction rule. Fundamental underlying observation: for any event A, P(A) = 1 - P(not A). E.g. the probability of getting at least one head in 3 flips of a coin (a three-set addition problem) can be estimated using the observation that: P(at least one head out of 3 flips) = 1 - P(no heads) = 1 - P(3 tails).
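Working the example through numerically (standard coin arithmetic; the final figure is not shown on the slide):

```python
# P(at least one head in 3 fair coin flips) via the subtraction rule.
p_three_tails = (1 / 2) ** 3         # P(no heads) = P(3 tails)
print(1 - p_three_tails)             # 0.875, i.e. 7/8
```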
  • Slide 26
  • Bayes' theorem (Part 4)
  • Slide 27
  • Switching conditional probabilities. Problem 1: we know the probability that a test will come out positive in case a person has a disease; we want to know the probability that there is indeed a disease, given that the test says positive (useful for finding false positives). Problem 2: we know the probability P(ADJ|DET) that some word w2 is an ADJ, given that the previous word w1 is a DET. We find a new word w; we don't know its category, but it might be a DET, and we do know that the following word is an ADJ. We would therefore like to know the reverse, i.e. P(DET|ADJ).
  • Slide 28
  • Deriving Bayes' rule from the multiplication rule. Given the symmetry of intersection, the multiplication rule can be written in two ways: P(A & B) = P(A|B) × P(B) and P(A & B) = P(B|A) × P(A). Bayes' rule involves substituting one equation into the other, to replace P(A & B).
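Carrying out the substitution in the standard way gives:

P(A|B) × P(B) = P(B|A) × P(A), and therefore P(B|A) = P(A|B) × P(B) / P(A).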
  • Slide 29
  • Deriving P(A). Often, it's not clear where P(A) should come from, since we start out from conditional probabilities! Given that we have two sets of outcomes of interest, A and B, P(A) can be derived from the following observation: A is made up of (A & not B) together with (A & B), i.e. the events in A are those which are only in A (but not in B) and those which are in both A and B.
  • Slide 30
  • Finding P(A) -- I. [Venn diagram of sets A and B.] Any event in A must be in one part or the other (A without B, or both A and B), since A is composed of these two sets.
  • Slide 31
  • Finding P(A) -- II. Step 1: applying the addition rule to the two disjoint parts of A. Step 2: substituting the result into Bayes' equation to replace P(A).
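A standard way of writing these two steps, consistent with the slide's wording:

Step 1: P(A) = P(A & B) + P(A & not B) = P(A|B) × P(B) + P(A|not B) × P(not B)

Step 2: P(B|A) = P(A|B) × P(B) / [ P(A|B) × P(B) + P(A|not B) × P(not B) ]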
  • Slide 32
  • Summary. This ends our first foray into the rules of probability: the addition rule; the subtraction and multiplication rules; conditional probability; Bayes' theorem.
  • Slide 33
  • Next up: probability distributions; random variables; basic information theory.