Persian Part Of Speech Tagging


Persian Part Of Speech Tagging

Mostafa Keikha

Database Research Group (DBRG)

ECE Department, University of Tehran


Decision Trees

Decision Tree (DT): a tree in which the root and each internal node are labeled with a question. The arcs out of a node represent the possible answers to its question, and each leaf node represents a predicted solution to the problem. Decision trees are a popular technique for classification: the leaf node reached indicates the class to which the corresponding tuple belongs.


Decision Tree Example
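The example figure from this slide is not preserved in the transcript. As a stand-in, here is a minimal, hypothetical tree of the kind just described, written as nested Python dicts; the questions and classes are invented for illustration (the "-ha" test alludes to the Persian plural suffix):

```python
# A hypothetical decision tree: internal nodes hold a question and one
# subtree per possible answer; leaves hold a predicted class.
tree = {
    "question": "does the word end in '-ha'?",           # Persian plural suffix
    "yes": {"class": "NOUN"},                            # leaf: predict NOUN
    "no": {
        "question": "is the previous tag a preposition?",
        "yes": {"class": "NOUN"},
        "no": {"class": "VERB"},
    },
}
```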


Decision Trees

A decision tree model is a computational model consisting of three parts:

- the tree itself,
- an algorithm to create the tree, and
- an algorithm that applies the tree to data.

Creating the tree is the most difficult part. Applying it is basically a search similar to that in a binary search tree (although a DT may not be binary).


Decision Tree Algorithm
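The algorithm figure on this slide is also not preserved. Given the representation sketched above, applying a tree is the root-to-leaf search described on the previous slide; a minimal sketch, assuming the answers are supplied in a feature dictionary keyed by question:

```python
def classify(node, features):
    """Walk the tree from the root: ask each node's question,
    follow the arc for the answer, stop at a leaf."""
    while "class" not in node:               # still at an internal node
        answer = features[node["question"]]  # e.g. "yes" / "no"
        node = node[answer]                  # follow the matching arc
    return node["class"]

# Usage with the hypothetical tree above:
# classify(tree, {"does the word end in '-ha'?": "no",
#                 "is the previous tag a preposition?": "yes"})  # -> "NOUN"
```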


Using DT in POS Tagging

Compute ambiguity classes:

- Each term may occur with different tags.
- The ambiguity class of a term is the set of all its possible tags.
- Count the number of occurrences of each tag within each ambiguity class (a code sketch follows the table below).

Ambiguity class    # of occurrences
a b c d            10  20  25  40
b c d              40  39  50
b d                60  55
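A minimal sketch of this computation, assuming the training data is available as a list of (word, tag) pairs; all identifiers are illustrative:

```python
from collections import defaultdict

def ambiguity_classes(tagged_corpus):
    """tagged_corpus: list of (word, tag) pairs from a tagged corpus.
    Returns {ambiguity_class: {tag: count}}, where a word's ambiguity
    class is the frozenset of all tags it was ever seen with."""
    word_tags = defaultdict(set)                  # word -> tags seen with it
    for word, tag in tagged_corpus:
        word_tags[word].add(tag)
    counts = defaultdict(lambda: defaultdict(int))
    for word, tag in tagged_corpus:               # tally tags inside each class
        counts[frozenset(word_tags[word])][tag] += 1
    return counts
```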


Using DT in POS Tagging

Create a decision tree over the ambiguity classes: at each level, delete the tag with the minimum number of occurrences, until a single tag remains.

a b c d    10 20 25 40
    ↓ delete a (minimum)
b c d      40 39 50
    ↓ delete c (minimum)
b d        60 55
    ↓ delete d (minimum)
b
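A minimal sketch of this pruning, with one simplification: it keeps a single fixed count per tag, whereas the slide re-tallies the counts at each level (b's count grows from 20 to 40 to 60):

```python
def prune_to_best_tag(tag_counts):
    """tag_counts: {tag: occurrences} for one ambiguity class.
    At each level, delete the tag with the minimum count; the classes
    produced on the way down form one path of the decision tree."""
    counts = dict(tag_counts)
    levels = [set(counts)]
    while len(counts) > 1:
        weakest = min(counts, key=counts.get)   # minimum-occurrence tag
        del counts[weakest]
        levels.append(set(counts))
    return levels   # the last element is the single surviving tag
```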


Using DT in POS Tagging

Advantages:
- Easy to understand
- Easy to implement

Disadvantage:
- Context independent


Using DT in POS Tagging

Known Tokens Results

Run      % of all tokens   Tokens     Correct    Accuracy
1        97.97             393923     363764     92.34%
2        98.06             355630     328965     92.50%
3        97.96             397528     367789     92.51%
4        97.92             410561     381578     92.94%
5        97.97             403079     372305     92.36%
Average  97.976            392144.2   362880.2   92.474%


POS tagging using HMMs

Let W be a sequence of words: W = w1 w2 … wn.

Let T be the corresponding tag sequence: T = t1 t2 … tn.

Task: find the T that maximizes P(T | W):

T' = argmaxT P(T | W)


POS tagging using HMMs

By Bayes' rule,

P(T | W) = P(W | T) * P(T) / P(W)

and since P(W) does not depend on the choice of T,

T' = argmaxT P(W | T) * P(T)

Transition probability:

P(T) = P(t1) * P(t2 | t1) * P(t3 | t1 t2) * … * P(tn | t1 … tn-1)

Applying the trigram approximation:

P(T) ≈ P(t1) * P(t2 | t1) * P(t3 | t1 t2) * … * P(tn | tn-2 tn-1)

Introducing a dummy tag, $, to represent the beginning of a sentence:

P(T) ≈ P(t1 | $) * P(t2 | $ t1) * P(t3 | t1 t2) * … * P(tn | tn-2 tn-1)
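As a concrete sketch, P(T) can be computed from a table of trigram probabilities; log space avoids numerical underflow. The table and its layout are assumptions for illustration, not the authors' code:

```python
import math

def tag_sequence_logprob(tags, trigram_prob):
    """log P(T) under the trigram approximation; trigram_prob maps
    (t_{i-2}, t_{i-1}, t_i) -> probability. The sentence start is
    padded with the dummy tag $ so every factor is a trigram."""
    padded = ["$", "$"] + list(tags)
    logp = 0.0
    for i in range(2, len(padded)):
        logp += math.log(trigram_prob[(padded[i - 2], padded[i - 1], padded[i])])
    return logp   # exponentiate to recover P(T) itself
```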


POS tagging using HMMs

Smoothing Transition Probabilities

Trigram counts suffer from the sparse-data problem, so the transition probabilities are smoothed with the linear interpolation method:

P'(ti | ti-2 ti-1) = λ1 P(ti) + λ2 P(ti | ti-1) + λ3 P(ti | ti-2 ti-1)

such that the λs sum to 1.
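A direct transcription of this formula into code; the probability-table names are illustrative, and missing entries default to 0:

```python
def smoothed_transition(t, t1, t2, unigram, bigram, trigram, lambdas):
    """P'(t | t2 t1) = λ1·P(t) + λ2·P(t | t1) + λ3·P(t | t2 t1),
    where t2 = t_{i-2}, t1 = t_{i-1}, and λ1 + λ2 + λ3 = 1.
    The tables hold relative frequencies from the training corpus."""
    l1, l2, l3 = lambdas
    return (l1 * unigram.get(t, 0.0)
            + l2 * bigram.get((t1, t), 0.0)
            + l3 * trigram.get((t2, t1, t), 0.0))
```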


POS tagging using HMMs

Calculation of λs
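The slide's calculation is not preserved in this transcript. A standard way to estimate these λs, sketched below as an assumption rather than the authors' documented method, is the deleted-interpolation algorithm of Brants (2000):

```python
def deleted_interpolation(uni, bi, tri, N):
    """Estimate (λ1, λ2, λ3) by deleted interpolation (Brants, 2000).
    uni/bi/tri: tag n-gram frequency dicts; N: total number of tag
    tokens in the training corpus."""
    l1 = l2 = l3 = 0.0
    for (t1, t2, t3), f in tri.items():
        # each estimate, with this trigram's own counts removed
        c3 = (f - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0
        c2 = (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0
        c1 = (uni[t3] - 1) / (N - 1)
        best = max(((c1, 1), (c2, 2), (c3, 3)))[1]  # largest surviving estimate
        if best == 1:
            l1 += f
        elif best == 2:
            l2 += f
        else:
            l3 += f
    s = l1 + l2 + l3
    return l1 / s, l2 / s, l3 / s   # normalize so the λs sum to 1
```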


POS tagging using HMMs

Emission probability:

P(W | T) ≈ P(w1 | t1) * P(w2 | t2) * … * P(wn | tn)

Context dependency: to make the model more dependent on the context, the emission probability is instead calculated as:

P(W | T) ≈ P(w1 | $ t1) * P(w2 | t1 t2) * … * P(wn | tn-1 tn)


POS tagging using HMMs

A smoothing technique is applied to the emission probabilities as well:

P'(wi | ti-1 ti) = θ1 P(wi | ti) + θ2 P(wi | ti-1 ti)

where the θs sum to 1 and are different for different words.
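In code, mirroring the transition smoothing above; the per-word lookup of θs is an assumption about how "different for different words" is realized:

```python
def smoothed_emission(w, t_prev, t, uni_emit, bi_emit, thetas):
    """P'(w | t_prev t) = θ1·P(w | t) + θ2·P(w | t_prev t),
    with θ1 + θ2 = 1 and the θ pair chosen per word."""
    th1, th2 = thetas[w]              # word-specific interpolation weights
    return (th1 * uni_emit.get((w, t), 0.0)
            + th2 * bi_emit.get((w, t_prev, t), 0.0))
```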

[Several slides of derivation, equations (1)–(6) and accompanying figures, are not preserved in this transcript.]

POS tagging using HMMs

Lexicon generation probability


POS tagging using HMMs

P(N V ART N | flies like a flower) = 4.37 × 10⁻⁶
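The probability tables behind this figure are not included in the transcript, so the value 4.37 × 10⁻⁶ cannot be reproduced here. The sketch below shows only the shape of the computation for one candidate tag sequence, using bigram transitions for brevity and hypothetical tables:

```python
def score(words, tags, trans, emit):
    """P(T) * P(W | T) for one candidate tag sequence, with bigram
    transitions trans[(t_prev, t)] and simple emissions emit[(w, t)]."""
    p = 1.0
    prev = "$"                       # dummy start-of-sentence tag
    for w, t in zip(words, tags):
        p *= trans[(prev, t)] * emit[(w, t)]
        prev = t
    return p

# score("flies like a flower".split(), ["N", "V", "ART", "N"], trans, emit)
```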


POS tagging using HMMs

Known Tokens Results

Run      % of all tokens   Tokens     Correct    Accuracy
1        98.07             394290     382211     96.94%
2        98.16             355969     345913     97.18%
3        98.04             397849     385737     96.96%
4        98.02             410970     398487     96.96%
5        98.07             403460     391475     97.03%
Average  98.072            392507.6   380764.6   97.01%


Unknown Tokens Results

Run      % of all tokens   Tokens    Correct   Accuracy
1        1.93              7760      5829      75.12%
2        1.84              6689      5357      80.09%
3        1.96              7956      6153      77.34%
4        1.98              8283      6435      77.69%
5        1.93              7945      6246      78.62%
Average  1.928             7726.6    6004.0    77.77%


Overall Results

Run      Tokens     Correct    Accuracy
1        402050     388040     96.52%
2        362658     351270     96.86%
3        405805     391890     96.57%
4        419253     404922     96.58%
5        411405     397721     96.67%
Average  400234.2   386768.6   96.64%
