
On the Ambiguity of Serbian Texts and Methods to disambiguate it


Page 1: On the Ambiguity of Serbian Texts and Methods to disambiguate it

On the Ambiguity of Serbian Texts and Methods to disambiguate it

Cvetana Krstev, Duško Vitas,

University of Belgrade

8th Intex/Nooj Workshop

Page 2: On the Ambiguity of Serbian Texts and Methods to disambiguate it

What is the ambiguity?

• the assignment of different lemmas

• the assignment of different grammatical categories

Page 3: On the Ambiguity of Serbian Texts and Methods to disambiguate it

The ambiguity in Serbian

In Serbian, many word forms are homographs although not homophones, because stress marks are not recorded in writing. The written form "gore", for example:

gőre   adv.                   up
gőrē   adv.                   worse
gòrē   P3s   goreti,V+Ek      to burn
gòre   A3s   goreti,V+Ek      to burn
gòrē   P3s   gorjeti,V+Ijk    to burn
gòre   A3s   gorjeti,V+Ijk    to burn
gòre   fs2   gora             forest

Accent marks:

        short   long
up      ő       ô
down    ò       ó

Page 4: On the Ambiguity of Serbian Texts and Methods to disambiguate it

The ambiguity in Serbian (2)

rodoslovna,rodoslovni.A2+PosQ:akms2g:akms4v:aefs1g:aefs5g:akns2g:aenp1g:aenp4g:aenp5g

rodoslovne,rodoslovni.A2+PosQ:aemp4g:aefs2g:aefp1g:aefp4g:aefp5g

rodoslovni,rodoslovni.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g

rodoslovnih,rodoslovni.A2+PosQ:aemp2g:aefp2g:aenp2g

rodoslovnim,rodoslovni.A2+PosQ:aems6g:aemp3g:aemp6g:aemp7g:aefp3g:aefp6g:aefp7g:aens6g:aenp3g:aenp6g:aenp7g

rodoslovnima,rodoslovni.A2+PosQ:aemp3g:aemp6g:aemp7g:aefp3g:aefp6g:aefp7g:aenp3g:aenp6g:aenp7g

rodoslovno,rodoslovni.A2+PosQ:aens1g:aens4g:aens5g

rodoslovnog,rodoslovni.A2+PosQ:adms2g:adms4v:adns2g

rodoslovnoga,rodoslovni.A2+PosQ:adms2g:adms4v:adns2g

rodoslovnoj,rodoslovni.A2+PosQ:aefs3g:aefs7g

rodoslovnom,rodoslovni.A2+PosQ:adms3g:adms7g:aefs6g:adns3g:adns7g

← 9 sets of grammatical categories (rodoslovnima)

e: the form is the same for definite and indefinite

g: the form is the same for animate and inanimate
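The entries above follow the pattern form,lemma.CODE+markers:categories. As a minimal sketch, such a line can be split into its parts and its category sets counted. The function name parse_entry and the returned layout are illustrative assumptions, not part of Intex/NooJ.

```python
def parse_entry(line):
    """Split one 'form,lemma.CODE+marker...:cat1:cat2...' dictionary line.

    Hypothetical helper for illustration; it assumes that the form and the
    lemma contain no comma, period, or colon.
    """
    form, rest = line.split(",", 1)
    lemma, rest = rest.split(".", 1)
    info, *categories = rest.split(":")
    code, *markers = info.split("+")
    return {
        "form": form,
        "lemma": lemma or form,  # an empty lemma is taken to mean "same as the form"
        "code": code,
        "markers": markers,
        "categories": categories,
    }


entry = parse_entry(
    "rodoslovnima,rodoslovni.A2+PosQ:aemp3g:aemp6g:aemp7g"
    ":aefp3g:aefp6g:aefp7g:aenp3g:aenp6g:aenp7g"
)
# Lemma, inflection code, markers, and number of category sets.
print(entry["lemma"], entry["code"], entry["markers"], len(entry["categories"]))
```

For this entry the count is 9, so a single lemma still leaves nine grammatical readings to choose from.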

Page 5: On the Ambiguity of Serbian Texts and Methods to disambiguate it


Disambiguation process

• Reconstructing word forms

• Using filter dictionaries

• Using restricted dictionaries

• Using dictionaries of compounds

• Using disambiguation grammars

Page 6: On the Ambiguity of Serbian Texts and Methods to disambiguate it


Reconstructing word forms – date adverbial phrases

Page 7: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Reconstructing word forms – date adverbial phrases (2)

i izdavanxem YUBA kartica 20. februara 2002. godine.

celog sistema. Zato je josx pocyetkom 1996. godine jedan

i www.plivamed.net. U petom mjesecu 2001.godine smo oformlx

cxe biti odrzxan u novembru ove godine u Neumu, a za prvog
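One way to picture this step: a date phrase is recognized as a single unit, so its parts are no longer looked up one by one. The sketch below uses a Python regular expression as a stand-in for the actual grammar, which is not reproduced here. The pattern, the month list, and the name DATE_PHRASE are assumptions for illustration, and they cover only the "day. month year. godine" shape.

```python
import re

# Genitive forms of the month names (assumed list, ASCII spellings only).
MONTHS = (
    "januara|februara|marta|aprila|maja|juna|jula|avgusta|"
    "septembra|oktobra|novembra|decembra"
)

# Hypothetical pattern for phrases such as "20. februara 2002. godine".
DATE_PHRASE = re.compile(
    r"\b\d{1,2}\.\s*(?:" + MONTHS + r")\s+\d{4}\.\s*godine\b"
)

line = "i izdavanxem YUBA kartica 20. februara 2002. godine."
match = DATE_PHRASE.search(line)
# Print the recognized phrase, or None when the pattern does not apply.
print(match.group(0) if match else None)
```

Phrases of other shapes, such as "u novembru ove godine", would need additional patterns.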

          Simple forms   Assoc. lemmas   ratio   Lemmas + categ.   ratio
Before    54196          86915           1.60    174079            3.21
After     54126          86768           1.60    173727            3.21
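The two ratio columns appear to be the number of associated lemmas, and of lemma plus category readings, divided by the number of simple forms. This reading is inferred from the figures rather than stated on the slide. A quick check, with a hypothetical function name:

```python
def ambiguity_ratios(simple_forms, assoc_lemmas, lemmas_with_categories):
    """Return (lemmas per form, lemma+category readings per form), rounded to 2 places."""
    return (
        round(assoc_lemmas / simple_forms, 2),
        round(lemmas_with_categories / simple_forms, 2),
    )


# Figures from the two rows of the table for this step.
before = ambiguity_ratios(54196, 86915, 174079)
after = ambiguity_ratios(54126, 86768, 173727)
print(before, after)
```

Both rows round to 1.60 and 3.21, which matches the reported values.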

Page 8: On the Ambiguity of Serbian Texts and Methods to disambiguate it


Reconstructing word forms – forms written with digits, etc.

Page 9: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Reconstructing word forms – forms written with digits (2)

sxkovi iznosili oko 500 hilxada maraka. Znacyajna usxteda

poput SAP-ovog ili IBM-ovog, dobijate i organizaciju firme

cyelicyne industrije 1890-ih nije postojao. Ali, poznata je

sveta drma tezxinom od 81,7 milijardi dolara u 160 zemalxa,

odnosno ukupno bezmalo pola milijarde (464 miliona)! Predxe

          Simple forms   Assoc. lemmas   ratio   Lemmas + categ.   ratio
Before    54126          86768           1.60    173727            3.21
After     54064          86507           1.60    173693            3.21

Page 10: On the Ambiguity of Serbian Texts and Methods to disambiguate it


Using filter dictionaries

mi,ja.PRO01+Prs:sx3i

mi,mi.PRO03+Prs:px1r

mi,miti.V35+Imperf+Tr+Iref+Ref:Ays:Azs

li,li.PAR

li,liti.V98+Imperf+Tr+It+Iref:Ays:Azs

Page 11: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Using filter dictionaries (2)

A very cautious filter dictionary with only 41 entries:

          Simple forms   Assoc. lemmas   ratio   Lemmas + categ.   ratio
Before    54064          86507           1.60    173693            3.21
After     53858          81607           1.52    166908            3.10
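The effect of a filter dictionary can be sketched as follows. The sketch assumes (the slides do not say so) that the filter lists, for a few very frequent forms, the only readings to keep, and that for mi and li these are the pronoun and particle readings rather than the verbs miti and liti. The names FILTER and apply_filter are illustrative.

```python
# Readings from the general dictionary (codes abbreviated from the entries for mi and li).
READINGS = {
    "mi": ["ja.PRO01", "mi.PRO03", "miti.V35"],
    "li": ["li.PAR", "liti.V98"],
}

# Hypothetical filter dictionary: for each listed form, the readings to keep.
# Which readings the real 41-entry filter keeps is an assumption here.
FILTER = {
    "mi": {"ja.PRO01", "mi.PRO03"},
    "li": {"li.PAR"},
}


def apply_filter(form, readings):
    """Keep only the filtered readings for listed forms; leave other forms untouched."""
    allowed = FILTER.get(form)
    if allowed is None:
        return list(readings)
    return [r for r in readings if r in allowed]


print(apply_filter("mi", READINGS["mi"]))
print(apply_filter("li", READINGS["li"]))
```

Because only listed forms are touched, a short filter can remove many readings when the listed forms are very frequent in running text.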

Page 12: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Using restricted dictionaries

• Dictionaries contain lemmas for both standard pronunciations, Ekavian and Ijekavian. A text, however, is usually written in only one of them.

• Dictionaries contain lemmas for both the Serbian and the Croatian language (or variants of Serbo-Croatian).

Page 13: On the Ambiguity of Serbian Texts and Methods to disambiguate it


Using restricted dictionaries (2)

crvene,crven.A17+Col:aemp4g:aefs2g:aefp1g:aefp4g:aefp5g

crvene,crveneti.V547+Imperf+It+Iref+Ref+Ek:Pzp:Ays:Azs

crvene,crveniti.V54+Imperf+Tr+Iref:Pzp

crvene,crvenxeti.V747+Imperf+It+Iref+Ref+Ijk:Pzp

          Simple forms   Assoc. lemmas   ratio   Lemmas + categ.   ratio
Before    53858          81607           1.52    166908            3.10
After     53809          80890           1.50    165546            3.08
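For an Ekavian text, restricting the dictionary amounts to discarding entries marked +Ijk, and the reverse for an Ijekavian text. A minimal sketch, assuming entries are plain strings in the format shown for crvene and that the pronunciation marker sits among the "+" markers before the first colon. The helper restrict is hypothetical.

```python
ENTRIES = [
    "crvene,crven.A17+Col:aemp4g:aefs2g:aefp1g:aefp4g:aefp5g",
    "crvene,crveneti.V547+Imperf+It+Iref+Ref+Ek:Pzp:Ays:Azs",
    "crvene,crveniti.V54+Imperf+Tr+Iref:Pzp",
    "crvene,crvenxeti.V747+Imperf+It+Iref+Ref+Ijk:Pzp",
]


def restrict(entries, pronunciation):
    """Drop entries marked for the other pronunciation ('Ek' or 'Ijk')."""
    other = "Ijk" if pronunciation == "Ek" else "Ek"
    kept = []
    for entry in entries:
        # Markers are the '+'-separated items before the first ':'.
        markers = entry.split(":", 1)[0].split("+")[1:]
        if other not in markers:
            kept.append(entry)
    return kept


for entry in restrict(ENTRIES, "Ek"):
    print(entry)
```

Entries without either marker, such as the adjective crven, are kept for both pronunciations.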

Page 14: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Using a dictionary of compounds

bez obzira na,bez obzira na.PREP+C+Ncn+p4

bez,bez.PREP+p2

na,na.INT

na,na.PREP+p4+p7

obzira,obzir.N1:ms2q:mp2q

obzira,obzirati.V519+Imperf+It+Ref:Ays:Azs

          Simple forms   Assoc. lemmas   ratio   Lemmas + categ.   ratio
Before    53809          80890           1.50    165546            3.08
After     48698          72597           1.49    147714            3.03
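Recognizing a compound such as bez obzira na as one unit removes the readings of its parts. The sketch below uses greedy longest-match lookup, which is an assumed strategy for illustration and not a description of how Intex/NooJ applies compound dictionaries. All names are hypothetical.

```python
COMPOUNDS = {("bez", "obzira", "na"): "bez obzira na.PREP+C+Ncn+p4"}
SIMPLE = {
    "bez": ["bez.PREP+p2"],
    "na": ["na.INT", "na.PREP+p4+p7"],
    "obzira": ["obzir.N1:ms2q:mp2q", "obzirati.V519+Imperf+It+Ref:Ays:Azs"],
}
MAX_LEN = max(len(key) for key in COMPOUNDS)


def annotate(tokens):
    """Return (unit, readings) pairs, preferring the longest compound at each position."""
    result, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in COMPOUNDS:
                result.append((" ".join(key), [COMPOUNDS[key]]))
                i += n
                break
        else:
            # No compound starts here: fall back to the simple-word readings.
            result.append((tokens[i], SIMPLE.get(tokens[i], [])))
            i += 1
    return result


annotated = annotate(["bez", "obzira", "na"])
print(annotated)
```

Looked up word by word, the three tokens carry five dictionary entries between them; as a compound they carry one.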

Page 15: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Using disambiguation grammars – positional constraint

An ambiguous form is an interjection if it is followed by an exclamation mark.

Page 16: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Using disambiguation grammars – positional constraint (2)

After a sentence or phrase boundary, “mi” and “ti” are personal pronouns in the nominative case (once the other possibilities have been excluded).

Page 17: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Using disambiguation grammars – sequential constraint

“da” is a conjunction (and not a form of the verb dati, “to give”) if it is followed by an auxiliary verb in clitic form.

Page 18: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Using disambiguation grammars – sequential and positional constraints

sxargarepe evropska unija ne samo da je prihvatila nasxu i

da,.CONJ

da,.ADV

da,.INT

da,.PAR

da,dati.V103+Perf+Tr+Iref+Ref:Pzs:Ays:Azs

          Forms   Assoc. lemmas   ratio   Lemmas + categ.   ratio
Before    48698   72595           1.49    147714            3.03
After     48698   71809           1.47    146491            3.01
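The sequential constraint for da can be sketched as a rule over adjacent tokens. The clitic list below is an assumption supplied for illustration, written in the same ASCII transcription as the corpus lines (cx for ć, sx for š). The actual grammar is not reproduced here.

```python
# Assumed list of auxiliary clitics (ASCII transcription: cx = ć, sx = š).
AUX_CLITICS = {
    "sam", "si", "je", "smo", "ste", "su",
    "cxu", "cxesx", "cxe", "cxemo", "cxete",
    "bih", "bi", "bismo", "biste",
}

# Readings of "da", abbreviated from the dictionary entries.
DA_READINGS = ["da.CONJ", "da.ADV", "da.INT", "da.PAR", "dati.V103"]


def disambiguate_da(tokens, i):
    """Readings for tokens[i] == 'da': conjunction only if an auxiliary clitic follows."""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    if nxt in AUX_CLITICS:
        return ["da.CONJ"]
    return list(DA_READINGS)  # the rule does not apply: keep every reading


tokens = "evropska unija ne samo da je prihvatila nasxu".split()
print(disambiguate_da(tokens, tokens.index("da")))
```

When no clitic follows, the sketch deliberately leaves all readings in place and does not guess.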

Page 19: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Using disambiguation grammars – agreement

An adjective, possessive pronoun or numeral has to agree in gender, number, and case with the noun that follows.

Page 20: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Using disambiguation grammars – agreement (2)

povecxati nxegov proboj u regionu. Rumunska proporcija

u,.PREP+p2

u,.PREP+p4

u,.PREP+p7

regionu,region.N1:ms3q

regionu,region.N1:ms7q

          Forms   Assoc. lemmas   ratio   Lemmas + categ.   ratio
Before    48698   71809           1.47    146491            3.01
After     48698   66284           1.36    129167            2.65
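In this example the preposition u can govern case 2, 4 or 7, while regionu is case 3 or 7, so only case 7 is compatible with both. A sketch of that intersection, assuming the case is the digit after "+p" in the preposition entry and the third character of the noun's category code. The function names are hypothetical.

```python
PREP_READINGS = ["u.PREP+p2", "u.PREP+p4", "u.PREP+p7"]
NOUN_READINGS = ["region.N1:ms3q", "region.N1:ms7q"]


def prep_case(reading):
    """Case governed by a preposition reading such as 'u.PREP+p7'."""
    return reading.rsplit("+p", 1)[1]


def noun_case(reading):
    """Case of a noun reading such as 'region.N1:ms7q' (third character of the code)."""
    return reading.split(":", 1)[1][2]


def compatible_readings(preps, nouns):
    """Keep only the readings whose case is shared by preposition and noun."""
    shared = {prep_case(p) for p in preps} & {noun_case(n) for n in nouns}
    return (
        [p for p in preps if prep_case(p) in shared],
        [n for n in nouns if noun_case(n) in shared],
    )


print(compatible_readings(PREP_READINGS, NOUN_READINGS))
```

Five readings across the two words reduce to two, one per word.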

Page 21: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Using disambiguation grammars – agreement of personal names

Special rules for the agreement of first name and surname

Page 22: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Using disambiguation grammars – agreement of personal names (2)

raspalio je Mladxan Dinkicx sxakom o okrugli sto "Platne kartice -

Mladxan,Mladxan.N1002+Hum+NProp+First+SR:ms1v

Mladxan,mladxan.A7:akms1g:akms4q

Dinkicx,Dinkicx.N28+NProp+Hum+Last+SR:ms1v

          Forms   Assoc. lemmas   ratio   Lemmas + categ.   ratio
Before    48698   66284           1.36    129167            2.65
After     48698   66255           1.36    129101            2.65

Page 23: On the Ambiguity of Serbian Texts and Methods to disambiguate it

The order of grammar application

← Apply first

Apply second →

Page 24: On the Ambiguity of Serbian Texts and Methods to disambiguate it

Careful construction of grammars

Syntactic ambiguity:

Zalagacxu se da ti trosxkovi budu minimalni.

I will do my best to minimize these expenses.

I will do my best to minimize your expenses.

Although some cases are much more frequent...

Kličke je bio voljan da da automobil.

Kličke was willing to give the car.

Mislio sam da ti tvoja gospođa ne da da je viđaš.

I thought that your missus does not let you see her.

Page 25: On the Ambiguity of Serbian Texts and Methods to disambiguate it


Thank you!