KeNet: A COMPREHENSIVE TURKISH WORDNET AND SOME APPLICATIONS Razieh Ehsani May 7, 2018 Razieh Ehsani KeNet: A COMPREHENSIVE TURKISH WORDNET AND SOME APPLICATIONS May 7, 2018 1 / 69


KeNet: A COMPREHENSIVE TURKISH WORDNET AND SOME APPLICATIONS

Razieh Ehsani

May 7, 2018


Introduction


• Constructing a WordNet for Turkish Using Manual and Automatic Annotation
    • ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 2018
    • 25th Signal Processing and Communications Applications Conference (SIU), 2017
• Clustering texts using WordNet relations
    • To be submitted

Turkish WordNet Construction

• Introduction
• WordNets
• Manual WordNet construction
• Lexical resources
• Processing the dictionary
• Synonym candidates
• Manual annotation
• Inter-annotator agreement
• Synset construction
• Synset statistics
• Semantic relations
• Automatic WordNet construction
• Comparison of Synsets

WordNet

• A WordNet is a graph data structure
• Nodes are word senses with their associated lemmas
• Edges are semantic relations between sense pairs
• $(w_5^2, w_7^3, r_1)$
    • The superscript in $w_5^2$ marks the second meaning of $w_5$; such a meaning is called a sense
    • $r_1$ is the semantic relation between $w_5^2$ and $w_7^3$
• $r$ can be one of these relations: synonym, hypernym, hyponym, antonym, meronym, ...
• The direction of the relation is implicit in the ordering of the elements of the triple.
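The triple representation can be sketched in code. This is a minimal illustration, not KeNet's actual storage format: the `Sense` tuple, the `WordNetGraph` class, and the example senses and sense numbers are all hypothetical.

```python
from collections import defaultdict, namedtuple

# A node: a lemma together with the index of one of its senses (hypothetical layout).
Sense = namedtuple("Sense", ["lemma", "sense_no"])


class WordNetGraph:
    """Directed, labeled graph built from (source, target, relation) triples."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_relation(self, source, target, relation):
        # The order (source, target) encodes the direction of the relation.
        self._edges[source].add((relation, target))

    def related(self, source, relation):
        # All senses reachable from `source` through one edge labeled `relation`.
        return {t for r, t in self._edges.get(source, set()) if r == relation}


graph = WordNetGraph()
fish = Sense("fish", 1)
trout = Sense("trout", 1)
graph.add_relation(trout, fish, "hypernym")  # fish is a hypernym of trout
graph.add_relation(fish, trout, "hyponym")   # trout is a hyponym of fish
```

Storing both directions explicitly is one possible design; the inverse edge could instead be derived on demand.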

WordNets

• PWN, bottom-up, 117 000 synsets
• EWN, top-down
• BalkaNet, top-down
• Arabic WordNet, top-down
• Finnish WordNet, top-down
• Polish WordNet, top-down

And the newborn WordNet: KeNet, bottom-up

BalkaNet & KeNet

POS tag      # of synsets in KeNet   # of synsets in BalkaNet
Noun         66 266                  10 370
Verb         25 170                  2 359
Adjective    12 932                  770
Adverb       2 587                   40
Other        6 262                   -
Total        113 217                 13 499

Table: POS tag distribution of KeNet and BalkaNet

Lexical resources

Contemporary Dictionary of Turkish (CDT)
• domain
    • 40 domains
• default POS
• origin
    • 14,400: Arabic (6,044), French (4,920), Persian (1,855), Italian (606), English (458), and Greek (382)
• pronunciation
• context
    • argot, mockery, ...
• senses

Lexical resources

Field     Possible values
domain    anatomy, anthropology, military, computer science, botany, biology, geography, maritime, grammar, linguistics, theology, literature, economics, pedagogy, philosophy, physics, physiology, geometry, astronomy, zoology, law, geology, chemistry, mining, logic, mathematics, meteorology, architecture, mineralogy, music, psychology, cinema, sports, history, technical, commerce, theater, sociology, TV, medicine
POS       verb, auxiliary verb, conjunction, postposition, common noun, adjective, pronoun, adverb, proper noun, exclamation
context   mocking, argot, old usage, insult, popular, vulgar, metaphor, familiar, joking

Table: Field values for CDT

Problems with the resource: Sense granularity

Sense inflation of the word “yüz” (hundred)
(i) The name of the number after ninety-nine.
(ii) The name of the numerals 100 and C that denote this number.
(iii) Ten times ten, one more than ninety-nine.
(iv) A word that, when used together with “times” and “fold”, exaggeratedly expresses the multitude of something done.

Problems with the resource: Productive derivations

Trivial derived nodes
• Comprehensiveness rather than terseness?
• 5400 have only the obvious nominal sense
    “sor-ma” (the act of asking)
    “sor-dur” (to cause (someone) to ask)
    “sor-ul” (to be asked)

Problems with the resource: Homonyms vs. Senses

• Homonyms vs. Senses
• Maximum 5 homonyms


Processing the Dictionary


Processing the Dictionary: Synonym candidates

Synonym candidates for w = “acı” (suffering)
(i) Ölüm, yangın, deprem vb. olayların yarattığı üzüntü, keder, elem
    The sorrow caused by events such as death, fire, or earthquake; grief, pain
(ii) sıfat Çarpıcı, göz alıcı (renk)
    adjective: striking, eye-catching (color)

Processing the Dictionary: Handling MWEs

Verb stem   Closest translation   MWE count in CDT
et          do                    1227
ol          be                    298
ver         give                  88
gel         come                  85
kal         stay                  58
git         go                    51
yap         do                    45
geç         pass                  43
getir       bring                 30
göster      show                  20
dur         stay, stand           11
kıl         render                5
yaz         write                 2
eyle        do                    1
Total                             1964

Table: Auxiliary verbs in Turkish and their frequencies.

Manual Annotation


Manual Annotation: Inter-annotator agreement

                       A & B   A & E   B & E   E only   Total
Agreement percentage   85.62   3.53    9.72    1.13     100
# of pairs             42615   1759    4838    562      49774

Table: Inter-annotator agreement statistics.

Manual Annotation: Kappa measure

$$p_c(i, j) = \frac{1}{|S_i||S_j| + 1}$$

$$p_c = 0.28, \qquad p_a = 0.85$$

$$\kappa = \frac{p_a - p_c}{1 - p_c} = 0.79$$
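The kappa arithmetic can be checked with a few lines of code. The function names below are illustrative only; the inputs 0.85 and 0.28 are the values reported on the slide.

```python
def chance_agreement(num_senses_i, num_senses_j):
    """Chance agreement for one lemma pair: 1 / (|S_i| * |S_j| + 1)."""
    return 1.0 / (num_senses_i * num_senses_j + 1)


def kappa(p_a, p_c):
    """Agreement corrected for chance: (p_a - p_c) / (1 - p_c)."""
    return (p_a - p_c) / (1.0 - p_c)


# Observed agreement 0.85 and average chance agreement 0.28, as reported.
k = kappa(0.85, 0.28)
print(round(k, 2))
```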

Synset construction

Figure: Network of lemmas; node labels include benzer, titiz, stabil, kibirli, güzel, sevimli, ...

Synset statistics

Figure: The distribution of synset sizes (x-axis: synset size, y-axis: count, logarithmic axes).

Sense Drift

Figure: Network of lemmas; node labels include melal, basit, parlak, güçlü, kadir, becerikli, ...

Random walk: a solution

Figure: Synset size (x-axis) versus count (y-axis), on logarithmic axes.

Semantic relations: Antonym

The most common opposite pattern in CDT is l “karşıtı” (opposite of l), where l is a lemma.
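A pattern of this kind can be matched with a regular expression. The sketch below is an assumption about how such extraction might look, not the method used for KeNet; the function name and the sample definition are illustrative.

```python
import re

# One word followed by the literal "karşıtı" ("opposite of").
# In Python 3, \w matches Unicode letters, so Turkish characters are covered.
ANTONYM_PATTERN = re.compile(r"(\w+) karşıtı")


def antonym_candidates(definition):
    """Return every lemma l that appears as 'l karşıtı' in a definition."""
    return ANTONYM_PATTERN.findall(definition)


# Illustrative definition text for "kısa" (short): "..., opposite of long".
sample = "Boyu az olan, uzun karşıtı"
found = antonym_candidates(sample)
```

This only captures single-word lemmas; multi-word lemmas would need a looser pattern.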

Semantic relations: Hypernym-hyponym

chordate
    vertebrate
        aquatic-vertebrate
            fish
                anchovy, trout

Figure: Example of the hypernym and hyponym relations

Semantic relations: Hypernyms and hyponyms

Pattern in Turkish                     Pattern in English
SUP-A verilen genel (ad, isim)DIr      is the general name given to SUP
bir SUP-DIr                            is a SUP
SUP kavramlarından birisidir           is one of SUP concepts
SUP (çeşidi, türü, birisi)DIr          is a (kind, one of) SUP
SUBlArIn (bütünü, tümü)dür             is the whole of SUBs

Table: Patterns for hypernym candidates.

Semantic relations

Relation   Source      Count
Antonym    CDT         376
Hypernym   CDT         1420
Hypernym   Wikipedia   2764

Table: Statistics for the relation candidates

Semantic relations: Domain

Domain                     Number
Tıp (Medicine)             291
Din (Religion)             248
Müzik (Music)              237
Matematik (Mathematics)    242
Coğrafya (Geography)       202
Edebiyat (Literature)      191
Hukuk (Law)                188
Toplum bilimi (Sociology)  179
Biyoloji (Biology)         176
Anatomi (Anatomy)          158
Akrabalık (Kinship)        53
Eğitim (Education)         52
Asker (Military)           40

Table: Domains from CDT and Vikisözlük

Automatic WordNet construction

Let $S_i(w)$ denote the definition of the $i$th sense of the entry for lemma $w$ as given in CDT.
Let $R$ denote a deterministic rule that generates the list of candidate lemmas from a given sense definition in the dictionary.
For every entry in the dictionary, we define the set $C(w)$ of candidate synonym lemmas for the lemma $w$ as

$$C(w) = \{ v \mid \exists i,\ v \in R(S_i(w)) \}.$$

Automatic WordNet construction

Definition
Literals $w_1$ and $w_2$ are strongly synonymous with respect to dictionary $D$ and rule $R$ if

$$w_1 \in C(w_2) \land w_2 \in C(w_1).$$
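The candidate set and the strong synonymy test can be sketched as follows. The toy dictionary (in English, for readability), the rule `toy_rule`, and all function names are hypothetical; the real rule R operates on CDT definitions.

```python
def toy_rule(definition):
    """Hypothetical rule R: keep comma-separated segments that are single words."""
    parts = [p.strip() for p in definition.split(",")]
    return [p for p in parts if p and " " not in p]


def candidate_map(dictionary, rule):
    """C(w): union of rule(S_i(w)) over all sense definitions S_i(w) of w."""
    return {
        lemma: {v for definition in senses for v in rule(definition)}
        for lemma, senses in dictionary.items()
    }


def strongly_synonymous(C, w1, w2):
    """True when w1 is a candidate of w2 and w2 is a candidate of w1."""
    return w1 in C.get(w2, set()) and w2 in C.get(w1, set())


# Toy dictionary: lemma -> list of sense definitions (illustrative only).
toy_dictionary = {
    "rapport": ["harmony, concord"],
    "concord": ["agreement between people, rapport"],
    "harmony": ["pleasant combination of sounds"],
}
C = candidate_map(toy_dictionary, toy_rule)
```

Here "rapport" and "concord" point at each other, while "harmony" is only pointed at, so only the first pair passes the symmetric test.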

Automatic WordNet construction

A weaker definition allows longer cycles in mapping lemmas to synonym candidates. For this, we first define the longer synonym candidacy relation among lemmas. Let us define the set of n-synonym candidates $C_n(x_0)$ of the lemma $x_0$ as

$$C_n(x_0) = \{ x_n \mid \exists x_1, x_2, \ldots, x_{n-1}, \text{ where } x_i \neq x_j \text{ for } i \neq j,\ 1 \le i, j < n,\ x_1 \in C(x_0),\ x_2 \in C(x_1),\ \ldots,\ x_n \in C(x_{n-1}) \}. \quad (1)$$

Combining all the paths up to length $n$, we define the weakly n-synonym candidate set $\overline{C}_n$ for a lemma $x_0$ as

$$\overline{C}_n(x_0) = C_1(x_0) \cup C_2(x_0) \cup \ldots \cup C_n(x_0).$$

Automatic WordNet construction

Now we can define weak n-synonymy.

Definition
Two lemmas $w_1$ and $w_2$ are weakly n-synonymous with respect to dictionary $D$ and rule $R$ if

$$w_1 \in \overline{C}_n(w_2) \land w_2 \in \overline{C}_n(w_1),$$

where $\overline{C}_n$ is the weakly n-synonym candidate set, the union of the candidate sets over all paths up to length $n$.
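Weak n-synonymy can be sketched as a bounded path search over the candidate map. The code below is one possible reading of the definition, in which only the intermediate lemmas on a path must be pairwise distinct; the function names and the toy candidate map are hypothetical.

```python
def n_candidates(C, x0, n):
    """Lemmas reachable from x0 in exactly n candidate steps,
    with pairwise distinct intermediate lemmas x_1 .. x_{n-1}."""
    found = set()

    def walk(node, steps_left, used):
        for nxt in C.get(node, ()):
            if steps_left == 1:
                found.add(nxt)           # nxt plays the role of x_n
            elif nxt not in used:
                walk(nxt, steps_left - 1, used | {nxt})

    walk(x0, n, frozenset())
    return found


def weak_candidates(C, x0, n):
    """Union of the exact k-step candidate sets for k = 1 .. n."""
    result = set()
    for k in range(1, n + 1):
        result |= n_candidates(C, x0, k)
    return result


def weakly_synonymous(C, w1, w2, n):
    return w1 in weak_candidates(C, w2, n) and w2 in weak_candidates(C, w1, n)


# Toy candidate map forming a cycle a -> b -> c -> a (illustrative only).
C = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
```

In the toy cycle no pair is strongly synonymous, yet every pair becomes weakly 2-synonymous, which illustrates how longer paths relax the criterion.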

Automatic Construction: Automatic thesaurus

There are two rules for extracting synonym candidates for a lemma:
R1: The front part of the neck and the organs forming this part, maw, jugular
R2: rapport, environment created by mutual understanding and tolerance, concord
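The slide gives one example definition per rule without spelling the rules out. As a hedged sketch, the two functions below implement two plausible splitting strategies: `right_split` keeps only the trailing single-word segments of a definition, and `comma_split` keeps every single-word segment. Both names, and the restriction to single words, are assumptions; the sketch does not claim which function corresponds to R1 or R2.

```python
def _segments(definition):
    """Comma-separated, whitespace-trimmed, non-empty pieces of a definition."""
    return [part.strip() for part in definition.split(",") if part.strip()]


def comma_split(definition):
    """Every comma-separated segment that consists of a single word."""
    return [s for s in _segments(definition) if " " not in s]


def right_split(definition):
    """Only the run of single-word segments at the right end of the definition."""
    trailing = []
    for segment in reversed(_segments(definition)):
        if " " in segment:
            break
        trailing.append(segment)
    return trailing[::-1]


first = "The front part of the neck and the organs forming this part, maw, jugular"
second = "rapport, environment created by mutual understanding and tolerance, concord"
```

On the second example the two strategies differ: the leading word "rapport" is picked up by `comma_split` but not by `right_split`.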

Automatic Construction: Automatic thesaurus

• Strong synonymy
• Weak n-synonymy

Automatic Construction: Automatic thesaurus

Figure: Synset size (x-axis) versus count (y-axis) for rules R1 and R2, on logarithmic axes.

Comparison of Synsets


Comparison of Synsets: Variation of information

$$VI(X, Y) = -\sum_{i,j} r_{ij} \left[ \log\frac{r_{ij}}{p_i} + \log\frac{r_{ij}}{q_j} \right],$$

where $p_i = \frac{|X_i|}{n}$, $q_j = \frac{|Y_j|}{n}$ and $r_{ij} = \frac{|X_i \cap Y_j|}{n}$.

$$VI(X, Y) = H(X|Y) + H(Y|X)$$
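The variation of information between two partitions of the same lemma set can be computed directly from the formula. The sketch below assumes each partition is given as a list of disjoint Python sets covering the same n elements; the function name and the toy partitions are illustrative.

```python
import math


def variation_of_information(X, Y):
    """VI between two partitions X and Y of the same n elements.

    X and Y are lists of disjoint sets; natural logarithms are used.
    """
    n = sum(len(block) for block in X)
    vi = 0.0
    for x_block in X:
        p = len(x_block) / n
        for y_block in Y:
            q = len(y_block) / n
            r = len(x_block & y_block) / n
            if r > 0:  # empty intersections contribute nothing
                vi -= r * (math.log(r / p) + math.log(r / q))
    return vi


fine = [{1, 2}, {3, 4}]
coarse = [{1, 2, 3, 4}]
same = variation_of_information(fine, fine)
split = variation_of_information(fine, coarse)
```

Identical partitions give a distance of zero, and splitting one block into two equal halves gives ln 2 with natural logarithms.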

Comparison of Synsets: Variation of information

Table: Variation of information among different synset construction methods.

        Synset construction method
        BS      ASR1    ASR2    AMR1    AMR2
MS      0.138   0.066   0.100   0.527   0.326
BS              0.134   0.161   0.607   0.384
ASR1                    0.030   0.241   0.155
ASR2                            0.265   0.158
AMR1                                    0.272

Comparison of Synsets

Among the automated methods, the variation distances seem to align them on a line as ASR2 < ASR1 < AMR2 < AMR1. ASR2 and ASR1 are quite similar since they confine their search to the set of lemmas that have a single sense. AMR2 and AMR1, on the other hand, are not as close. This is intuitively expected: when we consider multiple senses, determining synonym candidates with comma splitting or right splitting tends to make a larger difference in the resulting synsets.

The same alignment can be observed when we compare BS and MS to the automated methods. For both, the automated method that comes closest is ASR1.

Finally, we see that BS and MS are quite similar when projected onto the set of single-sensed lemmas appearing in BalkaNet.

Second Part: Graph-based Analysis using Semantic Relations

• Semantic clustering
• Preprocessing text
• Constructing textual graph
• Representing text
• Disambiguating synsets
• Representatives for synsets
• Co-occurrence graph
• Textual graph analysis
• Experimental results for clustering headlines
• Page2Vec algorithm
• Experimental Results
• Conclusions

Understanding similarity in context


Semantic clustering


Preprocessing text

• Morphological analyzer
• Morphological disambiguation
• Convert to dictionary entry
• nisan: nisan+Noun+A3sg+Pnon+Nom, nisa+Noun+A3sg+P2sg+Nom
• yazdı: yaz+Verb+Pos+Past+A3sg, yaz+Noun+A3sg+Pnon+Nom^DB+Verb+Zero+Past+A3sg
• yaz + mAk
• Stop words
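The conversion from a disambiguated analysis to a dictionary entry can be sketched as below. The sketch assumes that an analysis is a `+`-separated string whose first field is the root and whose second field is the part of speech, and that `mAk` stands for the infinitive suffix whose vowel follows Turkish vowel harmony (-mak after a back vowel, -mek after a front vowel). The function names are hypothetical, and derived analyses containing `^DB` are not handled.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")


def parse_analysis(analysis):
    """Split 'yaz+Verb+Pos+Past+A3sg' into its root and first tag (the POS)."""
    root, *tags = analysis.split("+")
    pos = tags[0] if tags else None
    return root, pos


def dictionary_form(root, pos):
    """Verbs are listed with the infinitive suffix; other words as the bare root."""
    if pos != "Verb":
        return root
    # The last vowel of the root decides between -mak and -mek.
    for ch in reversed(root):
        if ch in BACK_VOWELS:
            return root + "mak"
        if ch in FRONT_VOWELS:
            return root + "mek"
    return root + "mek"  # fallback for vowel-less input (assumption)


verb_entry = dictionary_form(*parse_analysis("yaz+Verb+Pos+Past+A3sg"))
noun_entry = dictionary_form(*parse_analysis("nisan+Noun+A3sg+Pnon+Nom"))
```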

Constructing textual graph

• Semantic relations
• Frequency
• Occurrence relations

Representing text

• pear, apple
• despotism, freedom
• compatriot, citizen

Disambiguating synsets

• ekmek-noun
• ekmek-verb

Representatives for synsets

Betweenness centrality


Co-occurrence graph


Textual graph analysis

• Cumhuriyet: left liberal newspaper
• Hürriyet: right liberal newspaper
• Yeni Akit: fundamentalist newspaper
• Yeni Şafak: fundamentalist newspaper
• Aydınlık: nationalist newspaper

Generalized Jaccard similarity

We used high PageRank scores to measure the similarity between two texts.
If $X = (x_1, x_2, x_3, \ldots, x_n)$ and $Y = (y_1, y_2, y_3, \ldots, y_n)$ are two vectors with $x_i, y_i \ge 0$, their Jaccard similarity is defined as:

$$J(X, Y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$$
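A short sketch of the generalized Jaccard similarity follows, assuming the standard form: the sum of element-wise minima divided by the sum of element-wise maxima. The function name and the convention of returning 1.0 for two all-zero vectors are assumptions.

```python
def generalized_jaccard(x, y):
    """Generalized Jaccard similarity of two non-negative vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    # Two all-zero vectors are treated as identical (assumed convention).
    return numerator / denominator if denominator else 1.0


score = generalized_jaccard([1, 2, 0], [2, 1, 1])  # (1 + 1 + 0) / (2 + 2 + 1)
```

With 0/1 vectors this reduces to the ordinary set-based Jaccard index.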

PageRank

PageRank (Sergey Brin and Larry Page)
• The number of links a node receives
• The link propensity of the linkers
• The centrality of the linkers

The word with the highest PageRank score for Aydınlık (nationalist) is “homeland” and for Cumhuriyet (left liberal) is “republic”.
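For illustration, PageRank on a small undirected graph can be computed by power iteration. The slides do not state which implementation or parameters were used, so the damping factor 0.85, the fixed iteration count, and the toy graph below are assumptions.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    `adjacency` maps each node to the set of its neighbours; every
    neighbour must also appear as a key.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            neighbours = adjacency[v]
            if neighbours:
                share = damping * rank[v] / len(neighbours)
                for u in neighbours:
                    new_rank[u] += share
            else:
                # An isolated node spreads its score uniformly.
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
        rank = new_rank
    return rank


# Toy star graph: "republic" co-occurs with three other words.
toy_graph = {
    "republic": {"day", "people", "history"},
    "day": {"republic"},
    "people": {"republic"},
    "history": {"republic"},
}
scores = pagerank(toy_graph)
top_word = max(scores, key=scores.get)
```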

Experimental results for clustering headlines

(a) 15 words with the highest PageRank scores and weighted Jaccard
(b) 30 words with the highest PageRank scores and weighted Jaccard
(c) All words and weighted Jaccard
(d) Basic approach: all words with simple Jaccard (no PageRank)

Results

19 May, Commemoration of Ataturk


Results

23 April, National Sovereignty and Children’s Day


Results

15 July, “Coup” day


Results

16 July, 1 day after “Coup”, Ban!!!


Results

17 July, 2 days after “Coup” day


Page2Vec Algorithm

In order to transform a given text $T$ into its co-occurrence graph, we do the following:

(i) Convert each word $w_i \in T$ to its representative $r_i$.
(ii) Assign the $r_i$'s as the nodes of the graph.
(iii) For every pair of consecutive words $w_i$ and $w_j$ in a sentence of the original text $T$, draw an undirected edge between the corresponding nodes $r_i$ and $r_j$.
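The three steps can be sketched as follows. The input format (a list of tokenized sentences), the `representative` mapping from words to synset representatives, and the choice to skip self-loops are assumptions made for this illustration.

```python
def cooccurrence_graph(sentences, representative):
    """Build an undirected graph as a dict: node -> set of neighbouring nodes.

    `sentences` is a list of token lists; `representative` maps a word to
    its synset representative (words without an entry represent themselves).
    """
    graph = {}
    for sentence in sentences:
        # Step (i): replace every word by its representative.
        reps = [representative.get(word, word) for word in sentence]
        # Step (ii): every representative becomes a node.
        for rep in reps:
            graph.setdefault(rep, set())
        # Step (iii): connect representatives of consecutive words.
        for left, right in zip(reps, reps[1:]):
            if left != right:  # assumption: no self-loops
                graph[left].add(right)
                graph[right].add(left)
    return graph


toy_sentences = [
    ["compatriot", "loves", "freedom"],
    ["citizen", "loves", "republic"],
]
toy_representative = {"compatriot": "citizen"}
toy = cooccurrence_graph(toy_sentences, toy_representative)
```

Because "compatriot" maps to "citizen", both sentences contribute edges to the same node.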

Page2Vec Algorithm

• Significant nodes in the graph
• The highest PageRank scores
• Every text can be represented by a weighted sum of those unified vectors
• Weights are the corresponding PageRank scores

Page2Vec Algorithm

Page2Vec Algorithm. Input: text $T$
• Construct the co-occurrence graph of text $T$.
• Extract the top $x$ = NumberOfWords representatives $r_i$ based on their PageRank scores $\sigma(r_i)$. Call this set $\Delta_T(x)$.
• Translate each representative word $r_i$ to a vector $\vec{v}_i$, where the columns are 320 hypernyms and 80 categories.
• Vectorize the text $T$ as follows:

$$\vec{T}(x) = \sum_{r_i \in \Delta_T(x)} \sigma(r_i)\, \vec{v}_i$$
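The final vectorization step is a score-weighted sum of feature vectors. In the sketch below the PageRank scores and the per-representative vectors are supplied as plain dictionaries; the function name, the toy data, and the three-dimensional vectors (instead of 320 hypernyms plus 80 categories) are illustrative assumptions.

```python
def page2vec(scores, vectors, x):
    """Weighted sum of the vectors of the x highest-scoring representatives.

    `scores` maps representative -> PageRank score sigma(r_i);
    `vectors` maps representative -> feature vector v_i (equal lengths).
    """
    top = sorted(scores, key=scores.get, reverse=True)[:x]
    dimension = len(next(iter(vectors.values())))
    text_vector = [0.0] * dimension
    for rep in top:
        weight = scores[rep]
        for k, value in enumerate(vectors[rep]):
            text_vector[k] += weight * value
    return text_vector


toy_scores = {"republic": 0.5, "homeland": 0.3, "day": 0.2}
toy_vectors = {
    "republic": [1.0, 0.0, 0.0],
    "homeland": [0.0, 1.0, 0.0],
    "day": [0.0, 0.0, 1.0],
}
vector = page2vec(toy_scores, toy_vectors, x=2)
```

With x = 2 the lowest-scoring representative is dropped, so its dimension stays at zero.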

Page2Vec algorithm


K-means clustering over Page2Vec outputs


K-means clustering over Doc2Vec outputs


Hierarchical clustering over Page2Vec outputs


Hierarchical clustering over Doc2Vec outputs


Thanks
