KeNet: A COMPREHENSIVE TURKISH WORDNET AND SOME APPLICATIONS
Razieh Ehsani
May 7, 2018
Razieh Ehsani KeNet: A COMPREHENSIVE TURKISH WORDNET AND SOME APPLICATIONSMay 7, 2018 1 / 69
Introduction
• Constructing a WordNet for Turkish Using Manual and Automatic Annotation
  • ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 2018
  • 25th Signal Processing and Communications Applications Conference (SIU), 2017
• Clustering texts using WordNet relations
  • To be submitted
Turkish WordNet Construction
• Introduction
• WordNets
• Manual WordNet construction
• Lexical resources
• Processing the dictionary
• Synonym candidates
• Manual annotation
• Inter-annotator agreement
• Synset construction
• Synset statistics
• Semantic relations
• Automatic WordNet construction
• Comparison of Synsets
WordNet
• A WordNet is a graph data structure
• Nodes are word senses with their associated lemmas
• Edges are semantic relations between sense pairs
• An edge is a triple such as $(w_5^2, w_7^3, r_1)$
  • $w_5^2$ is the second meaning (sense) of the word $w_5$
  • $r_1$ is the semantic relation between $w_5^2$ and $w_7^3$
• A relation $r$ can be one of synonym, hypernym, hyponym, antonym, meronym, ...
• The direction of the relation is implicit in the ordering of the elements of the triple.
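The triple view of a WordNet can be sketched in a few lines of Python. This is only an illustration: the class names `Sense` and `WordNetGraph`, the relation labels as plain strings, and the two example senses are assumptions made for the sketch, not KeNet's actual data model.

```python
from collections import defaultdict
from typing import NamedTuple


class Sense(NamedTuple):
    """A node of the graph: a lemma plus the index of one of its meanings."""
    lemma: str
    index: int  # 1-based sense number (hypothetical convention)


class WordNetGraph:
    """Stores directed (source, target, relation) triples."""

    def __init__(self):
        # source sense -> list of (relation, target sense)
        self._edges = defaultdict(list)

    def add(self, source, target, relation):
        # The direction is implicit in the argument order, as in the triple.
        self._edges[source].append((relation, target))

    def related(self, source, relation):
        return [t for r, t in self._edges.get(source, []) if r == relation]


# Hypothetical example senses.
graph = WordNetGraph()
fish = Sense("fish", 1)
trout = Sense("trout", 1)
graph.add(trout, fish, "hypernym")  # fish is a hypernym of trout
graph.add(fish, trout, "hyponym")   # trout is a hyponym of fish
```

Storing the inverse relation as a separate triple keeps lookups simple; a real implementation might derive it instead.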
WordNets
• PWN, bottom-up, 117 000 synsets
• EWN, top-down
• BalkaNet, top-down
• Arabic WordNet, top-down
• Finnish WordNet, top-down
• Polish WordNet, top-down
And the newborn WordNet: KeNet, bottom-up
BalkaNet & KeNet
POS tag     # of synsets in KeNet   # of synsets in BalkaNet
Noun        66 266                  10 370
Verb        25 170                  2 359
Adjective   12 932                  770
Adverb      2 587                   40
Other       6 262                   -
Total       113 217                 13 499

Table: POS tag distribution of KeNet and BalkaNet
Lexical resources
Contemporary Dictionary of Turkish (CDT)
• domain: 40 domains
• default POS
• origin (14,400): Arabic (6,044), French (4,920), Persian (1,855), Italian (606), English (458), and Greek (382)
• pronunciation
• context: argot, mockery, ...
• senses
Lexical resources
Field     Possible values
domain    anatomy, anthropology, military, computer science, botanic, biology, geography, maritime, grammar, linguistics, theology, literature, economics, pedagogy, philosophy, physics, physiology, geometry, astronomy, zoology, law, geology, chemistry, mining, logic, mathematics, meteorology, architecture, mineralogy, music, psychology, cinema, sports, history, technical, commerce, theater, sociology, TV, medicine
POS       verb, auxiliary verb, conjunction, postposition, common noun, adjective, pronoun, adverb, proper noun, exclamation
context   mocking, argot, old usage, insult, popular, vulgar, metaphor, familiar, joking

Table: Field values for CDT
Problems with the resource: Sense granularity
Sense inflation of the word “yüz” (hundred)
(i) The name of the number after ninety nine.
(ii) The name of the numerals 100 and C that denote this number.
(iii) Ten times ten, one more than ninety nine.
(iv) A word that, when used together with “times” and “fold”, exaggeratedly expresses the multitude of something done.
Problems with the resource: Productive derivations
Trivial derived nodes
• Comprehension rather than terseness?
• 5400 of them have only the obvious nominal sense
  “sor-ma” (the act of asking)
  “sor-dur” (to cause (someone) to ask)
  “sor-ul” (to be asked)
Problems with the resource: Homonyms vs. Senses
• Homonyms vs. Senses
• Maximum 5 homonyms
Processing the Dictionary
Processing the Dictionary: Synonym candidates
Synonym candidates for w = acı (suffering)
(i) Ölüm, yangın, deprem vb. olayların yarattığı üzüntü, keder, elem
    The feeling caused by events such as death, fire, or earthquake; grief, pain
(ii) sıfat çarpıcı, göz alıcı (renk)
    Stunning, attractive (color)
Processing the Dictionary: Handling MWEs
Verb stem   Closest translation   MWE count in CDT
et          do                    1227
ol          be                    298
ver         give                  88
gel         come                  85
kal         stay                  58
git         go                    51
yap         do                    45
geç         pass                  43
getir       bring                 30
göster      show                  20
dur         stay, stand           11
kıl         render                5
yaz         write                 2
eyle        do                    1
Total                             1964

Table: Auxiliary verbs in Turkish and their frequencies.
Manual Annotation
Manual Annotation: Inter-annotator agreement
                       A & B    A & E   B & E   E only   Total
agreement percentage   85.62    3.53    9.72    1.13     100
# of pairs             42615    1759    4838    562      49774

Table: Inter-annotator agreement statistics.
Manual Annotation: Kappa measure
$$p_c(i, j) = \frac{1}{|S_i||S_j| + 1}$$
$$p_c = 0.28, \qquad p_a = 0.85$$
$$\kappa = \frac{p_a - p_c}{1 - p_c} = 0.79$$
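The arithmetic behind the kappa value can be checked with a short Python sketch. The function names are illustrative only; the inputs 0.85 and 0.28 are the observed and chance agreement reported on the slide.

```python
def chance_agreement(n_senses_i: int, n_senses_j: int) -> float:
    """Chance agreement for one lemma pair, p_c(i, j) = 1 / (|S_i||S_j| + 1)."""
    return 1.0 / (n_senses_i * n_senses_j + 1)


def kappa(p_a: float, p_c: float) -> float:
    """Kappa: agreement above chance, normalised by the maximum possible."""
    return (p_a - p_c) / (1.0 - p_c)


# Values reported on the slide: p_a = 0.85, p_c = 0.28.
print(round(kappa(0.85, 0.28), 2))  # 0.79
```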
Synset construction
[Figure: lemma graph illustrating synset construction; node labels omitted]
Synset statistics
Figure: The distribution of synset sizes (x axis: synset size; y axis: count; both axes logarithmic).
Sense Drift
[Figure: lemma graph illustrating sense drift; node labels omitted]
Random walk: a solution
[Figure: distribution of synset sizes; x axis: synset size; y axis: count; both axes logarithmic]
Semantic relations: Antonym
The most common antonym pattern in CDT is “l karşıtı” (opposite of l), where l is a lemma.
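A pattern of the form “l karşıtı” can be matched with a regular expression. The sketch below is a minimal illustration: the sample definition string is hypothetical (it is not quoted from CDT), and a real extractor would still need to check that the captured word is a dictionary lemma.

```python
import re
from typing import List

# One word followed by "karşıtı" ("opposite of").
ANTONYM_PATTERN = re.compile(r"(\w+)\s+karşıtı")


def antonym_candidates(definition: str) -> List[str]:
    """Return every word that directly precedes 'karşıtı' in a definition."""
    return ANTONYM_PATTERN.findall(definition)


# Hypothetical definition text for the lemma "iyi" (good).
sample = "İstenilen nitelikleri taşıyan, kötü karşıtı"
candidates = antonym_candidates(sample)
```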
Semantic relations: Hypernym-hyponym
chordate
  vertebrate
    aquatic-vertebrate
      fish
        anchovy, trout
Figure: Example of the hypernym-hyponym relation
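The chain in the figure can be walked programmatically. The sketch below stores one hypernym per term in a plain dictionary, which is a simplifying assumption: a real WordNet attaches hypernyms to synsets and allows more than one.

```python
# Child -> parent links taken from the example figure.
HYPERNYM = {
    "anchovy": "fish",
    "trout": "fish",
    "fish": "aquatic-vertebrate",
    "aquatic-vertebrate": "vertebrate",
    "vertebrate": "chordate",
}


def hypernym_chain(term):
    """Follow hypernym links upward until a term without a parent is reached."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


trout_chain = hypernym_chain("trout")
```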
Semantic relations: Hypernyms and hyponyms
Pattern in Turkish                   Pattern in English
SUP-A verilen genel (ad, isim)DIr    is the general name given to SUP
bir SUP-DIr                          is a SUP
SUP kavramlarından birisidir         is one of SUP concepts
SUP (çeşidi, türü, birisi)DIr        is a (kind, one of) SUP
SUBlArIn (bütünü, tümü)dür           is the whole of SUBs

Table: Patterns for hypernym candidates.
Semantic relations
Relation    Source      Count
Antonym     CDT         376
Hypernym    CDT         1420
Hypernym    Wikipedia   2764

Table: Statistics for the relation candidates
Semantic relations: Domain
Domain                       Count
Tıp (Medicine)               291
Din (Religion)               248
Müzik (Music)                237
Matematik (Mathematics)      242
Coğrafya (Geography)         202
Edebiyat (Literature)        191
Hukuk (Law)                  188
Toplum bilimi (Sociology)    179
Biyoloji (Biology)           176
Anatomi (Anatomy)            158
Akrabalık (Kinship)          53
Eğitim (Education)           52
Asker (Military)             40

Table: Domains from CDT and Vikisözlük
Automatic WordNet construction
Let $S_i(w)$ denote the definition of the $i$th sense of the entry for lemma $w$, as given in CDT.
Let $R$ denote a deterministic rule that generates the list of candidate lemmas from a given sense definition in the dictionary.
For every entry in the dictionary, we define the set $C(w)$ of candidate synonym lemmas for the lemma $w$ as
$$C(w) = \{ v \mid \exists i,\; v \in R(S_i(w)) \}.$$
Automatic WordNet construction
Definition
Literals $w_1$ and $w_2$ are strongly synonymous with respect to dictionary $D$ and rule $R$ if
$$w_1 \in C(w_2) \wedge w_2 \in C(w_1).$$
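Strong synonymy is a mutual-membership test on the candidate sets. The sketch below assumes the sets C(w) have already been computed and are held in a dictionary; the toy entries are illustrative and are not taken from CDT.

```python
def strongly_synonymous(w1, w2, candidates):
    """True when each lemma appears in the other's candidate set C(.)."""
    return w1 in candidates.get(w2, set()) and w2 in candidates.get(w1, set())


def strong_pairs(candidates):
    """All unordered pairs of distinct lemmas that are strongly synonymous."""
    pairs = set()
    for w, cands in candidates.items():
        for v in cands:
            if v != w and w in candidates.get(v, set()):
                pairs.add(tuple(sorted((w, v))))
    return pairs


# Toy candidate sets C(w), for illustration only.
C = {
    "keder": {"üzüntü", "elem"},
    "üzüntü": {"keder"},
    "elem": {"acı"},
}
```

With these toy sets, "keder" and "üzüntü" name each other, while "elem" does not name "keder" back, so only the first pair qualifies.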
Automatic WordNet construction
A weaker definition allows longer cycles in mapping lemmas to synonym candidates. For this, we first define the longer synonym candidacy relation among lemmas. Let us define the set of n-synonym candidates $C_n(x_0)$ of the lemma $x_0$ as
$$C_n(x_0) = \{ x_n \mid \exists x_1, x_2, \ldots, x_{n-1},\ \text{where } x_i \neq x_j \text{ for } i \neq j,\ 1 \le i, j < n,\ x_1 \in C(x_0),\ x_2 \in C(x_1),\ \ldots,\ x_n \in C(x_{n-1}) \}. \quad (1)$$
Combining all the paths up to length $n$, we define the weakly n-synonym candidate set for a lemma $x_0$, written $\mathcal{C}_n$ here to distinguish it from $C_n$, as
$$\mathcal{C}_n(x_0) = C_1(x_0) \cup C_2(x_0) \cup \ldots \cup C_n(x_0).$$
Now we can define weak n-synonymy.

Definition
Two lemmas $w_1$ and $w_2$ are weakly n-synonymous with respect to dictionary $D$ and rule $R$ if
$$w_1 \in \mathcal{C}_n(w_2) \wedge w_2 \in \mathcal{C}_n(w_1).$$
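The weak n-synonymy definition can be sketched as a bounded path search over the candidate sets. The code assumes the sets C(w) are given as a dictionary, and it applies the distinctness constraint only to the intermediate lemmas, as in the definition of the n-synonym candidates; the three-lemma cycle is a toy example.

```python
def n_candidates(x0, candidates, n):
    """Lemmas reachable from x0 in exactly n candidate steps, where the
    intermediate lemmas x_1 .. x_{n-1} are pairwise distinct."""
    found = set()

    def walk(current, steps_left, visited):
        for nxt in candidates.get(current, ()):
            if steps_left == 1:
                found.add(nxt)  # endpoint x_n
            elif nxt not in visited:
                walk(nxt, steps_left - 1, visited | {nxt})

    walk(x0, n, frozenset())
    return found


def weak_candidates(x0, candidates, n):
    """Union of the k-synonym candidate sets for k = 1 .. n."""
    result = set()
    for k in range(1, n + 1):
        result |= n_candidates(x0, candidates, k)
    return result


def weakly_synonymous(w1, w2, candidates, n):
    return (w1 in weak_candidates(w2, candidates, n)
            and w2 in weak_candidates(w1, candidates, n))


# Toy cycle a -> b -> c -> a.
C = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
```

In the toy cycle, "a" and "b" are not strongly synonymous, but "a" is reachable from "b" in two steps, so the pair becomes weakly 2-synonymous.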
Automatic Construction: Automatic thesaurus
There are two rules for extracting synonym candidates for a lemma from a sense definition:
R1: The front part of the neck and the organs forming this part, maw, jugular
R2: rapport, environment created by mutual understanding and tolerance, concord
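The slides do not spell out the two rules, so the following Python sketch is only one plausible reading, based on the terms "comma splitting" and "right splitting" used later in the deck. The function names, the `max_words` threshold, and the mapping of either function to R1 or R2 are all assumptions.

```python
def comma_split(definition, max_words=2):
    """Assumed rule: every short comma-separated segment is a candidate."""
    segments = [s.strip() for s in definition.split(",")]
    return [s for s in segments if s and len(s.split()) <= max_words]


def right_split(definition, max_words=2):
    """Assumed rule: only the short segments at the right end are candidates."""
    segments = [s.strip() for s in definition.split(",")]
    trailing = []
    for s in reversed(segments):
        if s and len(s.split()) <= max_words:
            trailing.append(s)
        else:
            break  # stop at the first long segment from the right
    return list(reversed(trailing))


# The two example definitions from the slide.
d1 = "The front part of the neck and the organs forming this part, maw, jugular"
d2 = "rapport, environment created by mutual understanding and tolerance, concord"
```

Under these assumptions both functions extract "maw" and "jugular" from the first definition, while only `comma_split` also picks up the leading "rapport" in the second.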
Automatic Construction: Automatic thesaurus
• Strong synonymy
• Weak n-synonymy
Automatic Construction: Automatic thesaurus
[Figure: distribution of synset sizes for rules R1 and R2; x axis: synset size; y axis: count; both axes logarithmic]
Comparison of Synsets
Comparison of Synsets: Variation of information
$$VI(X, Y) = -\sum_{i,j} r_{ij} \left[ \log\frac{r_{ij}}{p_i} + \log\frac{r_{ij}}{q_j} \right],$$
where $p_i = \frac{|X_i|}{n}$, $q_j = \frac{|Y_j|}{n}$ and $r_{ij} = \frac{|X_i \cap Y_j|}{n}$.
$$VI(X, Y) = H(X|Y) + H(Y|X)$$
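The variation of information between two partitions of the same n items follows directly from the formula. The sketch below represents each partition as a list of Python sets and uses natural logarithms; the base of the logarithm is an assumption, since the slide does not state it.

```python
import math


def variation_of_information(X, Y):
    """VI between two partitions X and Y of the same items (lists of sets)."""
    n = sum(len(block) for block in X)
    vi = 0.0
    for Xi in X:
        p = len(Xi) / n
        for Yj in Y:
            r = len(Xi & Yj) / n
            if r > 0:  # empty intersections contribute nothing
                q = len(Yj) / n
                vi -= r * (math.log(r / p) + math.log(r / q))
    return vi


# Toy partitions of the items 1..4.
split = [{1, 2}, {3, 4}]
merged = [{1, 2, 3, 4}]
vi_value = variation_of_information(split, merged)
```

Identical partitions give 0, and splitting one block into two equal halves gives log 2, which matches the decomposition into the two conditional entropies.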
Comparison of Synsets: Variation of information
Table: Variation of information among different synset construction methods.

Synset construction method
        BS      ASR1    ASR2    AMR1    AMR2
MS      0.138   0.066   0.100   0.527   0.326
BS              0.134   0.161   0.607   0.384
ASR1                    0.030   0.241   0.155
ASR2                            0.265   0.158
AMR1                                    0.272
Comparison of Synsets
Among the automated methods, the variation distances seem to align them on a line as ASR2 < ASR1 < AMR2 < AMR1. ASR2 and ASR1 are quite similar (distance 0.030), since both confine their search to the set of lemmas that have a single sense. AMR2 and AMR1, on the other hand, are not as close (0.272). This is intuitively expected: when multiple senses are considered, determining synonym candidates with comma splitting or with right splitting tends to make a larger difference in the resulting synsets.
The same alignment can be observed when we compare BS and MS to the automated methods. For both, the automated method that comes closest is ASR1.
Finally, we see that BS and MS are quite similar when projected onto the set of single-sensed lemmas appearing in BalkaNet.
Second Part: Graph-based Analysis using Semantic Relations
• Semantic clustering
• Preprocessing text
• Constructing textual graph
• Representing text
• Disambiguating synsets
• Representatives for synsets
• Co-occurrence graph
• Textual graph analysis
• Experimental results for clustering headlines
• Page2Vec algorithm
• Experimental Results
• Conclusions
Understanding similarity in context
Semantic clustering
Preprocessing text
• Morphological analyzer
• Morphological disambiguation
• Convert to dictionary entry
  • nisan: nisan+Noun+A3sg+Pnon+Nom, nisa+Noun+A3sg+P2sg+Nom
  • yazdı: yaz+Verb+Pos+Past+A3sg, yaz+Noun+A3sg+Pnon+Nom^DB+Verb+Zero+Past+A3sg
  • yaz + mAk
• Stop words
Constructing textual graph
• Semantic relations
• Frequency
• Occurrence relations
Representing text
• pear, apple
• despotism, freedom
• compatriot, citizen
Disambiguating synsets
• ekmek-noun
• ekmek-verb
Representatives for synsets
Betweenness centrality
Co-occurrence graph
Textual graph analysis
• Cumhuriyet: left liberal newspaper
• Hürriyet: right liberal newspaper
• Yeni Akit: fundamentalist newspaper
• Yeni Şafak: fundamentalist newspaper
• Aydınlık: nationalist newspaper
Generalized Jaccard similarity
We used high PageRank scores to measure similarity between two texts.
If $X = (x_1, x_2, x_3, \ldots, x_n)$ and $Y = (y_1, y_2, y_3, \ldots, y_n)$ are two vectors with $x_i, y_i \ge 0$, their Jaccard similarity is defined as:
$$J(X, Y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$$
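A minimal Python sketch of the generalized (weighted) Jaccard similarity follows, read as the ratio of the summed minima to the summed maxima. Returning 0.0 when both vectors are all zeros is a convention chosen for this sketch, not something stated on the slide.

```python
def generalized_jaccard(x, y):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    # Assumed convention: two all-zero vectors have similarity 0.0.
    return numerator / denominator if denominator > 0 else 0.0


similarity = generalized_jaccard([1.0, 2.0, 0.0], [2.0, 1.0, 0.0])
```

For 0/1 vectors this reduces to the ordinary set-based Jaccard index, which is the "simple Jaccard" baseline.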
PageRank
PageRank (Sergey Brin and Larry Page) takes into account:
• The number of links a node receives
• The link propensity of the linkers
• The centrality of the linkers
The word with the highest PageRank score for Aydınlık (nationalist) is “homeland” and for Cumhuriyet (left liberal) is “republic”.
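PageRank scores on a small word graph can be computed by power iteration. The sketch below is a plain-Python illustration; the damping factor 0.85, the fixed iteration count, and the toy star graph are assumptions, and it expects every neighbour to appear as a key of the adjacency dictionary.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict: node -> list of neighbours."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            neighbours = adjacency[v]
            if neighbours:
                share = damping * rank[v] / len(neighbours)
                for u in neighbours:
                    new_rank[u] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
        rank = new_rank
    return rank


# Toy undirected star graph: "homeland" is linked to three other words.
graph = {
    "homeland": ["nation", "flag", "soil"],
    "nation": ["homeland"],
    "flag": ["homeland"],
    "soil": ["homeland"],
}
scores = pagerank(graph)
```

The hub of the star receives the highest score, which mirrors how a frequently co-occurring word rises to the top of a newspaper's ranking.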
Experimental results for clustering headlines
(a) 15 words with the highest PageRank scores and weighted Jaccard
(b) 30 words with the highest PageRank scores and weighted Jaccard
(c) All words and weighted Jaccard
(d) Basic approach: all words with simple Jaccard (no PageRank)
Results (figures omitted), one slide per date:
• 19 May, Commemoration of Atatürk
• 23 April, National Sovereignty and Children’s Day
• 15 July, “Coup” day
• 16 July, 1 day after “Coup”, Ban!!!
• 17 July, 2 days after “Coup” day
Page2Vec Algorithm
In order to transform a given text T into its co-occurrence graph, we do the following:
(i) Convert each word w_i ∈ T to its representative r_i.
(ii) Assign the r_i’s as the nodes of the graph.
(iii) For every pair of consecutive words w_i and w_j in a sentence of the original text T, draw an undirected edge between the corresponding nodes r_i and r_j.
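The three steps above can be sketched as follows. The input format (a list of tokenised sentences), the `representative` mapping, and the choice to skip self-loops when two consecutive words share a representative are assumptions of this sketch.

```python
def cooccurrence_graph(sentences, representative):
    """Build an undirected co-occurrence graph as a dict: node -> set of nodes.

    sentences: list of sentences, each a list of words.
    representative: dict mapping a word to its synset representative;
    words without an entry represent themselves.
    """
    graph = {}
    for sentence in sentences:
        # Step (i): replace every word by its representative.
        reps = [representative.get(word, word) for word in sentence]
        # Step (ii): every representative becomes a node.
        for r in reps:
            graph.setdefault(r, set())
        # Step (iii): connect consecutive words within the sentence.
        for left, right in zip(reps, reps[1:]):
            if left != right:  # assumed: no self-loops
                graph[left].add(right)
                graph[right].add(left)
    return graph


# Toy input; "compatriot" is mapped to the representative "citizen".
sentences = [["compatriot", "loves", "homeland"],
             ["citizen", "loves", "freedom"]]
graph = cooccurrence_graph(sentences, {"compatriot": "citizen"})
```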
Page2Vec Algorithm
• Significant nodes in the graph
• The highest PageRank scores
• Every text can be represented by a weighted sum of those unified vectors
• Weights are the corresponding PageRank scores
Page2Vec Algorithm
Input: text T
• Construct the co-occurrence graph of text T.
• Extract the top x = NumberOfWords representatives r_i based on their PageRank scores σ(r_i). Call this set $\triangle_T(x)$.
• Translate each representative word r_i to a vector $\vec{v_i}$, where the columns are 320 hypernyms and 80 categories.
• Vectorize the text T as follows:
$$\vec{T}(x) = \sum_{r_i \in \triangle_T(x)} \sigma(r_i)\, \vec{v_i}$$
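The final vectorization step is a PageRank-weighted sum of representative vectors. The sketch below assumes the scores and the per-representative vectors are already available as dictionaries; the two-dimensional toy vectors stand in for the 400 hypernym and category columns.

```python
def page2vec(scores, vectors, top_x):
    """Weighted sum of the vectors of the top_x highest-scoring representatives.

    scores: representative -> PageRank score.
    vectors: representative -> list of numbers, all of the same length.
    """
    top = sorted(scores, key=scores.get, reverse=True)[:top_x]
    dimension = len(next(iter(vectors.values())))
    text_vector = [0.0] * dimension
    for r in top:
        for k, value in enumerate(vectors[r]):
            text_vector[k] += scores[r] * value
    return text_vector


# Toy scores and vectors, for illustration only.
scores = {"republic": 0.5, "homeland": 0.3, "day": 0.2}
vectors = {"republic": [1, 0], "homeland": [0, 1], "day": [1, 1]}
text_vector = page2vec(scores, vectors, top_x=2)
```

Only the two highest-scoring representatives contribute here, so "day" is ignored even though it has a non-zero vector.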
K-means clustering over Page2Vec outputs
K-means clustering over Doc2Vec outputs
Hierarchical clustering over Page2Vec outputs
Hierarchical clustering over Doc2Vec outputs