LiCord: Language Independent Content Word Finder

Md-Mizanur Rahoman, Tetsuya Nasukawa, Hiroshi Kanayama & Ryutaro Ichise

April 18, 2016

Background

hundreds of languages are in use today, but only a few of them can be mined automatically, because NLP resources for the rest are scarce or missing

creating NLP resources for all languages is not feasible

a Content Word finding system can be considered a basic NLP resource for a language


Content Word

definition: Content Words [ref: American Heritage Dictionary]

are nouns, most verbs, adjectives, and adverbs that refer to some object, action, or characteristic

carry independent meaning

are usually open, i.e., new words can be added

example: “NO8DO” is the official motto of Seville.

usage: Content Words can be used for

(new) topic identification

document summarization

question answering, etc.


Problem & Possible Solution

problem: Content Word finding requires language-dependent NLP resources

language parser

parallel corpora, etc.

developing NLP resources for all languages is costly and “not feasible”

possible solution

morphological features of a text segment can indicate whether the segment is a Content Word

a machine learning model can classify text segments as Content Words

a big text corpus can generate balanced morphological features for such text segments


System Framework

the model generation has four processes:

NGram Constructor − perform text segmentation

Function Word Decider − identify Function Words among the segments

Feature Value Calculator − devise feature values for the segments

Classifier Learner − generate a classification model that decides whether segments are Content Words


System Framework




1. NGram Constructor

segment text and construct variable token (length) n-grams

calculate n-gram frequencies

Table: Variable length n-grams and their frequencies for an exemplary corpus T = “Japan is an Asian country. Japan is a peaceful country”.

n-grams and frequencies over T

size 1 n-gram (uni-gram)  {[Japan−2], [is−2], [an−1], ..., [country−2], [a−1], ...}
size 2 n-gram (bi-gram)   {[Japan is−2], [is an−1], ..., [Asian country−1], ...}
size 3 n-gram (tri-gram)  {[Japan is an−1], [is an Asian−1], [an Asian country−1], ...}
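The segmentation and counting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization, sentence boundaries at '.', '!' or '?', and n-grams that do not cross sentence boundaries. The function name count_ngrams and the max_n parameter are hypothetical.

```python
from collections import Counter
import re


def count_ngrams(text, max_n=3):
    """Count every n-gram of 1..max_n tokens inside each sentence."""
    counts = Counter()
    # Assumption: sentences end at '.', '!' or '?'; tokens are whitespace-separated.
    for sentence in re.split(r"[.!?]+", text):
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return counts


corpus = "Japan is an Asian country. Japan is a peaceful country"
frequencies = count_ngrams(corpus)
print(frequencies["Japan"], frequencies["Japan is"], frequencies["an Asian country"])
# prints: 2 2 1
```

The printed counts agree with the entries [Japan−2], [Japan is−2] and [an Asian country−1] in the table.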


System Framework




2. Function Word Decider

Function Words

express grammatical relationships with other words

have little lexical meaning or have ambiguous meaning

are frequent n-grams over a text document

example: “the”, “in”, “in spite of”, etc.

decide by

picking a threshold number of frequent n-grams

mapping frequent n-grams to available translations of known Function Words

use the threshold only if a translation service is not available

n-gram         # of tokens  frq      frq%
the            1            3124631  67.60
in             1            1774988  38.40
...            ...          ...      ...
united states  2            43698    0.94
...            ...          ...      ...
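One possible reading of the decision rule is sketched below: take the most frequent n-grams up to a threshold, then keep only those that match a translated list of known Function Words, falling back to the threshold alone when no translation is available. The function name decide_function_words, the top_k parameter, and the treatment of "mapping" as set intersection are assumptions made for illustration.

```python
from collections import Counter


def decide_function_words(frequencies, top_k, translated_function_words=None):
    """Return the n-grams treated as Function Words.

    frequencies: Counter mapping n-gram -> corpus frequency.
    top_k: hypothetical threshold on the number of most frequent n-grams.
    translated_function_words: known Function Words translated into the
        target language, or None when no translation service is available.
    """
    frequent = [ngram for ngram, _ in frequencies.most_common(top_k)]
    if translated_function_words is None:
        # Threshold-only fallback.
        return set(frequent)
    # Assumption: "mapping" means keeping frequent n-grams found in the translated list.
    return {ngram for ngram in frequent if ngram in translated_function_words}


frequencies = Counter({"the": 3124631, "in": 1774988, "united states": 43698})
threshold_only = decide_function_words(frequencies, top_k=2)
with_translation = decide_function_words(
    frequencies, top_k=3, translated_function_words={"the", "in", "of"}
)
```

With the three rows from the table, both calls keep "the" and "in"; the translated list is what filters out "united states" when the threshold is loosened to 3.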


System Framework




3. Feature Value Calculator

select fifteen different morphological features of text and calculate their values for n-grams over a big corpus

where the n-grams appear, i.e., beginning/middle/end part of the sentences

how frequently the n-grams appear in a corpus

how the n-grams appear together with Function Words, punctuation, etc.
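The slides do not enumerate the fifteen features, so the sketch below computes only a handful of illustrative ones in the spirit of the three groups listed: position in the sentence, corpus frequency, and adjacency to Function Words or punctuation. The feature names, the ngram_features function, and the assumption that punctuation marks are separate tokens are all hypothetical.

```python
import re


def ngram_features(ngram, sentences, function_words):
    """Compute a few illustrative features for one n-gram.

    sentences: list of token lists, with punctuation as separate tokens (assumption).
    Returns None when the n-gram never occurs.
    """
    target = ngram.split()
    n = len(target)
    begin = mid = end = near_function_word = near_punctuation = 0
    for tokens in sentences:
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] != target:
                continue
            stop = start + n
            if start == 0:
                begin += 1
            elif stop == len(tokens):
                end += 1
            else:
                mid += 1
            # Tokens immediately before and after this occurrence.
            neighbours = tokens[max(start - 1, 0):start] + tokens[stop:stop + 1]
            if any(t in function_words for t in neighbours):
                near_function_word += 1
            if any(re.fullmatch(r"[^\w\s]+", t) for t in neighbours):
                near_punctuation += 1
    total = begin + mid + end
    if total == 0:
        return None
    return {
        "frequency": total,
        "begin_ratio": begin / total,
        "mid_ratio": mid / total,
        "end_ratio": end / total,
        "near_function_word_ratio": near_function_word / total,
        "near_punctuation_ratio": near_punctuation / total,
    }


sentences = [
    ["Japan", "is", "an", "Asian", "country", "."],
    ["Japan", "is", "a", "peaceful", "country", "."],
]
japan = ngram_features("Japan", sentences, function_words={"is", "an", "a"})
country = ngram_features("country", sentences, function_words={"is", "an", "a"})
```

On this toy corpus "Japan" always starts a sentence and is always followed by a Function Word, while "country" is always followed by punctuation.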


System Framework




4. Classifier Learner (1/2)

construct frequency-range-wise classification models

Reason

training on all n-grams would consume a large amount of time

randomly picked n-grams would not represent the entire dataset

n-grams with the same frequency are assumed to share the same kind of morphological features (over the corpus)


4. Classifier Learner (2/2)

construct frequency-range-wise classification models

Method

collect range-based n-grams

X(i,j) = {x | x ∈ N ∧ i ≤ frq(x) ≤ j}

N = all n-grams in corpus, x = n-gram

select a threshold number of n-grams as training n-grams for each range

calculate features for each range-wise selected n-gram

learn a classification model from the training n-grams of each range
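The range-wise procedure can be outlined as follows. This is a sketch under assumptions: featurize, label and learn are hypothetical callables supplied by the caller, sample_size stands in for the "threshold number" of training n-grams, and sampling is uniform within each range. The stand-in learner in the usage example only counts its training examples; a real run would plug in an actual classifier.

```python
import random


def ngrams_in_range(frequencies, low, high):
    """X(i,j): every n-gram x with low <= frq(x) <= high, in sorted order."""
    return sorted(x for x, frq in frequencies.items() if low <= frq <= high)


def learn_range_models(frequencies, ranges, sample_size, featurize, label, learn, seed=0):
    """Learn one model per frequency range from a bounded sample of its n-grams."""
    rng = random.Random(seed)
    models = {}
    for low, high in ranges:
        candidates = ngrams_in_range(frequencies, low, high)
        # Cap the number of training n-grams per range.
        training = rng.sample(candidates, min(sample_size, len(candidates)))
        examples = [(featurize(x), label(x)) for x in training]
        models[(low, high)] = learn(examples)
    return models


frequencies = {"Japan": 2, "is": 2, "an": 1, "Asian": 1, "country": 2, "a": 1, "peaceful": 1}
models = learn_range_models(
    frequencies,
    ranges=[(1, 1), (2, 2)],
    sample_size=3,
    featurize=lambda x: [len(x)],        # toy feature: character length
    label=lambda x: x[0].isupper(),      # toy label, not the real +ve/-ve rule
    learn=lambda examples: len(examples) # stand-in learner: counts its examples
)
print(models)
# prints: {(1, 1): 3, (2, 2): 3}
```

Keeping one model per range means a test n-gram is classified by the model whose range contains its corpus frequency.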


Experiment

check whether LiCord can identify Content Words language-independently

analyzed languages − English, Vietnamese, and Indonesian

training resources used − Wikipedia Pages & Wikipedia Titles

+ve: when the n-gram (text segment) exists as a Wikipedia Title. E.g., Seville, official motto, etc.

-ve: otherwise. E.g., “NO8DO” is, is the, etc.
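The labeling rule amounts to a set-membership test against the title list. The sketch below assumes exact, case-sensitive matching and uses a two-entry toy set in place of the real Wikipedia Titles; the function name label_ngrams and the +1/-1 encoding are hypothetical.

```python
def label_ngrams(ngrams, wikipedia_titles):
    """Label each n-gram +1 if it matches a Wikipedia title exactly, else -1."""
    titles = set(wikipedia_titles)
    return {ngram: (1 if ngram in titles else -1) for ngram in ngrams}


# Toy stand-in for the real title list.
titles = {"Seville", "official motto"}
labels = label_ngrams(["Seville", "official motto", '"NO8DO" is', "is the"], titles)
print(labels)
# prints: {'Seville': 1, 'official motto': 1, '"NO8DO" is': -1, 'is the': -1}
```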

classification algorithms − Support Vector Machine and C4.5 (tree-based algorithm)


Language Independent Content Word Finding (1/2)

testing method − check whether test n-grams are Content Words

Table: CW finding accuracy %

Frequency Range  English  Indonesian  Vietnamese
(1,1)            76.68    90.56       90.30
(2,2)            83.00    93.20       94.15
(3,4)            84.37    94.23       94.76
(5,9)            83.87    95.89       93.97
(10,14)          87.09    96.15       94.95
Average          83.25    93.80       93.54


Language Independent Content Word Finding (2/2)

Table: Newly discovered Content Word finding accuracy %

Frequency Range  English  Indonesian  Vietnamese
(1,1)            27.90    11.34       10.63
(2,2)            45.00    18.54       25.00
(3,4)            52.11    24.45       27.56
(5,9)            50.34    25.56       30.88
(10,14)          61.90    29.89       35.13
Average          47.45    21.95       22.50

finding − checking a large number of sentences for their specific morphological features over a big corpus can generate a machine learning model that finds Content Words


Conclusion

language-independent Content Word finding is a requirement in present-day text mining

we propose a supervised Machine Learning technique to classify text segments as Content Words

experiment results show that the proposed method can serve as a Content Word finder


Question & Suggestion

Md-Mizanur Rahoman, [email protected]


Experiment 1 (1/2)

purpose − check whether LiCord can identify NEs (Named Entities) and act like a sentence parser

identifying NEs − executed for some test sentences, compared with Wikifier and Spotlight

Table: Comparison of LiCord with Wikifier

          Recall
Wikifier  33.33%
LiCord    90.47%

Table: Comparison of LiCord with Spotlight

           Recall
Spotlight  83.33%
LiCord     91.66%


Experiment 1 (2/2)

acting as parser − executed for some test sentences, compared with Stanford parser for Content Words

Table: Comparison of LiCord with Parser

Language  Recall
English   92.30%
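The slides do not define how recall was computed. Assuming the standard definition (the share of reference items that the system also returns), a comparison of this kind could be scored as below. The example sets are invented for illustration and are not the test sentences used in the experiment.

```python
def recall(predicted, gold):
    """Fraction of gold items that also appear among the predicted items."""
    gold = set(gold)
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(gold & set(predicted)) / len(gold)


# Hypothetical data: items found by a system vs. items in the reference.
predicted = {"Seville", "official motto", "is the"}
gold = {"Seville", "official motto", "NO8DO"}
print(f"{recall(predicted, gold):.2%}")
# prints: 66.67%
```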

finding − checking a large number of sentences for their specific morphological features over a big corpus can support word segmentation

