Semantic Analysis using Wikipedia Taxonomy



Slide 1: Creating a Taxonomy for Wikipedia

Patrick Nicolas, Feb 11, 2012
http://patricknicolas.blogspot.com
http://www.slideshare.net/pnicolas
https://github.com/prnicolas

Slide 2: Introduction

Note: definitions and notations are given in the appendices. The presentation assumes the reader has a basic knowledge of information retrieval, Natural Language Processing, and Machine Learning.

The goal of this study is to build a taxonomy graph for the 3+ million Wikipedia entries by leveraging the WordNet hypernyms as a training set.

This model can be used in a wide variety of commercial applications, from context extraction and automated Wiki classification to text summarization.

Slide 3: Process

The computation flow for generating the Wikipedia taxonomy is summarized in the following five steps:

1. Extract abstracts & categories from the Wikipedia datasets.

2. Generate the hypernym lineages for the Wikipedia entries that overlap with WordNet synsets.

3. Extract, reduce, and order the N-Grams and their tags (NNP, NN, ...) from each Wikipedia abstract.

4. Create a training set of weighted graphs from each Wikipedia abstract that has a corresponding hypernym hierarchy.

5. Optimize and apply the model to generate taxonomy lineages for each Wikipedia entry.

Slide 4: Semantic Data Sources

WordNet hypernyms: the WordNet database of synsets is used to generate the hierarchy of hypernyms, e.g. entity/physical entity/object/location/region/district/country/European country/Italy.

Wikipedia datasets: the entry (label), long abstract, and categories are extracted from the Wikipedia reference database.

Term frequency corpora: the Reuters corpus and Google N-Grams frequencies are used to compute the inverse document frequency values.

Slide 5: N-Grams Extraction Model

The relevancy (or weight) of an N-Gram to the context of a document depends on syntactic, semantic, and probabilistic features:

- frequency of the N-Gram in the document (fD)
- frequency of the N-Gram in the universe/corpus (idf)
- the N-Gram tag
- whether the N-Gram is contained in the 1st sentence
- similarity of the N-Gram with the categories
- frequency of the N-Gram in the categories' abstracts
- frequency of the constituent terms
- whether the N-Gram has a semantic definition

Fig. 1: Features of the N-Gram extraction model, for an N-Gram composed of Term 1 through Term n.


Slide 6: Computation Flow

The computation flow is broken down into plug & play processing units to enable the design of experiments and auditing.

Fig. 2: Typical computation flow for the generation of the taxonomy. The Wikipedia datasets provide the abstract, label, and categories; the WordNet synsets provide the hypernyms and labeled lineages; the N-Grams corpus provides the idf values. The abstract's N-Grams are weighted (frequency, N-Gram tags, semantic match), normalized, and fed with the taxonomy graph into the trained model.

Slide 7: N-Grams Frequency Analysis

The inverse document frequency (idf) is computed as follows.
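Assuming the conventional definition (the extracted text does not carry the slide's formula), with N the number of documents in the corpus and n_w the number of documents containing the term w:

$$ \mathrm{idf}(w) = \log \frac{N}{n_w} $$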

Let us define an N-Gram w(n) (e.g. w(3) for a 3-Gram). The frequency of the N-Gram within the corpus C is expressed as follows.
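A plausible relative-frequency form, assuming count_C denotes occurrence counts in the corpus:

$$ f_C\left(w^{(n)}\right) = \frac{\mathrm{count}_C\left(w^{(n)}\right)}{\sum_{v \in C} \mathrm{count}_C(v)} $$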

Let w(n) be an N-Gram with frequency count(w(n)), composed of terms wj, j = 1..n, each with frequency count(wj) within a document D. The frequency of the N-Gram within the document is computed as follows.
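One plausible form consistent with these definitions, normalizing the N-Gram count by the average count of its constituent terms (this exact formula is an assumption, not taken from the slide):

$$ f_D\left(w^{(n)}\right) = \frac{\mathrm{count}_D\left(w^{(n)}\right)}{\frac{1}{n} \sum_{j=1}^{n} \mathrm{count}_D\left(w_j\right)} $$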


Slide 8: Weighting N-Grams

Most Wikipedia concepts are well described in the first sentence of their abstract, so a greater weight can be attributed to N-Grams contained in the first sentence; the frequency f1D of an N-Gram in the 1st sentence of a document is defined for this purpose. A simple regression analysis showed that a square-root function provides a more accurate contribution (weight) of an N-Gram to a document D.
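A hedged sketch of such a weighting, assuming the square root is applied to the combined document and first-sentence frequencies and scaled by the idf (the exact functional form is not recoverable from the text):

$$ w_D\left(w^{(n)}\right) \propto \sqrt{f_D\left(w^{(n)}\right) + f_{1D}\left(w^{(n)}\right)} \cdot \mathrm{idf}\left(w^{(n)}\right) $$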


Slide 9: Tagging N-Grams

Although Conditional Random Fields are the predominant discriminative classifiers for predicting sentence boundaries and token tags, we found that a Maximum Entropy model with binary features was more appropriate for classifying the first term of a sentence (NNP or NN).

The model's feature functions ft(w) → {0,1} are extracted by maximizing the entropy H(p) of the probability that a word w has a specific tag t,

subject to the constraints below.
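In the standard maximum-entropy formulation (with \tilde{p} the empirical distribution over the training data), the objective is

$$ H(p) = -\sum_{w,t} \tilde{p}(w)\, p(t \mid w) \log p(t \mid w) $$

subject to normalization and to matching the empirical feature expectations:

$$ \sum_t p(t \mid w) = 1, \qquad \sum_{w,t} \tilde{p}(w)\, p(t \mid w)\, f_t(w) = \sum_{w,t} \tilde{p}(w,t)\, f_t(w) $$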

Slide 10: Wikipedia Tags Distribution

We extract the tags of Wikipedia entries (1- to 4-Grams) in the context of their abstracts. The distribution of tag frequencies shows that proper nouns (NNP tags) are the predominant tags. This frequency distribution is used as the prior probability of finding a Wikipedia entry with a specific tag.


Slide 11: Tag Predictive Model

We use a multinomial Naïve Bayes classifier to predict the tag of any given Wikipedia entry.

Let us define a set of classes Ck = { w(n) | tag(w(n)) = k } of Wikipedia entries with a specific tag (e.g. CNNP, CNN), and p(t|Ck), the prior probability that a tag t belongs to a class.

The likelihood that a given Wikipedia entry has tag k is computed as follows.
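A standard multinomial Naïve Bayes form, assuming the tags tj of the constituent terms are taken as the features:

$$ p\left(C_k \mid w^{(n)}\right) \propto p(C_k) \prod_{j=1}^{n} p\left(t_j \mid C_k\right) $$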


Slide 12: Taxonomy Weighted Graph

Let us define:

- a taxonomy class (or taxon) as a graph node representing a hypernym (e.g. class = person);

- a taxonomy instance as an entity name (e.g. instance = Peter, as in Peter IS-A person);

- a taxonomy lineage as the list of ancestors (hypernyms) of an instance.

Fig.: Example of a taxonomy lineage

Slide 13: Document Taxonomy

Any document can be represented as a weighted graph of taxonomy classes and instances.

Fig.: Example of a taxonomy graph

Slide 14: Propagation Rule for Taxonomy Weights

The flow model is applied to the taxonomy weighted graph to compute the weight of each taxonomy class from the normalized weights of the semantic N-Grams. The weights of the taxonomy classes are normalized so that the root entity has weight 1. The taxonomy instances (N-Grams) are ordered and normalized by their respective weights wk(n).

Fig.: Weight propagation in the taxonomy graph
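A minimal sketch of one plausible propagation rule, summing instance (N-Gram) weights over their ancestors and normalizing at the root; the function name, the dict-based graph encoding, and the example data are illustrative assumptions, not the deck's implementation.

from collections import defaultdict

def propagate_weights(lineages, instance_weights, root="entity"):
    """Propagate normalized instance weights up their hypernym lineages.

    lineages: dict mapping an instance to its ancestor classes,
              ordered from root to parent (hypothetical encoding).
    instance_weights: dict mapping an instance to its normalized weight.
    Returns taxonomy-class weights normalized so the root has weight 1.
    """
    class_weights = defaultdict(float)
    for instance, lineage in lineages.items():
        w = instance_weights.get(instance, 0.0)
        for taxon in lineage:            # every ancestor accumulates the weight
            class_weights[taxon] += w
    root_weight = class_weights.get(root, 0.0)
    if root_weight > 0.0:
        for taxon in class_weights:      # normalize so that weight(root) = 1
            class_weights[taxon] /= root_weight
    return dict(class_weights)

# Example with two instances sharing part of their lineage.
lineages = {
    "Italy": ["entity", "location", "region", "country"],
    "Rome":  ["entity", "location", "region", "city"],
}
weights = {"Italy": 0.7, "Rome": 0.3}
print(propagate_weights(lineages, weights))
# {'entity': 1.0, 'location': 1.0, 'region': 1.0, 'country': 0.7, 'city': 0.3}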

Slide 15: Normalized Taxonomy Weight in Wikipedia

We analyze the distribution of weights along the taxonomy lineage for all Wikipedia entries.

Slide 16: Lineage Weights Estimator

Training on the initial set of WordNet hypernyms shows that the distribution of normalized weights ωk along the taxonomy lineage, for a specific similarity class C, can be approximated with a polynomial function (spline).

This estimator is used in the classification of the taxonomy lineages of a Wikipedia abstract.
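A minimal sketch of fitting such a polynomial estimator with NumPy; the sample depths and weights are hypothetical, not values from the study.

import numpy as np

# Hypothetical normalized weights observed along lineages of one
# similarity class; depths are normalized positions in [0, 1].
depths  = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
weights = np.array([1.0, 0.82, 0.61, 0.37, 0.18])

# Fit a low-degree polynomial as the lineage-weight estimator.
coefficients = np.polyfit(depths, weights, deg=3)
estimator = np.poly1d(coefficients)

# Expected normalized weight at an arbitrary depth along a lineage.
print(estimator(0.6))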

Slide 17: Similarity Metrics

In order to train a model using labeled WordNet hypernyms, a similarity (or distance) metric needs to be defined. Let us consider two taxonomy lineages Vj and Vk of respective lengths n(j) and n(k):

- Cosine distance
- Shortest-path distance
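Hedged sketches of the two metrics, assuming the lineages are compared as sets of taxonomy nodes sharing a common prefix; both are standard forms and the deck's exact definitions may differ. Treating each lineage as a binary vector over taxonomy nodes, the cosine similarity is

$$ \mathrm{sim}_{\cos}\left(V_j, V_k\right) = \frac{\left|V_j \cap V_k\right|}{\sqrt{n(j)\, n(k)}} $$

and the shortest-path distance counts the edges between the two leaves through their deepest common ancestor:

$$ d_{sp}\left(V_j, V_k\right) = n(j) + n(k) - 2\left|V_j \cap V_k\right| $$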


Slide 18: Taxonomy Generation Model

Let us consider m classes of taxonomy-lineage similarity relative to a labeled lineage VH; a class Ci groups the lineages whose similarity to VH falls within a given range. A taxonomy lineage Vj is then classified using Naïve Bayes.
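A hedged sketch of the classification rule, assuming the normalized lineage weights ωk serve as features and the spline estimator of Slide 16 provides the class-conditional likelihoods:

$$ \hat{c}\left(V_j\right) = \arg\max_{1 \le i \le m} \; p(C_i) \prod_{k} p\left(\omega_k \mid C_i\right) $$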


Slide 19: Appendix (Notation)


Slide 20: Appendix (References)

- C. Manning, P. Raghavan, H. Schütze, "Introduction to Information Retrieval", Cambridge University Press.
- T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning", Springer.
- R. Snow, D. Jurafsky, A. Ng, "Semantic Taxonomy Induction from Heterogeneous Evidence".
- A. Toral, O. Fernandez, E. Agirre, R. Muñoz, "A Study on Linking Wikipedia Categories to WordNet Synsets using Text Similarity".
- Y. Mroueh, T. Poggio, L. Rosasco, "Regularization Predicts While Discovering Taxonomy".
- M. Tao, "Natural Language Semantics Term Project".
- A. Berger, V. Della Pietra, S. Della Pietra, "A Maximum Entropy Approach to Natural Language Processing".