Learning the Semantic Meaning of a Concept from the Web

Yang Yu
Master’s Thesis Defense
August 03, 2006

The Problem

Manually preparing training data for text-classification-based ontology mapping is expensive.

LIVING_THINGS
  ANIMAL
    HUMAN
      MAN
      WOMAN
    CAT
  PLANT
    TREE
      ARBOR
      FRUTEX
    GRASS

The Thesis

Automatically collecting training data for the concept defined in an ontology.

Benefits
  Reduce the amount of human work
  Fully automated ontology mapping

http://www.google.com/

Overview

Background
  The Semantic Web and ontology
  Ontology mapping
Proposal
System
Experimental results
  WEAPONS ontology
  LIVING_THINGS ontology
Discussion and conclusion

Semantic Web and Ontology

What is it?
  “an extension of the current web”

An Example
  Find all types of jets that are made in the USA

[Figure: example graph relating USA and WA through the relations partOf and Made-in]

Ontology Mapping

Interoperability problem
  Independently developed ontologies for the same or overlapping domains
Mapping
  $r = f(C_i, C_j)$, where $i = 1, \dots, n$ and $j = 1, \dots, m$
  $r \in \{\text{equivalent}, \text{subClassOf}, \text{superClassOf}, \text{complement}, \text{overlapped}, \text{other}\}$

Approaches to Ontology Mapping

Manual mapping
String matching
Text classification
  The semantic meaning of a concept is reflected in the training data that use the concept
  Probabilistic feature model
  Classification
  Results highly depend on training data

Motivation

Preparing exemplars manually is costly

Billions of documents available on the web
Search engines

The Proposal

Using the concept defined in an ontology as a query and processing the search results to obtain exemplars

Verification
  Build a prototype system
  Check ontology mapping results

System overview – Part I

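[Figure: system diagram with the components Parser, Search Engine, Retriever, and Processor. Ontology A is parsed into Queries; the Search Engine returns Links to Web Pages; the Retriever fetches HTML Docs from the WWW; the Processor turns them into Text Files]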

The parser (Query expansion)

[Figure: FOOD has the subclass FRUIT, which has the subclasses APPLE and ORANGE; the query for APPLE is FOOD+FRUIT+APPLE]

Queries                              Concepts
living+things                        living+things
living+things+animal                 animal
living+things+plant                  plant
living+things+animal+cat             cat
living+things+animal+human           human
living+things+animal+human+man       man
living+things+animal+human+woman     woman
living+things+plant+tree             tree
living+things+plant+grass            grass
living+things+plant+tree+Frutex      frutex
living+things+plant+tree+arbor       arbor
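
The query expansion in the table above can be sketched in a few lines: each concept's query is the chain of class names from the root down to the concept, joined with `+`. This is only an illustration, not the thesis's parser. The `PARENT` dictionary and the function name `expand_query` are hypothetical, and the hierarchy is hard-coded here rather than parsed from an ontology file.

```python
from typing import Dict, List

# Child -> parent links for the LIVING_THINGS example ontology.
# Hard-coded for illustration; a real parser would read these from the ontology.
PARENT: Dict[str, str] = {
    "animal": "living things",
    "plant": "living things",
    "human": "animal",
    "cat": "animal",
    "man": "human",
    "woman": "human",
    "tree": "plant",
    "grass": "plant",
    "arbor": "tree",
    "frutex": "tree",
}


def expand_query(concept: str, parent: Dict[str, str] = PARENT) -> str:
    """Join the words on the path from the root down to `concept` with '+'."""
    path: List[str] = [concept]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    # Reverse to root-first order, then split multi-word names into single words.
    words = [word for name in reversed(path) for word in name.split()]
    return "+".join(words)


print(expand_query("woman"))  # living+things+animal+human+woman
```

Including the ancestors in the query is what separates, for example, "arbor" as a kind of tree from other uses of the word.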

The retriever

The processor

Naïve Bayes text classifier

Bow toolkit
  McCallum, Andrew Kachites, Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, http://www.cs.cmu.edu/~mccallum/bow, 1996.
  rainbow -d model --index dir/*
  rainbow -d model --query
Bayes Rule
Naïve Bayes text classifier

Bayes Rule

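$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Here $P(A \mid B)$ is the posterior, $P(A)$ is the prior, and $P(B)$ is the normalizing constant.

[Figure: Venn diagram of events A and B, with the overlap labeled P(A, B)]

$$P(B \mid A) = \frac{P(A, B)}{P(A)}, \qquad P(A \mid B) = \frac{P(A, B)}{P(B)}$$

Mitchell, Tom, Machine Learning, McGraw Hill, 1997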

Naïve Bayes classifier

A text classification problem
  “What’s the most probable classification of the new instance given the training data?”

$v_j$: category $j$. $(a_1, a_2, \dots, a_n)$: attributes of a new document

So Naïve

(Mitchell, Tom, Machine Learning, McGraw Hill, 1997)
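
The slide's formula did not survive in this transcript. For reference, the standard formulation in the cited textbook (Mitchell, 1997) picks the most probable category given the attributes and then applies the "naïve" assumption that attributes are conditionally independent given the category. The set $V$ of all categories is notation added here; $v_j$ and $a_i$ are as defined on the slide.

```latex
\begin{align}
v_{MAP} &= \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \dots, a_n) \\
        &= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \dots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \dots, a_n)} \\
        &= \arg\max_{v_j \in V} P(a_1, a_2, \dots, a_n \mid v_j)\, P(v_j) \\
v_{NB}  &= \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)
\end{align}
```

The denominator is dropped in the third line because it does not depend on $v_j$. The last line is where the independence assumption replaces the joint likelihood with a product of per-attribute terms.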

System overview– Part II

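[Figure: system diagram with the components Model Builder and Calculator, both using Rainbow. Inputs are Ontology A, Ontology B, Text Files (A), and Text Files (B); the Model Builder produces the Feature Model and the Calculator produces the Mapping Results]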

The model builder

[Figure: the LIVING_THINGS ontology tree, shown twice]

Mutually exclusive and exhaustive
  Leaf classes
  C+ and C-
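
One reading of the "leaf classes" option is that the classifier is trained only on the leaves of the ontology tree, since leaves do not overlap one another. The sketch below collects them for the LIVING_THINGS example. The `CHILDREN` dictionary and the function name `leaf_classes` are hypothetical, and this is not taken from the thesis's model builder.

```python
from typing import Dict, List

# Parent -> children links for the LIVING_THINGS example ontology
# (hard-coded for illustration).
CHILDREN: Dict[str, List[str]] = {
    "LIVING_THINGS": ["ANIMAL", "PLANT"],
    "ANIMAL": ["HUMAN", "CAT"],
    "PLANT": ["TREE", "GRASS"],
    "HUMAN": ["MAN", "WOMAN"],
    "TREE": ["ARBOR", "FRUTEX"],
}


def leaf_classes(root: str, children: Dict[str, List[str]] = CHILDREN) -> List[str]:
    """Return the classes under `root` that have no subclasses, depth-first."""
    kids = children.get(root, [])
    if not kids:
        return [root]
    leaves: List[str] = []
    for kid in kids:
        leaves.extend(leaf_classes(kid, children))
    return leaves


print(leaf_classes("LIVING_THINGS"))
# ['MAN', 'WOMAN', 'CAT', 'ARBOR', 'FRUTEX', 'GRASS']
```

Leaves are exhaustive only if every instance is assumed to belong to some leaf, which is an assumption of this sketch rather than something the slide states.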

The calculator

Naïve Bayes text classifier tends to give extreme values (1/0)

Tasks
  Feed exemplars to the classifier one by one
  Keep records of classification results
  Take averages and generate report

An Example of the Calculator

[Figure: 200 exemplars of APC are fed to the Classifier, which assigns each one to a category in WeaponsA.n3]

Categories in WeaponsA.n3     Num. of exemplars
TANK-VEHICLE                  170
AIR-DEFENSE-GUN               20
SAUDI-NAVAL-MISSILE-CRAFT     10

P(TANK-VEHICLE | APC) = 170 / 200 = 0.85
P(AIR-DEFENSE-GUN | APC) = 0.10
P(SAUDI-NAVAL-MISSILE-CRAFT | APC) = 0.05
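
The arithmetic in this example can be reproduced with a short sketch: record the category the classifier assigns to each exemplar, then divide each category's count by the total. The function name `conditional_probabilities` is hypothetical, and the list of predictions is built directly from the counts on the slide rather than from a real classifier run.

```python
from collections import Counter
from typing import Dict, Iterable


def conditional_probabilities(predictions: Iterable[str]) -> Dict[str, float]:
    """Estimate P(category | new class) as the share of exemplars assigned to each category."""
    counts = Counter(predictions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one classified exemplar is required")
    return {category: n / total for category, n in counts.items()}


# One predicted category per APC exemplar, using the counts from the slide.
predictions = (
    ["TANK-VEHICLE"] * 170
    + ["AIR-DEFENSE-GUN"] * 20
    + ["SAUDI-NAVAL-MISSILE-CRAFT"] * 10
)
probs = conditional_probabilities(predictions)
print(probs["TANK-VEHICLE"])  # 0.85
```

Counting hard decisions over many exemplars sidesteps the extreme 1/0 scores that a Naïve Bayes classifier tends to give for any single document.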

Experiments with WEAPONS ontology

Information Interpretation and Integration Conference (http://www.atl.lmco.com/projects/ontology/i3con.html)
WeaponsA.n3 and WeaponsB.n3
  Both have over 80 classes defined
  More than 60 classes are leaf classes
  Similar structure

Part of WeaponsA.n3

[Figure: part of the WeaponsA.n3 class hierarchy, showing WEAPON, CONVENTIONAL-WEAPON, ARMORED-COMBAT-VEHICLE, MODERN-NAVAL-SHIP, WARPLANE, TANK-VEHICLE, PATROL-CRAFT, AIRCRAFT-CARRIER, and SUPER-ETENDARD]

Part of WeaponsB.n3

[Figure: part of the WeaponsB.n3 class hierarchy, showing WEAPON, CONVENTIONAL-WEAPON, ARMORED-COMBAT-VEHICLE, MODERN-NAVAL-SHIP, WARPLANE, TANK-VEHICLE, LIGHT-TANK, APC, PATROL-WATERCRAFT, PATROL-BOAT-RIVER, PATROL-BOAT, AIRCRAFT-CARRIER, LIGHT-AIRCRAFT-CARRIER, FIGHTER-PLANE, FIGHTER-ATTACK-PLANE, and SUPER-ETENDARD-FIGHTER]

Expected Results

[Figure: expected mappings from the new WeaponsB.n3 classes (LIGHT-TANK, APC, PATROL-WATERCRAFT, PATROL-BOAT-RIVER, PATROL-BOAT, LIGHT-AIRCRAFT-CARRIER, FIGHTER-PLANE, FIGHTER-ATTACK-PLANE, SUPER-ETENDARD-FIGHTER) to the WeaponsA.n3 classes TANK-VEHICLE, PATROL-CRAFT, AIRCRAFT-CARRIER, and SUPER-ETENDARD]

A Typical Report

APC

SELF-PROPELLED-ARTILLERY   0.357180681
TANK-VEHICLE               0.277139274
ICBM                       0.10423636
MRBM                       0.080615147
TOWED-ARTILLERY            0.054724102
SUPPORT-VESSEL             0.023265054
PATROL-CRAFT               0.019570325
MOLOTOV-COCKTAIL           0.015032411
TORPEDO-CRAFT              0.013677696
SUPER-ETENDARD             0.009856519
MORTAR                     0.00772997
AIR-DEFENSE-GUN            0.002997109
MACHINE-GUN                0.000211772
MOLOTOV-COCKTAIL           0.000187578
TRUCK-BOMB                 0.000171675
AS-9-KYLE-ALCM             0.000156403
ARABIL-100-MISSILE         0.000111953
AL-HIJARAH-MISSILE         7.65E-05
OGHAB-MISSILE              7.12E-05
BADAR-2000                 4.28E-05
......                     ......

P(Ci | APC), where i = 1, …, 63

Classes with highest conditional probability

New Classes              Whole file             Prob   Sentences with Keywords    Prob
LIGHT-AIRCRAFT-CARRIER   AIRCRAFT-CARRIER       0.65   AIRCRAFT-CARRIER           0.57
APC                      SILKWORM-MISSILE-MOD   0.46   SELF-PROPELLED-ARTILLERY   0.36
SUPER-ETENDARD-FIGHTER   SILKWORM-MISSILE-MOD   0.66   MRBM                       0.51
FIGHTER-ATTACK-PLANE     SILKWORM-MISSILE-MOD   0.83   MRBM                       0.38
PATROL-WATERCRAFT        SILKWORM-MISSILE-MOD   0.28   PATROL-CRAFT               0.52
PATROL-BOAT-RIVER        SILKWORM-MISSILE-MOD   0.65   PATROL-CRAFT               0.54
PATROL-BOAT              SILKWORM-MISSILE-MOD   0.51   PATROL-CRAFT               0.66
LIGHT-TANK               SILKWORM-MISSILE-MOD   0.56   TANK-VEHICLE               0.3
FIGHTER-PLANE            AIRCRAFT-CARRIER       0.49   MRBM                       0.38

P(TANK-VEHICLE | APC) = 0.28
P(SUPER-ETENDARD | SUPER-ETENDARD-FIGHTER) = 0.21
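
The table above compares two kinds of exemplar: the whole retrieved file, and only the sentences that contain keywords. The slides do not say how the keywords were chosen or how sentences were split, so the sketch below is an assumption throughout: it treats the supplied list as the keywords, splits on sentence-ending punctuation, and keeps any sentence that contains at least one keyword as a whole word. The function name `keyword_sentences` is hypothetical.

```python
import re
from typing import List


def keyword_sentences(text: str, keywords: List[str]) -> str:
    """Keep only the sentences of `text` that contain at least one keyword."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wanted = {keyword.lower() for keyword in keywords}
    kept = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if tokens & wanted:  # at least one keyword appears as a whole word
            kept.append(sentence)
    return " ".join(kept)


page = "The cat sat on the mat. Stocks fell today. A woman fed the cat."
print(keyword_sentences(page, ["cat"]))
# The cat sat on the mat. A woman fed the cat.
```

Dropping sentences without keywords removes navigation text and unrelated passages, which may be one reason the sentence-based exemplars map better than whole files in these experiments.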

Different numbers of exemplars (whole)

New Classes              Group-whole-50         Prob   Group-whole-100        Prob
LIGHT-AIRCRAFT-CARRIER   SILKWORM-MISSILE-MOD   0.60   AIRCRAFT-CARRIER       0.65
APC                      SILKWORM-MISSILE-MOD   0.65   SILKWORM-MISSILE-MOD   0.46
SUPER-ETENDARD-FIGHTER   SILKWORM-MISSILE-MOD   0.74   SILKWORM-MISSILE-MOD   0.66
FIGHTER-ATTACK-PLANE     SILKWORM-MISSILE-MOD   0.83   SILKWORM-MISSILE-MOD   0.83
PATROL-WATERCRAFT        SILKWORM-MISSILE-MOD   0.64   SILKWORM-MISSILE-MOD   0.28
PATROL-BOAT-RIVER        SILKWORM-MISSILE-MOD   0.89   SILKWORM-MISSILE-MOD   0.65
PATROL-BOAT              SILKWORM-MISSILE-MOD   0.64   SILKWORM-MISSILE-MOD   0.51
LIGHT-TANK               SILKWORM-MISSILE-MOD   0.62   SILKWORM-MISSILE-MOD   0.56
FIGHTER-PLANE            SILKWORM-MISSILE-MOD   0.80   AIRCRAFT-CARRIER       0.49

Different numbers of exemplars (sentence)

New Classes              Group-sentence-50    Prob   Group-sentence-100         Prob
LIGHT-AIRCRAFT-CARRIER   AIRCRAFT-CARRIER     0.44   AIRCRAFT-CARRIER           0.57
APC                      TANK-VEHICLE         0.54   SELF-PROPELLED-ARTILLERY   0.36
SUPER-ETENDARD-FIGHTER   HY-4-C-201-MISSILE   0.4    MRBM                       0.51
FIGHTER-ATTACK-PLANE     ICBM                 0.19   MRBM                       0.38
PATROL-WATERCRAFT        PATROL-CRAFT         0.49   PATROL-CRAFT               0.52
PATROL-BOAT-RIVER        PATROL-CRAFT         0.36   PATROL-CRAFT               0.54
PATROL-BOAT              PATROL-CRAFT         0.37   PATROL-CRAFT               0.66
LIGHT-TANK               TANK-VEHICLE         0.59   TANK-VEHICLE               0.3
FIGHTER-PLANE            MRBM                 0.38   MRBM                       0.38

Comparison of mapping accuracy of different groups of experiments

Groups of experiments   Mapping accuracy judged by desired class mapped
Group-whole-50          0%
Group-whole-100         11%
Group-sentence-50       67%
Group-sentence-100      56%

Higher Conditional Probability
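
The accuracy figures appear to be the share of the nine new classes whose top-ranked class is the desired one. The sketch below checks that reading against the Group-sentence-50 column. The `EXPECTED` mapping is an assumption: it is inferred from the Expected Results slide and from the two probabilities quoted earlier, and the targets for FIGHTER-PLANE and FIGHTER-ATTACK-PLANE in particular are guesses. The function name `mapping_accuracy` is hypothetical.

```python
from typing import Dict

# ASSUMED desired mappings (new WeaponsB.n3 class -> WeaponsA.n3 class),
# inferred from the slides rather than stated in them.
EXPECTED: Dict[str, str] = {
    "LIGHT-AIRCRAFT-CARRIER": "AIRCRAFT-CARRIER",
    "APC": "TANK-VEHICLE",
    "LIGHT-TANK": "TANK-VEHICLE",
    "PATROL-BOAT": "PATROL-CRAFT",
    "PATROL-BOAT-RIVER": "PATROL-CRAFT",
    "PATROL-WATERCRAFT": "PATROL-CRAFT",
    "FIGHTER-PLANE": "SUPER-ETENDARD",
    "FIGHTER-ATTACK-PLANE": "SUPER-ETENDARD",
    "SUPER-ETENDARD-FIGHTER": "SUPER-ETENDARD",
}

# Top-ranked class per new class, copied from the Group-sentence-50 column.
SENTENCE_50: Dict[str, str] = {
    "LIGHT-AIRCRAFT-CARRIER": "AIRCRAFT-CARRIER",
    "APC": "TANK-VEHICLE",
    "LIGHT-TANK": "TANK-VEHICLE",
    "PATROL-BOAT": "PATROL-CRAFT",
    "PATROL-BOAT-RIVER": "PATROL-CRAFT",
    "PATROL-WATERCRAFT": "PATROL-CRAFT",
    "FIGHTER-PLANE": "MRBM",
    "FIGHTER-ATTACK-PLANE": "ICBM",
    "SUPER-ETENDARD-FIGHTER": "HY-4-C-201-MISSILE",
}


def mapping_accuracy(predicted: Dict[str, str], expected: Dict[str, str]) -> float:
    """Fraction of new classes whose top-ranked class equals the desired class."""
    hits = sum(1 for cls, target in expected.items() if predicted.get(cls) == target)
    return hits / len(expected)


print(f"{mapping_accuracy(SENTENCE_50, EXPECTED):.0%}")  # 67%
```

Under these assumed targets, six of the nine classes are mapped as desired, which matches the 67% reported for Group-sentence-50.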

Experiment with LIVING_THINGS ontology

(1) P(MAN | HUMAN), P(WOMAN | HUMAN)
(2) Find a mapping for GIRL

[Figure: the LIVING_THINGS ontology tree with Level1 to Level3 marked and the new class GIRL; an inset shows HUMAN with subclasses MAN and WOMAN]

Actual Experiment Results: L-1

Conditional Probability   Using first 50 exemplars   Using first 100 exemplars   Using first 200 exemplars
P(MAN | HUMAN)            0.75                       0.58                        0.62
P(WOMAN | HUMAN)          0.24                       0.41                        0.38

[Figure: HUMAN with subclasses MAN and WOMAN]

Results of experiment (1)

Actual Experiment Results: L-2

[Figure: the LIVING_THINGS ontology tree with Level1 to Level3 marked and the new class GIRL]

P(ANIMAL | GIRL)       0.76
P(PLANT | GIRL)        0.23
P(HUMAN | GIRL)        0.70
P(CAT | GIRL)          0.30
P(MAN | GIRL)          0
P(WOMAN | GIRL)        1

With additional classes
P(DOG | GIRL)          0.56
P(CAT | GIRL)          0.01
P(HUMAN | GIRL)        0.43
P(PYCNOGONID | GIRL)   0

P(ANIMAL | GIRL)       0.83
P(PLANT | GIRL)        0.17
P(HUMAN | GIRL)        0.92
P(CAT | GIRL)          0.08
P(WOMAN | GIRL)        0.63
P(MAN | GIRL)          0.37

With clustering on exemplars    Without clustering on exemplars

Actual Experiment Results: L-3

Comparison between different numbers of exemplars (sentence)

Conditional Probability   Using first 50 exemplars   Using first 100 exemplars   Using first 200 exemplars
P(ANIMAL | GIRL)          0.66                       0.53                        0.77
P(PLANT | GIRL)           0.34                       0.47                        0.23
P(HUMAN | GIRL)           0.86                       0.56                        0.43
P(CAT | GIRL)             0.01                       0.15                        0.01
P(DOG | GIRL)             0.13                       0.29                        0.56
P(PYCNOGONID | GIRL)      0                          0                           0
P(MAN | GIRL)             0.02                       0.03                        0
P(WOMAN | GIRL)           0.98                       0.97                        1

Actual Experiment Results: Different Queries

Queries augmented with class properties

Queries                                                              Concepts
Living+things                                                        living+things
Living+things+animal+Animalia                                        animal
Living+things+plant+Plantae                                          plant
Living+things+animal+Animalia+cat+Felidae                            cat
Living+things+animal+Animalia+human+intelligent                      human
Living+things+animal+Animalia+human+intelligent+man+male             man
Living+things+animal+Animalia+human+intelligent+woman+female         woman
Living+things+plant+Plantae+tree                                     tree
Living+things+plant+Plantae+grass                                    grass
Living+things+plant+Plantae+tree+Frutex                              frutex
Living+things+plant+Plantae+tree+arbor                               arbor

Actual Experiment Results: L-4

Results of experiment (1) with new queries

Conditional Probability   Whole   Keyword Sentences
P(MAN | HUMAN)            0.91    0.93
P(WOMAN | HUMAN)          0.09    0.07

Results of experiment (2) with new queries

Conditional Probability   Whole   Keyword Sentences
P(ANIMAL | GIRL)          0.9     0.83
P(PLANT | GIRL)           0.1     0.17
P(HUMAN | GIRL)           0.78    0.83
P(CAT | GIRL)             0.22    0.17
P(MAN | GIRL)             0.14    0.16
P(WOMAN | GIRL)           0.86    0.84

[Figure: the LIVING_THINGS ontology tree with Level1 to Level3 marked and the new class GIRL; an inset shows HUMAN with subclasses MAN and WOMAN]

Limitation 1: An exemplar is not a sample of a concept

An exemplar is a combination of strings that represents some usage of a concept.
An exemplar is not an instance of a concept.
The way we calculate conditional probability is an estimate.

[Figure: HUMAN with subclasses MAN and WOMAN]

Limitation 2: Popularity does not equal relevancy

Limited by a search engine’s algorithm
  PageRank™
  Popularity does not equal relevancy
Weight cannot be specified for words in a search query

Limitation 3: Relevancy does not equal similarity

[Figure: nested regions. The search results for concept A contain text related to concept A, which divides into text for concept A (i.e. desired exemplars), text against concept A, and text for related concept B]

Related Research

UMBC OntoMapper
  Sushama Prasad, Peng Yun and Finin Tim, A Tool for Mapping between Two Ontologies Using Explicit Information, AAMAS 2002 Workshop on Ontologies and Agent Systems, 2002.
CAIMEN
  Lacher S. Martin and Groh Georg, Facilitating the Exchange of Explicit Knowledge through Ontology Mappings, Proc. of the Fourteenth International FLAIRS Conference, 2001.
GLUE
  Doan Anhai, Madhavan Jayant, Dhamankar Robin, Domingos Pedro, and Halevy Alon, Learning to Match Ontologies on the Semantic Web, WWW2002, May, 2002.
Google Conditional Probability
  P(HUMAN | MAN) = 1.77 billion / 2.29 billion = 0.77
  P(HUMAN | WOMAN) = 0.6 billion / 2.29 billion = 0.26
  Wyatt D., Philipose M., and Choudhury T., Unsupervised Activity Recognition Using Automatically Mined Common Sense. Proceedings of AAAI-05. pp. 21-27.
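
The two "Google Conditional Probability" figures are ratios of search-engine hit counts. The slide does not say which query produced each count, so the sketch below only reproduces the division. The function name `hit_ratio` and the parameter names are hypothetical, and treating the numerator as a joint hit count and the denominator as a base hit count is an assumption.

```python
def hit_ratio(joint_hits: float, base_hits: float) -> float:
    """Estimate a conditional probability as a ratio of two search hit counts."""
    if base_hits <= 0:
        raise ValueError("base_hits must be positive")
    return joint_hits / base_hits


# Hit counts quoted on the slide.
print(round(hit_ratio(1.77e9, 2.29e9), 2))  # 0.77
print(round(hit_ratio(0.6e9, 2.29e9), 2))   # 0.26
```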

Conclusion and Future Work

Text retrieved from the web can be used as exemplars for text-classification-based ontology mapping
Many parameters affect the quality of the exemplars
There is noise in the processed documents
Future work
  Clustering

Questions
