
Personal Ontology Learning - Georgetown Universityinfosense.cs.georgetown.edu/publication/slides/defense-talk.pdf · Ph.D. Thesis Defense Talk 1 Notice Comment Rulemaking U.S. regulatory


Personal Ontology Learning

Grace Hui Yang
Language Technologies Institute, Carnegie Mellon University
[email protected]

Thesis Committee:
Jamie Callan (CMU, Chair)
Jaime Carbonell (CMU)
Christos Faloutsos (CMU)
Eduard Hovy (ISI/USC)

Nov 9, 2011

Ph.D. Thesis Defense Talk

Notice-and-Comment Rulemaking

• U.S. regulatory agencies receive and process a large number of public comments every day
  - By law, they need to read each of them
• A few rules attract hundreds of thousands of emails per year
• Government employees need to quickly get an overview of the “lay of the land”

Ph.D. Defense, Nov 9, 2011

Organizing Comments

“Protect polar bear” (USDOI-FWS-2007-0008)

Blue Links Organized Information

Why Search Engines Aren’t Enough?

Search activities (customized from a slide by Gary Marchionini):

• Lookup: fact retrieval; known-item search; revisiting pages [50-80%; Teevan et al. 08]; verification; question answering
• Learn: knowledge acquisition; comprehend/interpret; compare; aggregate/integrate; socialize
• Investigate: accrete; analysis; exclude; synthesis; evaluation; discovery; plan/forecast; transform

Figure annotations: where search engines invest most of their resources; where people invest most of their time in web search; where it needs to improve.

This thesis explores this new task, which

• Identifies concepts discussed in a set of documents;
• Organizes these concepts into an ontology;
  - Or, if you prefer, a taxonomy or concept hierarchy
• And does so in your own way.

Why Not Existing Ontologies?

• They do not contain your vocabulary
• They are not customized to your needs

What Should the Structure Look Like?

[Figure: a concept graph around “Heart Attack” with typed relations (causes, is-a, reduces, increases, co-exist, antonym, treats) among the concepts New Medicine, Bypass, Surgery, Angioplasty, Blood Clots, Vessel Narrowness, Blood Pressure, Blood Sugar, Diabetes, Mood Change, Enough Sleep, Exercise, Greasy Food, Healthy Food, Vegetable, and Fruit. Next to it, two alternative hierarchies of the same concepts: one groups them under Causes, Self-help, and Medical Treatment; the other under Away from, Medical Treatment, and Healthy Life Style.]

Tribble and Rose. Useable browser for ontological knowledge acquisition. (CHI 2006)

What Should the Structure Look Like?

• Connections:
  - Different views in databases
  - Faceted search in information retrieval

[Figure: two alternative hierarchies of the heart-attack concepts. One groups them under Causes, Self-help, and Medical Treatment; the other under Away from, Medical Treatment, and Healthy Life Style.]

Formally, a Personal Ontology Is Defined as

• Concepts: {c1, c2, …, cn}
• Relations: {r(c1, c2), r(c1, c3), …}
• Domain
• Manual guidance; personal preferences

[Figure: a hierarchy of concept nodes]

How to Construct a Personal Ontology?

Subtask1: Extracting Concepts

Subtask2: Identifying Relations

CIKM ONISW08a, Dg.O 2008.

ACL09, IEEE Intelligent Systems 09, SIGIR09p, HCIR08, CIKM ONISW08a. CIKM ONISW08b, Dg.O 2008.

[Figure: concept nodes being organized into a hierarchy; labels: automated, interactive]

This Talk Presents

• A general ontology learning framework
• Put human in the loop
• Efficient hierarchy similarity measure – FBS
• Study of user behaviors

How to Automatically Build Ontologies?

• Clustering
  - Similarity decided by: context [1]; co-occurrence [2]
  - Examples: Yippy (Clusty), topic models as in Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI)
  - Strength: intuitive; weakness: non-interpretable clusters
• Patterns [3, 4, 5, 6, 7, 8]
  - Syntax & semantics of natural language, e.g., “…, or other …”, “…, and other …”
  - Examples: Hearst patterns [3], double-anchored patterns [8], NELL@CMU [10]
  - Strength: accurate; weakness: low coverage

References:
1. Pantel and Ravichandran 04.
2. Snow, Jurafsky, and Ng 06.
3. Hearst 92.
4. Snow et al. 05.
5. Pantel et al. 04.
6. Roark and Charniak 98.
7. Davidov and Rappoport 06.
8. Kozareva et al. 08.
9. Etzioni et al. 05.
10. Mitchell et al. 10.

Clustering vs. Patterns vs. What We Want

[Figure: three organizations of the heart-attack concepts. The pattern-based result yields small fragments (Surgery: Angioplasty, Bypass; Disease: Heart Disease, Diabetes; Food: Healthy Food, Vegetable, Fruit) and leaves the remaining concepts unattached. The desired result is a full hierarchy under Heart Attack with Causes, Self-help, and Medical Treatment. Question marks indicate the gap between the automatic outputs and what we want.]

How will a human organize concepts?

• Human solution:
  - Form small & accurate fragments
  - Examine the remaining concepts one by one
  - Look for the best place for a concept
• We take a similar approach!

Zooming in: Pair-wise Semantic Distances

• Many techniques …
  - Clustering: context; co-occurrence; KL divergence in Google snippets; KL divergence in Wikipedia
  - Pattern: “…, and other …”; “… consists of …”; “…, …, or other …”; “… is a …”; “…, including …”
  - Others: edge distance in parse tree; word length difference; overlaps in definition; overlaps in modifier
• They are all good
• So we decide to use all of them!
• … by providing a general framework
  - Transform each technique into a feature
  - Weighted combination of the features
  - Learning the weights from training data (WordNet, ODP)

Weighted Combination of Feature Function Values as the Pair-wise Distance

$$d(c_x, c_y) = \sqrt{\mathrm{features}(c_x, c_y)^{T} \, W^{-1} \, \mathrm{features}(c_x, c_y)}$$

• Features: patterns, syntactic parse tree, context, co-occurrence, definition, word length
• Weight matrix W: learned from training data
• This is a Mahalanobis distance
• W ≥ 0, positive semi-definite, to ensure the triangle inequality
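The weighted combination above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pairwise_distance`, the two-feature vector, and the example inverse weight matrices are hypothetical, and the inverse of W is passed in directly rather than learned.

```python
import math
from typing import Sequence


def pairwise_distance(features: Sequence[float],
                      w_inv: Sequence[Sequence[float]]) -> float:
    """Return sqrt(f^T * W^{-1} * f) for the feature vector of one concept pair."""
    n = len(features)
    quad = sum(features[i] * w_inv[i][j] * features[j]
               for i in range(n) for j in range(n))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(quad, 0.0))


# Hypothetical feature values for one concept pair (not taken from the talk).
f_xy = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]          # W^{-1} = I gives the Euclidean length
down_weighted = [[0.25, 0.0], [0.0, 0.25]]   # a larger W shrinks the distance

d_plain = pairwise_distance(f_xy, identity)
d_scaled = pairwise_distance(f_xy, down_weighted)
```

With the identity matrix the distance is the plain Euclidean norm of the feature vector; changing the weights rescales how much each feature contributes.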

Best Possible Position for a Concept: Minimum Evolution

• Connections:
  - Minimum evolution principle in biology
  - Minimum spanning tree in graph theory
• When a concept arrives,
  - Its insertion should give the least increase to the overall semantic distance in the ontology
• Why is this true?
  - Correct position = small distances to neighbors
  - Wrong position = big distances to neighbors
  - So we minimize the overall semantic distance in the ontology

Minimum Evolution

The optimal ontology is the one that introduces the least increase to the overall semantic distance:

$$T = \arg\min_{T'} \Delta(T^{0}, T')$$

$$T^{n+1} = \arg\min_{T'} \Delta(T^{n}, T')$$

Minimum Evolution (An Example)

Relation: is-a (e.g., Apple is-a Fruit; Fruit is-not-an Apple)

• Start: “Game Equipment” with one existing child (unlabeled in the figure) at distance 0.3; overall distance = 0.3
• The concept “ball” arrives, with pair-wise distances:
  - dist(“ball”, [unlabeled node]) = 0.27
  - dist(“Game Equipment”, “ball”) = 0.1
  - dist(“ball”, “Game Equipment”) = 3
  - dist([unlabeled node], “ball”) = 12
• Candidate 1: edges 0.3 and 12; overall distance = 12.3
• Candidate 2: edges 0.3 and 0.1; overall distance = 0.4
• Candidate 3: edges 0.1 and 0.27; overall distance = 0.37 (min)
• Next arrival: “table”, to be placed in the hierarchy of Game Equipment and ball
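The insertion step in the “ball” example can be sketched as a small greedy search. This is a minimal illustration, not the thesis implementation: the names `overall_distance`, `candidates`, and `insert_concept` are hypothetical, the unlabeled node from the slides is called `"item"` here, and only two kinds of moves are tried (attach as a leaf, or splice between a node and its parent).

```python
from typing import Dict, Iterator, Optional, Tuple

Dist = Dict[Tuple[str, str], float]   # (parent, child) -> directed distance
Tree = Dict[str, Optional[str]]       # child -> parent (the root maps to None)


def overall_distance(tree: Tree, dist: Dist) -> float:
    """Sum the distances of all parent-child edges in the tree."""
    return sum(dist[(p, c)] for c, p in tree.items() if p is not None)


def candidates(tree: Tree, new: str) -> Iterator[Tree]:
    """Yield every tree obtained by one allowed insertion of `new`."""
    for node in tree:                      # attach `new` as a leaf under `node`
        t = dict(tree)
        t[new] = node
        yield t
    for child, parent in tree.items():     # splice `new` between parent and child
        if parent is not None:
            t = dict(tree)
            t[new] = parent
            t[child] = new
            yield t


def insert_concept(tree: Tree, new: str, dist: Dist) -> Tree:
    """Pick the candidate with the smallest overall distance."""
    return min(candidates(tree, new), key=lambda t: overall_distance(t, dist))


# Distances from the slides; "item" stands in for the unlabeled node.
dist = {
    ("Game Equipment", "item"): 0.3,
    ("ball", "item"): 0.27,
    ("Game Equipment", "ball"): 0.1,
    ("ball", "Game Equipment"): 3.0,
    ("item", "ball"): 12.0,
}
start = {"Game Equipment": None, "item": "Game Equipment"}
best = insert_concept(start, "ball", dist)
```

On these numbers the three candidates score 0.4, 12.3, and 0.37, so the splice under “Game Equipment” wins, matching the minimum marked on the slides. Making the new concept the root is left out of this sketch for brevity.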

Concerns

• Order of the insertions
  - Small ontologies: random restarts
  - Big ontologies: partial random restarts for recent arrivals
• Search space is big
  - Constrain the ontology candidates
  - Constraints come from a good understanding of the characteristics of a personal ontology
    - Concept abstractness
    - Long distance concept coherence

Concept Abstractness

[Figure: a hierarchy rooted at “things to discuss”, running from more abstract concepts at the top to more concrete ones at the bottom. Concepts shown: global warming, issues, actions, pollution, policies, causes, CO2, impact, animal death, polar bear, seal, wolf, severe weather, flood, EPA rules, DOT rules, reduce power plant, reduce emission.]

• Each abstraction level has its own distance function

Long Distance Coherence

[Figure, repeated over five slides: a hierarchy rooted at “I see myself in 5 years” with the branches “things to buy” and “things to work on”. Concepts shown: car, sport, sedan, BMW, swim, ball, games, athletics, football, baseball, basketball, tenure, good teaching, good research, great ideas, hard work. The position of “purse” changes across the slides.]

• Each root-to-leaf path is coherent
• Overall distances in a path should be minimized

Multi-Criterion Optimization

• Minimum evolution objective
• Abstractness objective
• Coherence objective

Evaluation

• Task: Reconstruct ontology fragments
• Datasets:
  - 50 hypernym ontology fragments from WordNet: gathering, professional, people, building, place, milk, meal, …
  - 50 hypernym ontology fragments from ODP: computers, robotics, intranet, mobile computing, database, …
  - 50 meronym ontology fragments from WordNet: bed, car, building, lamp, earth, television, body, drama, …
• Evaluation metrics: Precision, Recall, and F1-measure for parent-child pairs
  - Averaged over 50 leave-one-out cross-validation runs
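The parent-child metrics named above can be computed as follows. This is a minimal sketch assuming each ontology fragment is represented as a set of (parent, child) pairs; the function name and the toy fragments are hypothetical and not taken from the evaluation data.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (parent, child)


def precision_recall_f1(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Score predicted parent-child pairs against the reference pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy fragments for illustration only.
gold = {("food", "fruit"), ("fruit", "apple"), ("fruit", "pear")}
predicted = {("fruit", "apple"), ("fruit", "pear"), ("apple", "food")}

p, r, f1 = precision_recall_f1(predicted, gold)
```

Here two of the three predicted pairs are correct and two of the three reference pairs are recovered, so precision, recall, and F1 are all 2/3.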

Comparison to State-of-the-art

WordNet is-a
System              Precision  Recall  F1
Hearst 1992         0.85       0.32    0.46
Girju et al. 2003   -          -       -
Snow et al. 2006    0.75       0.73    0.74
Our Approach        0.82       0.79    0.82

ODP is-a
System              Precision  Recall  F1
Hearst 1992         0.31       0.29    0.30
Girju et al. 2003   -          -       -
Snow et al. 2006    0.60       0.72    0.64
Our Approach        0.64       0.70    0.67

WordNet part-of
System              Precision  Recall  F1
Hearst 1992         -          -       -
Girju et al. 2003   0.75       0.25    0.38
Snow et al. 2006    0.68       0.52    0.57
Our Approach        0.69       0.55    0.61

Features vs. Relations

Metric: F1. WordNet

Feature         Is-a   Sibling  Part-of  Benefited Rel.
Co-occurrence   0.48   0.41     0.28     All
Pattern         0.46   0.41     0.30     All
Contextual      0.21   0.42     0.12     Sibling
Syntactic       0.22   0.36     0.12     Sibling
Word Length     0.16   0.16     0.16
Definition      0.12   0.18     0.10
All             0.82   0.79     0.61     All

Best features:
• Is-a: Co-occurrence, Pattern
• Sibling: Contextual, Co-occurrence, Pattern, Syntactic
• Part-of: Co-occurrence, Pattern

Features vs. Abstractness

Metric: F1. WordNet / is-a

Feature         Level 2  Level 3  Level 4  Level 5  Level 6
Co-occurrence   0.47     0.56     0.45     0.41     0.41
Pattern         0.47     0.44     0.42     0.39     0.40
Contextual      0.29     0.31     0.35     0.36     0.36
Syntactic       0.31     0.28     0.36     0.38     0.40
Word Length     0.16     0.16     0.16     0.16     0.16
Definition      0.12     0.12     0.12     0.12     0.12

Features vs. Abstractness

Metric: F1. WordNet / is-a

Feature                    Abstract Concepts  Concrete Concepts
Co-occurrence, Pattern     Good               Good
Contextual, Syntactic      Bad                Good
Word Length, Definition    Bad                Bad

Outline

• A general ontology learning framework
• Put human in the loop
• Efficient hierarchy similarity measure – FBS
• Study of user behaviors

Put Human in the Loop

• Purpose: Customize the ontology to suit individual needs
• Collect guidance from human
• Guidance in a representation that can be understood by machine

OntoCop (Ontology Construction Panel)

[Screenshot: the OntoCop interface]

After a Few Human Edits – Interact!

[Screenshot]

OntoCop Makes Suggestions

[Screenshot]

Matrix Representation for Ontology

Concepts (row and column order): person, leader, president, prime minister, Obama

• Before human edits: Before Matrix

1 0 0 0 0
0 1 0 0 0
0 0 1 1 0
0 0 1 1 0
0 0 0 0 1

• After human edits: After Matrix

1 0 0 0 0
0 1 1 0 0
0 1 1 0 0
0 0 0 1 0
0 0 0 0 1

[Figure: the hierarchy over person, leader, president, prime minister, and Obama before and after the human edits]

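Manual Guidance

Before Matrix:

1 0 0 0 0
0 1 0 0 0
0 0 1 1 0
0 0 1 1 0
0 0 0 0 1

After Matrix:

1 0 0 0 0
0 1 1 0 0
0 1 1 0 0
0 0 0 1 0
0 0 0 0 1

The manual guidance keeps the entries of the After Matrix that lie in the rows and columns that differ between the two matrices (here rows 2-4 and columns 2-4):

1 1 0
1 1 0
0 0 1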
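The step from the Before and After matrices to the manual guidance can be sketched as below. This is an illustrative reading of the slide, assuming the guidance is the part of the After Matrix restricted to rows and columns that changed; the helper names `changed_indices` and `manual_guidance` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def changed_indices(before: Matrix, after: Matrix) -> Tuple[List[int], List[int]]:
    """Return the row and column indices where the two matrices differ."""
    n = len(before)
    rows = [i for i in range(n) if before[i] != after[i]]
    cols = [j for j in range(n)
            if any(before[i][j] != after[i][j] for i in range(n))]
    return rows, cols


def manual_guidance(before: Matrix, after: Matrix) -> Matrix:
    """Keep the After-matrix entries at the changed rows and columns."""
    rows, cols = changed_indices(before, after)
    return [[after[i][j] for j in cols] for i in rows]


# Matrices from the slides (order: person, leader, president, prime minister, Obama).
before = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
after = [
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]

guidance = manual_guidance(before, after)
```

For these inputs the changed rows and columns are the ones for leader, president, and prime minister, and the extracted block reproduces the 3×3 guidance matrix on the slide.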

How to Incorporate Manual Guidance

• Nearest neighbors
  - Find the most similar pairs (nearest neighbors) to the manual guidance, and predict accordingly
  - Why not?
    - Conflicts among predictions from multiple pieces of guidance
    - Lost transitivity of distance
• Use our ontology learning framework instead!

Manual Guidance as Training Data

$$\min_{W} \sum_{x=1}^{|G^{(i)}|} \sum_{y=1}^{|G^{(i)}|} \left( d^{(i)}_{xy} - \sqrt{\mathrm{features}(c^{(i)}_x, c^{(i)}_y)^{T} \, W^{-1} \, \mathrm{features}(c^{(i)}_x, c^{(i)}_y)} \right)^{2} \quad \text{subject to } W \succeq 0$$

Figure labels: Training Data; Manual Guidance; WordNet; ODP; Smoothing.

Update the Ontology

• Predict distance scores for unmodified concepts

$$d^{(i+1)}_{lm} = \sqrt{\mathrm{features}(c^{(i+1)}_l, c^{(i+1)}_m)^{T} \, \left(W^{(i)}\right)^{-1} \, \mathrm{features}(c^{(i+1)}_l, c^{(i+1)}_m)}$$

• Organize concepts in the updated ontology
  - When $d^{(i+1)}_{lm}$ is small (< 0.5), the relation between $c^{(i+1)}_l$ and $c^{(i+1)}_m$ is true
• The relation can be of any type
  - but one relation in one ontology

User Study

• Task: Build personal ontologies from a set of given concepts and documents
• 20 datasets
  - 10 NAICS (North American Industry Classification System): information, health care, administrative services, professional services, finance, construction, public administration, …
  - 5 Web: find a good kindergarten, buy a used car, plan a trip to DC, make a cake, and find a wedding videographer
  - 5 Public Comments: protect polar bear, protect wolf, mercury pollution, transportation registration fee, and national organic program

User Study

• A within-subject study with 24 graduate and undergraduate students
• Procedure:
  - Start with tool training
  - Everyone did both manual and interactive ontology construction for the testing tasks
  - Questionnaire: dataset difficulty, system learning ability, edit quality, manual vs. interactive comparison, etc.
  - 12 participants repeated the tasks after 3 weeks

Accuracy of OntoCop’s Suggestions

$$\text{accuracy} = \frac{\#\ \text{accepted suggestions}}{\#\ \text{total suggestions}}$$

• The accuracy of suggestions is high across all datasets

[Figure: accuracy of OntoCop’s suggestions per dataset; higher is better]

More Results

• Efficiency: on average, OntoCop saves 20% of the time (p<.001) and 25% of the edits (p<.001) per dataset compared with manual runs
• Comparison to reference ontologies: OntoCop produces ontologies that are more similar to the reference ontology (0.82) than manual runs do (0.74)
• Dataset difficulty:
  - Correlates with dataset type: NAICS > Web, Comments
  - More difficult dataset → longer construction time, less confidence in human edits, lower self-consistency
• … and more on user behaviors

Outline

• A general ontology learning framework
• Put human in the loop
• Efficient hierarchy similarity measure – FBS
• Study of user behaviors

Fragment View of Hierarchies

Vocabulary: game equipment, ball, basketball, volleyball, soccer, tennis racket, table, table-tennis table, snooker table, badminton racket, tennis table, football, hockey ball

[Figure: two hierarchies over this vocabulary, each split into fragments]

BOW Representation of Hierarchies

First hierarchy:
game equipment: (0,1,1,1,1,1,1,1,1,0,0,0,0)
ball: (0,0,1,1,1,0,0,0,0,0,0,0,0)
table: (0,0,0,0,0,0,0,1,1,0,0,0,0)

Second hierarchy:
game equipment: (0,1,0,0,0,0,1,1,1,1,1,1,1)
ball: (0,0,0,0,0,0,0,0,0,0,1,1,1)
table: (0,0,0,0,0,0,0,1,1,0,0,0,0)

Fragment-Based Similarity (FBS)

First hierarchy:
game equipment: (0,1,1,1,1,1,1,1,1,0,0,0,0)
ball: (0,0,1,1,1,0,0,0,0,0,0,0,0)
table: (0,0,0,0,0,0,0,1,1,0,0,0,0)

Second hierarchy:
game equipment: (0,1,0,0,0,0,1,1,1,1,1,1,1)
ball: (0,0,0,0,0,0,0,0,0,0,1,1,1)
table: (0,0,0,0,0,0,0,1,1,0,0,0,0)

$$FBS(T_i, T_j) = \frac{1}{D} \sum_{p=1}^{m} sim_{\cos}(t_{ip}, t_{jp})$$

Figure annotations: # nodes; fragments matched by highest cosine similarity value.

• Much faster (O(n^3)) than tree edit distance (NP-hard)
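A rough sketch of FBS on the two bag-of-words hierarchies above follows. Several choices here are assumptions rather than the thesis definition: fragments are paired greedily by highest cosine similarity (a simplification of an optimal matching), the normalizer D is taken to be the larger fragment count, and the names `cosine` and `fbs` are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fbs(t1: Dict[str, Sequence[float]], t2: Dict[str, Sequence[float]]) -> float:
    """Greedy one-to-one fragment matching, normalized by the larger fragment count."""
    if not t1 or not t2:
        return 0.0
    scored = sorted(((cosine(u, v), a, b)
                     for a, u in t1.items() for b, v in t2.items()),
                    reverse=True)
    used1, used2, total = set(), set(), 0.0
    for sim, a, b in scored:
        if a not in used1 and b not in used2:   # each fragment is matched at most once
            used1.add(a)
            used2.add(b)
            total += sim
    return total / max(len(t1), len(t2))        # assumed choice of D


# Fragment vectors from the slides.
first = {
    "game equipment": (0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
    "ball":           (0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
    "table":          (0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
}
second = {
    "game equipment": (0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
    "ball":           (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1),
    "table":          (0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
}

score = fbs(first, second)
```

Under these assumptions the matched similarities are 1.0 (table), 0.5 (game equipment), and 0.0 (ball), giving a score of 0.5 for the two hierarchies.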

Outline

• A general ontology learning framework
• Put human in the loop
• Efficient hierarchy similarity measure – FBS
• Study of user behaviors

Come Back to the User Study: Self-Consistency

• Self-consistency is high (>0.75)
• Correlation with the dataset type (p<.001) and construction method (p<.05)

[Figure: self-consistency measured by FBS; higher is better]

Users form two clusters

• Longer construction time, less self-consistent
• Less construction time, more self-consistent

(difference ~3.5 mins; p<0.001, two-way ANOVA tests)

Feature Use vs. Users

[Figure]

Who were in these user clusters?

[Figure]

Concluding Remarks

Contributions

• A general ontology learning framework, which
  - allows a wider range of features/technologies
  - allows different metric functions
    - for concepts at different abstraction levels
    - for different types of relations
  - ensures long distance concept coherence
  - distinguishes concept abstractness

Contributions (cont.)

• Put human seamlessly in the loop
  - Little interruption to user experience
  - Bring personality to static, machine-generated ontologies
• Enable a range of new applications for task-specific information organization
  - Decision engine for task-oriented search
  - Specialty-specific/doctor-specific medical records
  - Literature reviews

Contributions (cont.)

• Efficient hierarchy similarity measure – FBS
  - Approximates tree edit distance well
  - … and runs in polynomial time
• Study of user behaviors
  - Users are self-consistent in ontology construction
  - Users naturally form two groups
    - Pattern-lovers
    - Broad thinkers

Thank You

Grace Hui Yang
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
[email protected]
http://www.cs.georgetown.edu/~huiyang

Supplementary Slides

Features

Lexico-Syntactic Patterns

• … _ is a/an _ …
• … _ and/or other …
• … ___ such as _ …
• … such ____ as _ …
• … ____ including _ …
• … ____ , especially _ …
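As a rough illustration of how such patterns yield is-a candidates, the sketch below matches three of them with regular expressions. It is deliberately simplistic and makes assumptions the slides do not: each slot is a single word rather than a noun phrase, and the names `PATTERNS` and `extract_is_a` are hypothetical.

```python
import re
from typing import List, Tuple

# Each entry: (compiled pattern, True if the hypernym is the first captured group).
PATTERNS = [
    (re.compile(r"(\w+) such as (\w+)"), True),            # "fruits such as apples"
    (re.compile(r"(\w+) (?:and|or) other (\w+)"), False),  # "apples and other fruits"
    (re.compile(r"(\w+) is an? (\w+)"), False),            # "apple is a fruit"
]


def extract_is_a(sentence: str) -> List[Tuple[str, str]]:
    """Return (hyponym, hypernym) candidates found in one sentence."""
    pairs = []
    text = sentence.lower()
    for regex, hypernym_first in PATTERNS:
        for first, second in regex.findall(text):
            pairs.append((second, first) if hypernym_first else (first, second))
    return pairs


pairs_such_as = extract_is_a("Fruits such as apples are healthy.")
pairs_is_a = extract_is_a("An apple is a fruit.")
```

A realistic extractor would match noun phrases from a parser or chunker instead of single tokens, which is one reason pattern-based methods tend to be accurate but low in coverage.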

Syntactic Dependency Features

• Minipar Syntactic Distance = average length of syntactic paths in syntactic parse trees for sentences containing the terms
• Modifier Overlap = # of overlaps between modifiers of the terms; e.g., red apple, red pear
• Object Overlap = # of overlaps between objects of the terms when the terms are subjects; e.g., A dog eats apple; A cat eats apple
• Subject Overlap = # of overlaps between subjects of the terms when the terms are objects; e.g., A dog eats apple; A dog eats pear
• Verb Overlap = # of overlaps between verbs of the terms when the terms are subjects/objects; e.g., A dog eats apple; A cat eats pear

Co-occurrence & Contextual

• Co-occurrence
  - Point-wise Mutual Information (PMI)
  - Counts: # of sentences containing the term(s); or # of documents containing the term(s); or n as in “Results 1-10 of about n for …” in Google
• Contextual features
  - Global Context KL-Divergence = KL-Divergence(1000 Google documents for Cx, 1000 Google documents for Cy)
  - Local Context KL-Divergence = KL-Divergence(left two and right two words for Cx, left two and right two words for Cy)

Definition & Word Length

• Definition Overlap = # of non-stopword overlaps between Web definitions of two terms
• Word Length Difference

More Experiments

Figure-only slides:
• Impact of Using Concept Abstractness
• Impact of Using Concept Coherence
• Perceived System Learning Ability
• Construction Time
• Number of (Human) Edits
• Comparison to Reference Ontology
• Commonality/Differences among ontologies built by different people

More Definitions

Full Ontology

[Figure: Game Equipment with children ball and table]

GroupedConceptSet = {game equipment, ball, table, basketball, volleyball, soccer, table-tennis table, snooker table}
UngroupedConceptSet = {}

Ontology Metric

$$d_{T,w}(j, k) = \sum_{e_{jk} \in P(j,k)} w(e_{jk})$$

[Figure: a hierarchy with Game Equipment, ball, and table; edge distances 1.5, 2, 1, and 1; example path distances d = 2, d = 1, and d = 4.5]

More Applications

Real-time Email & Discussion Monitoring (Communication)

Virtual Medical Records (Health Care)

• Specific views from patient data for each specialty/doctor
• Save time on digging out medical history
  - Less need of physical appointments
  - Able to handle more patients

Abstractness

Tree of Porphyry – John Sowa

Geographical categories in the Chat-80 system – John Sowa