Knowledge Acquisition in a System Christopher Thomas Ohio Center of Excellence in Knowledge-enabled Computing - Kno.e.sis, Wright State University Dayton, OH [email protected]

PhD thesis defense of Christopher Thomas


DESCRIPTION

Christopher Thomas defended his thesis on "Knowledge Acquisition in a System". Video can be found at: http://www.youtube.com/watch?v=NeQomGsJvDk


Page 1: PhD thesis defense of Christopher Thomas

Knowledge Acquisition in a System

Christopher Thomas

Ohio Center of Excellence in Knowledge-enabled Computing - Kno.e.sis, Wright State University

Dayton, OH [email protected]

Page 2: PhD thesis defense of Christopher Thomas

Knowledge Enabled Information and Services Science

Circle of knowledge in a System


Page 3: PhD thesis defense of Christopher Thomas

Dissertation Overview

Conceptual Knowledge: Ontologies, LoD

Doozer++: Taxonomy extraction, Relationship/Fact extraction [IHI, WebSem1, IEEE-IC, WebSci, WI1]

Information Quality [WI2]

Social processes for content creation [CHB]

Textual Information: Wikipedia, Web

Knowledge merging/Ontology alignment [AAAI, WebSem2, SWSWPC]

Social processes for knowledge validation [IHI, WebSci, CHB]

Knowledge Representation [IJSWIS, CR, FLSW]

Ontology design [WWW, FOIS]

Page 4: PhD thesis defense of Christopher Thomas


Talk Contents

What is knowledge?

How do we turn propositions/beliefs into knowledge?

How do we acquire information?

Page 5: PhD thesis defense of Christopher Thomas

Talk outline

• Motivation
• Knowledge Acquisition (KA) Overview
• KA in a loosely connected system – Doozer++
  – Automatic formal domain model creation
  – Information Extraction
    • Top-Down
    • Bottom-Up
  – Information Validation “in use”
• Conclusion

Page 6: PhD thesis defense of Christopher Thomas

Larger Context of automated KA

• Increasing significance of knowledge economy
  – “Knowledge Workers” spend 38% of their time searching for information (McDermott, 2005)
  – Vital to get a quick and still comprehensive understanding of a field through pertinent concepts/entities and relations/interactions
• Increased demand for formally available knowledge in semantic models
  – Filtering, browsing, annotation, reasoning

McDermott, M. "Knowledge Workers: How can you gauge their effectiveness." Leadership Excellence. Vol. 22.10. October 2005

Page 7: PhD thesis defense of Christopher Thomas

Motivating Scenario

• Learn about a new subject
  – E.g. gain a quick overview of a current or historical event
• Use a formal representation of the gained overview to filter information
  – Facilitate in-depth exploration
• Use the formalized information and the user interaction to create knowledge from information

Page 8: PhD thesis defense of Christopher Thomas


Motivating Scenario

• Google: India

• Brief description – demographic-, geographic information, etc.


Page 9: PhD thesis defense of Christopher Thomas


Motivating Scenario

• Google: India

• Regular Web results


Page 10: PhD thesis defense of Christopher Thomas


Motivating Scenario

• Clicking on a link to the Wikipedia entry shows that there have been conflicts with Pakistan over the region of Kashmir

Investigate more


Page 11: PhD thesis defense of Christopher Thomas


Motivating Scenario

• Google: India Pakistan Kashmir

• Only Web results and news

So far, search engines only display facts about entities, not relationships or larger contexts


Page 12: PhD thesis defense of Christopher Thomas

Motivating Scenario

• Beneficial to get an overview “at a glance” of a domain.

• Automated approach to creating knowledge models for focused areas of interest

• Create models around an incomplete or rudimentary keyword description and “anticipate” the user’s intentions with respect to the full context

Page 13: PhD thesis defense of Christopher Thomas

Motivating Scenario

Doozer++: india pakistan kashmir

• Important concepts and relationships describing the context

Page 14: PhD thesis defense of Christopher Thomas


Motivating Scenario

• Filtered IR using concepts in the model

• Concepts and relationships that contributed to clicked results gain support

• User can explicitly approve content


Page 15: PhD thesis defense of Christopher Thomas


Circle of Knowledge (Example)

Page 16: PhD thesis defense of Christopher Thomas

Motivating Scenario

• On-demand creation of domain knowledge improves individual comprehension of an event
• Formal models are easy to use in information filtering
• Validated information ⇒ Knowledge
  – Can be given back to the community to improve the overall amount of formal knowledge available on the Web
  – E.g. “Unknown” to DBPedia that the region of Kashmir belongs to both India and Pakistan

Page 17: PhD thesis defense of Christopher Thomas

Importance of Model creation

• Models support the individual user or knowledge worker, but also groups or systems
  – More efficient communication through small, shared, agreeable conceptualizations
    • People ↔ people
    • People ↔ system
    • System ↔ system
  – Classify or filter pertinent and topical information using models
  – Model-assisted searching and faceted or exploratory browsing using relationships
  – Reuse of validated knowledge

Page 18: PhD thesis defense of Christopher Thomas

Domain Knowledge Models

• Scientific applications
  – In-depth description of concepts
  – Narrow field
  – People ↔ system, system ↔ system
    • Annotation, reasoning
⇒ Absolute correctness necessary (as far as possible)
• General applications
  – Broad coverage of the field
  – Context – how does the new information fit in?
  – People ↔ people, people ↔ system
    • Individual domain comprehension, filtering, annotation
⇒ Relative correctness sufficient

Page 19: PhD thesis defense of Christopher Thomas

Model Creation Resources

• Large models are available as reference
  – DBPedia, YAGO, UMLS, MeSH, GO …
  – Too big to be efficiently and effectively usable
    • Prior knowledge required to find pertinent resources
• Other information is available in great abundance, but unformalized
  – Tacit expert knowledge
  – Scientific databases
  – Free text
    • Peer-reviewed journals and proceedings
    • General Web content

Page 20: PhD thesis defense of Christopher Thomas

Epistemological Considerations

• Knowledge
  – Ensure epistemological soundness of automated knowledge acquisition
• Reference
  – Ensure that nodes in the models refer to real-world concepts/entities

Page 21: PhD thesis defense of Christopher Thomas

Knowledge

• Functional Definition
  – Knowledge = “Know-How”
  – Practical, but weak, includes “Actionable Information”
• Categorical Definition
  – Knowledge = Justified true belief
  – S knows that p iff
    i. p is true;
    ii. S believes that p;
    iii. S is justified in believing that p.

Page 22: PhD thesis defense of Christopher Thomas

Belief and Justification

• Belief
  – Statements held by the system
• Justification
  – Trusted sources
  – Extraction algorithms
    • Bayesian, deductive or inductive reasoning
    • Macro-Reading algorithms – Wisdom of the crowds
  – Validation

Page 23: PhD thesis defense of Christopher Thomas

Truth assessment of a statement

• Is truth correspondence?
  – “A” is true iff A (a true statement corresponds to an actual state of affairs)
• Is truth coherence?
  – Does the statement fit into the system of other statements?
• Is truth consensus?
  – Agreement of correctness amongst a group
⇒ In the cyclical model, achieve a high degree of certainty by allowing constant validation

No Access

Page 24: PhD thesis defense of Christopher Thomas

Domain Model – Reference

• Model of a domain conceptually split
  – Domain Definition
    • Identifying concepts by URIs (classes, entities, relationship types) ensures reference
    • Remains static – necessity
    • Rigid designators (Kripke)
  – Domain Description
    • Relationships describe concepts
    • Subject to change – possibility
    • Definite descriptions (Russell)

Page 25: PhD thesis defense of Christopher Thomas

Domain Definition

• Top-down concept identification
• Achieved through
  – Manual creation based on consensus in a group
  – Extraction from community-created or peer-reviewed conceptualization
    • Wikipedia
    • MeSH or UMLS Semantic Network

Page 26: PhD thesis defense of Christopher Thomas

Domain Description

• Possible to do top-down extraction of the domain description, e.g. from DBPedia
• Problem: Formal concept descriptions are sparse
  – On average, DBPedia has fewer than 2 object properties per entity
• Extract descriptions (facts) bottom-up
  – Available in text, DBs, etc.
  – Domain-specific molecular structure extractors (GlycO)
  – Domain-independent IE techniques (Doozer++)

Page 27: PhD thesis defense of Christopher Thomas

Knowledge Acquisition Approaches

• KA in a tightly connected system
  – GlycO: domain-specific BioChemistry ontology
    • Manual domain definition and description
    • Partial automatic domain description
    • Domain-specific automatic validation
    • Manual validation for false negatives
• KA in a loosely connected system
  – Doozer++: general domain-model creation framework
    • Automatic domain definition, top-down concept extraction
    • Automatic domain description, bottom-up fact extraction
      – Extraction from trusted sources
      – A trusted extraction and validation procedure
    • Domain-independent community-based validation

Page 28: PhD thesis defense of Christopher Thomas

Knowledge Acquisition Approaches

Aspect | Knowledge Engineering Approach | Traditional Extraction Approach | GlycO | Doozer++
Definition | Top-Down | Bottom-up | Top-Down, Knowledge Engineering | Top-Down conceptually, by extraction from Top-Down corpus
Description | Top-Down | Bottom-up | Bottom-up, restricted by Top-down definition | Bottom-up, restricted by Top-down definition
Verification | Manual | Manual | Correctness: automatic; Exceptions: added manually | Community-based validation

Page 29: PhD thesis defense of Christopher Thomas

KA on the Web - Vision

• Web searches, browsing sessions or classification tasks can be seen as creating an implicit domain model
  – World view, Concept coverage, Facts
• Make models explicit and reusable using formal descriptions (RDF, OWL)
• Validate the contained information and share with the community

Increase system’s knowledge by “doing what you do”: Search, browse, click, communicate

Page 30: PhD thesis defense of Christopher Thomas

KA in a Loosely Connected System

Scooner
Evaluation in Use: Semantic browsing and retrieval, Domain-independent, Community-based

Doozer++
– Domain Definition: Top-down concept extraction
– Domain Description: Pattern-based fact extraction

• Linked Data
• Free text
• Wikipedia
• Web

Domain Model creation to gradually increase overall knowledge of the system
• User-interest driven
• Incentive to evaluate

Domain Definition, Domain Description, Validation

Page 31: PhD thesis defense of Christopher Thomas

Domain Definition Requirements

• Identify concepts, concept labels (denotations) and concept hierarchy

• Challenge: define narrow boundaries for a domain while at the same time ensuring broad conceptual coverage within the domain

Page 32: PhD thesis defense of Christopher Thomas

Domain definition - conceptual

• Expand and Reduce approach
  – Start with “high recall” methods
    • Exploration – Full text search
    • Exploitation – Graph-Similarity Method
    • Category growth
    • “What could be in the domain?”
  – End with “high precision” methods
    • Apply restrictions on the concepts found
    • Remove terms and categories that fall outside the dense areas of the model graph
    • “What should be in the domain?”

Page 33: PhD thesis defense of Christopher Thomas

Domain Description - Classifier

• Concept-aware
  – Use concepts and concept labels from the domain definition step
• Fact extraction as classification of concept pairs into relationship types
  – $f_{class}: C \times C \to R$
  – $R_{S,O} = \{R \mid p(R,S,O) > \varepsilon\}$
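The thresholding step on this slide can be illustrated with a minimal Python sketch. It assumes the scores p(R,S,O) for one concept pair have already been computed; the function name `classify_pair` and the example numbers are hypothetical and not taken from Doozer++.

```python
def classify_pair(scores, epsilon):
    """Return the relationship types whose score p(R,S,O) exceeds epsilon, best first.

    scores: dict mapping relationship name -> p(R,S,O) for one (Subject, Object) pair.
    """
    kept = [(rel, p) for rel, p in scores.items() if p > epsilon]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical scores for a single concept pair (made-up numbers).
scores = {"almaMater": 0.71, "employer": 0.22, "birthPlace": 0.03}
selected = classify_pair(scores, epsilon=0.2)
print(selected)
```

Returning a ranked list rather than a single label matches the later sample-result slides, which report rank 1 to rank 3 relationships per pair.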

Page 34: PhD thesis defense of Christopher Thomas

Domain Description

• Combined Language model and Semantic classification model
• Language model: surface-pattern-based
  – Pattern manifestations of relationships as features
  – Open to any corpus, language independent
  – Less computational overhead than NLP
• Semantic Classification Model
  – Learned or assigned concept labels
  – Semantic types to aid classification

Page 35: PhD thesis defense of Christopher Thomas

Domain Description - Implementation

• Probabilistic Vector-space model
  – Each relationship is defined by vectors of
    • Pattern probabilities
    • Domain/range probabilities
  – Each concept is grounded by its semantic types and manifested by its labels and their probabilities of identifying the concept
  – Sparse pattern representation (density ~2%)
  – White-box, easily verifiable
  – Inherently parallel

Page 36: PhD thesis defense of Christopher Thomas

Terminology

Symbol | Meaning | Example
S, O | Subject and Object concepts (semantic) | Kelly_Miller_(scientist), Howard_University
L_S, L_O | Subject and Object labels | “Kelly Miller”, “Howard University”
P_{L_S,L_O} | Phrase instantiating the pattern | Kelly Miller graduated from Howard University
P | Pattern | <Subject> graduated from <Object>
T_S, T_O | Semantic type of Subject or Object | Person, Educational_Institution
R | Relationship | almaMater, birthPlace

Page 37: PhD thesis defense of Christopher Thomas

Probabilistic Classifier

Labels taken from Lexicon or linked corpus

Patterns learned from free text

Semantic types: asserted in Ontology or learned from linked data

Page 38: PhD thesis defense of Christopher Thomas

Probabilistic Classifier

How is Barack Obama related to Columbia University?

p(R, Barack_Obama, Columbia_University)

Sentence in corpus: Obama graduated in 1983 from Columbia University with a degree in political science and international relations.

(Regular classification requires multiple examples)

Page 39: PhD thesis defense of Christopher Thomas

Probabilistic Classifier

Sentence in corpus: Obama graduated in 1983 from Columbia University

p(almaMater, Barack_Obama, Columbia_University) =
  p(almaMater | “<Subject> graduated in 1983 from <Object>”) *
  p(Barack_Obama | “Obama”) *
  p(Columbia_University | “Columbia University”) *
  p(almaMater | domain(person)) *
  p(almaMater | range(academic_institution))

p(almaMater, Barack_Obama, Columbia_University) = 0.9 * 0.95 * 0.95 * 0.9 * 0.97

p(almaMater, Barack_Obama, Columbia_University) = 0.70909425
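The arithmetic on this slide can be checked with a short Python sketch. The five factor values are the ones shown on the slide; the dictionary keys are only descriptive strings chosen here for readability.

```python
from math import prod

# Factor values as given on the slide; the keys are just readable labels.
factors = {
    "p(almaMater | pattern)": 0.9,
    "p(Barack_Obama | 'Obama')": 0.95,
    "p(Columbia_University | 'Columbia University')": 0.95,
    "p(almaMater | domain(person))": 0.9,
    "p(almaMater | range(academic_institution))": 0.97,
}

# The combined confidence is the product of the pattern, label and type factors.
confidence = prod(factors.values())
print(round(confidence, 8))
```

Because the score is a plain product, any single weak factor (an ambiguous label or an unlikely domain type, for instance) lowers the overall confidence.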

Page 40: PhD thesis defense of Christopher Thomas

Pattern Generalization

• Problem: Low recall in pattern-based IE
• Substitute terms with wild cards
  – No POS tagging, hence only “*” wild cards
• Mirrors shortest paths through parse trees

<Subject> graduated in 1983 from <Object>
<Subject> * in 1983 from <Object>
<Subject> graduated * 1983 from <Object>
<Subject> * * 1983 from <Object>
<Subject> graduated in * from <Object>
<Subject> * in * from <Object>
<Subject> graduated * * from <Object>
<Subject> * * * from <Object>
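A small Python sketch of how such wild-card variants could be enumerated. The slide does not say which tokens may be replaced; the `fixed` parameter below (token positions that are never replaced) is an assumption introduced here so that the output matches the eight patterns listed, where "from" always stays in place.

```python
from itertools import product


def generalize(tokens, fixed=frozenset()):
    """Return all variants of a pattern in which non-fixed tokens may become '*'.

    tokens: the words between <Subject> and <Object>.
    fixed:  indices of tokens that are never replaced (hypothetical parameter).
    """
    # Each position offers either just the token, or the token and the wild card.
    options = [(tok,) if i in fixed else (tok, "*") for i, tok in enumerate(tokens)]
    return ["<Subject> " + " ".join(choice) + " <Object>" for choice in product(*options)]


variants = generalize(["graduated", "in", "1983", "from"], fixed={3})
for pattern in variants:
    print(pattern)
```

With three replaceable tokens this yields 2^3 = 8 variants, including the original pattern itself.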

Page 41: PhD thesis defense of Christopher Thomas

Learning p(R|P)

• Distantly Supervised Training
• Collect pattern frequencies for training examples
  – Fact triples <S, R, O>, e.g. from Linked Data (DBPedia, UMLS)
  – Manifestations of facts in text in the form of patterns (corpus e.g. Web, Wikipedia, MedLine)
• For relationship R_i, aggregate pattern vectors representing <*, R_i, *>
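The collection step can be sketched in Python as follows. This is a simplified illustration, not the Doozer++ implementation: it uses plain substring matching, only handles the subject label appearing before the object label, and the function name `collect_patterns` is hypothetical.

```python
from collections import Counter, defaultdict


def collect_patterns(facts, labels, sentences):
    """Count surface patterns per relationship from known facts (distant supervision).

    facts:     iterable of (subject, relationship, object) concept triples
    labels:    dict mapping concept -> list of label strings
    sentences: iterable of corpus sentences
    """
    counts = defaultdict(Counter)
    for subj, rel, obj in facts:
        for sentence in sentences:
            for s_label in labels[subj]:
                for o_label in labels[obj]:
                    start = sentence.find(s_label)
                    if start < 0:
                        continue
                    end = sentence.find(o_label, start + len(s_label))
                    if end < 0:
                        continue
                    # The words between the two labels form the pattern body.
                    middle = sentence[start + len(s_label):end].strip()
                    if middle:
                        counts[rel]["<Subject> " + middle + " <Object>"] += 1
    return counts


facts = [("Kelly_Miller_(scientist)", "almaMater", "Howard_University")]
labels = {
    "Kelly_Miller_(scientist)": ["Kelly Miller"],
    "Howard_University": ["Howard University"],
}
sentences = ["Kelly Miller graduated from Howard University."]
counts = collect_patterns(facts, labels, sentences)
print(dict(counts["almaMater"]))
```

The per-relationship counters correspond to the aggregated pattern vectors for <*, R_i, *> mentioned on the slide.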

Page 42: PhD thesis defense of Christopher Thomas

Learning p(R|P) – naïve

• For each vector R_i containing pattern frequencies for relationship R_i, compute $p(P_j \mid R_i) = \frac{\#(P_j, R_i)}{\sum_k \#(P_k, R_i)}$
  • $\#(P_j, R_i)$: number of occurrences of Pattern_j with terms denoting each <S, O> ∈ R_i, normalized by all pattern occurrences for R_i

Page 43: PhD thesis defense of Christopher Thomas

Learning p(R|P) – naïve

• Uniform distribution of relationships assumed
  – As the number of relationship types grows, the prior of each type goes towards 0
  – Normalize the probabilities over the column vector to get p(R_i|P_j)
• Vector space representation
  – Relationship-pattern matrix
  – R2P_ij = p(R_i|P_j)
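The two normalization steps of the naïve approach can be sketched in Python with nested dictionaries standing in for the sparse relationship-pattern matrix. The function name and the toy counts are hypothetical; the uniform prior over relationships is the assumption stated on the slide.

```python
def relationship_given_pattern(counts):
    """Turn pattern counts per relationship into p(R_i | P_j) under a uniform prior.

    counts: dict relationship -> dict pattern -> frequency
    returns dict pattern -> dict relationship -> probability
    """
    # Step 1: row-normalize, p(P_j | R_i) = count / all pattern occurrences for R_i.
    columns = {}
    for rel, patterns in counts.items():
        total = sum(patterns.values())
        for pattern, freq in patterns.items():
            columns.setdefault(pattern, {})[rel] = freq / total
    # Step 2: normalize each pattern column so its entries sum to 1.
    return {
        pattern: {rel: value / sum(col.values()) for rel, value in col.items()}
        for pattern, col in columns.items()
    }


# Made-up toy counts for two relationship types.
toy_counts = {
    "almaMater": {"<Subject> graduated from <Object>": 3, "<Subject> studied at <Object>": 1},
    "employer": {"<Subject> studied at <Object>": 1, "<Subject> works for <Object>": 3},
}
r2p = relationship_given_pattern(toy_counts)
print(r2p["<Subject> studied at <Object>"])
```

In this toy data the shared pattern is split evenly between the two types, which is the competition effect the next slide addresses.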

Page 44: PhD thesis defense of Christopher Thomas

Problem: Relationship Similarities

• Extensional similarity
  – Semantically different relationships can share Subject-Object pairs in training data
• Intensional similarity
  – Overlap and entailment of relationship types
    • Types should not be seen as discrete
  – E.g. physical_part_of ⇒ part_of
    • A priori unknown which types overlap unless formal description available
  – Semantically similar types compete for the same patterns

Page 45: PhD thesis defense of Christopher Thomas

Relationship similarities

Pertinence: measure similarity between pattern vectors as an approximation of intensional similarity
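The slide does not name the similarity measure. As one possible illustration, the sketch below uses cosine similarity between sparse pattern vectors; that choice, the function name and the toy vectors are all assumptions made here.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dict pattern -> weight."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up pattern vectors for three relationship types.
part_of = {"<Subject> in the <Object>": 0.6, "<Subject> of the <Object>": 0.4}
physical_part_of = {"<Subject> in the <Object>": 0.5, "<Subject> inside <Object>": 0.5}
alma_mater = {"<Subject> graduated from <Object>": 1.0}

print(cosine(part_of, physical_part_of))
print(cosine(part_of, alma_mater))
```

Relationship types that share patterns score above zero, while types with disjoint pattern sets score exactly zero.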

Page 46: PhD thesis defense of Christopher Thomas


Pertinence for Relationships

Do not punish the occurrence of the same pattern with relationship types that are intensionally similar, but extensionally dissimilar

Reduce impact of extensionally similar relations


Page 47: PhD thesis defense of Christopher Thomas

Pertinence Example

Pattern: <Subject> in the right <Object>

Relationship | p(R|P)
biological_process_has_associated_location | 0.968371381
disease_has_associated_anatomic_site | 0.880452774
part_of | 0.622532958
has_finding_site | 0.561041318
has_location | 0.537424451
has_direct_procedure_site | 0.363832078
Sum | 3.933654958

Note: This never causes p(R,S,O) > 1

Page 48: PhD thesis defense of Christopher Thomas


Similarities between relationships

Page 49: PhD thesis defense of Christopher Thomas

Pertinence evaluation

[Figure: precision (y-axis) vs. recall (x-axis) for two settings: Pertinence and No Pertinence]

Page 50: PhD thesis defense of Christopher Thomas

Fact extraction evaluation - DBPedia

[Figure: precision and recall (y-axis) vs. confidence threshold (x-axis)]

Strict evaluation: only the 1st-ranked extracted relation is compared to the gold standard. Averaged over 107 relation types.

60% training set, 40% testing, DBPedia Infobox fact corpus, Wikipedia text corpus
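A minimal Python sketch of the strict evaluation described above, in which only the rank-1 relation counts. The exact definitions used in the thesis are not given on the slide; the sketch assumes precision is measured over predictions at or above the threshold and recall over all gold-standard pairs, and the data is made up.

```python
def strict_precision_recall(predictions, gold, threshold):
    """Precision and recall when only the rank-1 relation per pair is scored.

    predictions: dict pair -> (rank-1 relation, confidence)
    gold:        dict pair -> gold-standard relation
    """
    accepted = {pair: rel for pair, (rel, conf) in predictions.items() if conf >= threshold}
    correct = sum(1 for pair, rel in accepted.items() if gold.get(pair) == rel)
    precision = correct / len(accepted) if accepted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical rank-1 predictions and gold labels for three concept pairs.
predictions = {
    ("a", "b"): ("almaMater", 0.9),
    ("c", "d"): ("birthPlace", 0.6),
    ("e", "f"): ("author", 0.3),
}
gold = {("a", "b"): "almaMater", ("c", "d"): "deathPlace", ("e", "f"): "author"}

precision, recall = strict_precision_recall(predictions, gold, threshold=0.5)
print(precision, recall)
```

Sweeping `threshold` over a range of values would trace out curves of the kind the figure on this slide shows.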

Page 51: PhD thesis defense of Christopher Thomas

Sample results (DBPedia)

Subject :: Object | Suggested Relationship | Extracted Rank 1 (Rel;Confidence) | Rank 2 | Rank 3
Howard Pawley :: Gary Filmon | after | successor;0.799 | after;0.768 | office;0.686
Mulan :: Tarzan | after | nextSingle;0.603 | followedBy;0.533 | after;0.416
Species Deceases :: Midnight Oil | artist | producer;0.761 | artist;0.719 | genre;0.467
The Crystal City :: Orson Scott Card | author | artist;0.625 | author;0.617 | writer;0.583
Horatio Allen :: William Maxwell | before | predecessor;0.629 | before;0.475 |
Basdeo Panday :: Trinidad & Tobago | birthplace | deathPlace;0.658 | birthplace;0.658 | nationality;0.330
Bob Nystrom :: Stockholm | birthplace | cityOfBirth;0.677 | birthplace;0.513 |
Beccles railway station :: Suffolk | borough | district;0.772 | borough;0.770 | friend;0.749

Page 52: PhD thesis defense of Christopher Thomas

Fact extraction evaluation - UMLS

[Figure: precision and recall (y-axis) vs. confidence threshold (x-axis)]

Strict evaluation: only the 1st-ranked extracted relation is compared to the gold standard. Averaged over ~100 relation types.

60% training set, 40% testing, UMLS fact corpus, MedLine text corpus

Page 53: PhD thesis defense of Christopher Thomas

Sample results (UMLS)

Subject :: Object | Suggested Relationship | Extracted Rank 1
Teeth :: poisoning, fluoride | finding_site_of | finding_site_of
768 polyps :: polyp of cervix nos (disorder) | associated_with | associated_with
neck of uterus :: polyp of cervix nos (disorder) | location_of | finding_site_of
benign neoplasms :: polyp of colon | related_to | associated_with
brain ischemia :: brain | has_finding_site | location_of
gastrointestinal tract :: polyp of colon | is_primary_anatomic_site_of_disease | location_of
gamete structure (cell structure) :: polyvesicular vitelline tumor | is_normal_cell_origin_of_disease | is_normal_cell_origin_of_disease

Page 54: PhD thesis defense of Christopher Thomas

Comparison – DBPedia corpus

Mintz: extraction of 102 relationship types from Freebase

Doozer: 107 from DBPedia

M. Mintz, S. Bills, R. Snow, and D. Jurafsky, “Distant supervision for relation extraction without labeled data,” in ACL 2009.

[Figure: precision (y-axis) vs. recall (x-axis) for Mintz-POS, Mintz-NLP, Doozer++ (R) and Doozer++ (P)]

(R) Recall-oriented, using pattern generalization

(P) Precision-oriented, no generalization

Page 55: PhD thesis defense of Christopher Thomas

Evaluate Ad-Hoc Model Creation

• On demand creation of models

Domain | Query | Number of Concepts | Precision (Domain Definition)
Semantic Web | “Semantic Web” OWL ontologies RDF | 143 | 0.98
Harry Potter | “Harry Potter” dumbledore gryffindor slytherin | 134 | 0.98
Beatles | Beatles "John Lennon" "Paul McCartney" song | 250 | 0.99
India-Pakistan Relations | India Pakistan Kashmir | 129 | 0.99
US Financial crisis - TARP | tarp "financial crisis" "toxic assets" | 146 | 0.93
German Chancellors | German chancellors "Angela Merkel" "Helmut Kohl" | 124 | 0.91

Page 56: PhD thesis defense of Christopher Thomas


Ad-Hoc Model Creation - Evaluation

Page 57: PhD thesis defense of Christopher Thomas

Ad-Hoc Model Creation - Evaluation

Relative Recall

Recall with respect to possible extraction, i.e. the maximum number of extracted facts marks 100% recall

Page 58: PhD thesis defense of Christopher Thomas

Related Work

Mintz

SOFIE

Turney

Structural

Open IE

Supervised

Distant Supervision

Coupled learner

Surface patterns only

Pertinence for semantic similarity

Page 59: PhD thesis defense of Christopher Thomas

Main Differences

• Surface-patterns only
• Only positive training examples
• Pertinence measure for semantic similarity
• Concept-aware: start with defined concepts
• Include background knowledge in probabilistic classification instead of rule-based reasoning

Page 60: PhD thesis defense of Christopher Thomas

Related work

• Pattern-based fact extraction
  – E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In JCDL, 2000.
  – F. M. Suchanek, M. Sozio, and G. Weikum. SOFIE: A Self-Organizing Framework for Information Extraction. In WWW, 2009.
  – T. M. Mitchell, J. Betteridge, A. Carlson, E. Hruschka, and R. Wang. Populating the Semantic Web by Macro-Reading Internet Text. In ISWC, 2009.
  – M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. Organizing and searching the world wide web of facts-step one: the one-million fact extraction challenge. In AAAI, 2006.

Page 61: PhD thesis defense of Christopher Thomas

Related work

• Relationship-pattern computations
  – P. D. Turney and P. Pantel. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 2010.
  – P. D. Turney. Expressing implicit semantic relations without supervision. In ACL, 2006.

Page 62: PhD thesis defense of Christopher Thomas

Summary Fact extraction

• Pattern-based fact extraction with generalization and Pertinence achieves competitive precision and recall while being computationally feasible for large-scale extraction
  – Pertinence computation can also be a preprocessing step for other ML techniques
• Different types of background knowledge incorporated into one statistical framework
  – Combined Language model and Semantic model

Page 63: PhD thesis defense of Christopher Thomas

Application and Knowledge Validation

Scooner: Semantic browsing and retrieval – Evaluation in Use

Doozer++
– Hierarchy extraction
– Pattern-based fact extraction

• 18 Million MedLine publications/abstracts
• UMLS Metathesaurus
• Wikipedia

Example: Domain model as a basis for research in the area of human cognitive performance.

Page 64: PhD thesis defense of Christopher Thomas


Domain Definition – Extracted Hierarchy

A hierarchy extracted for a cognitive science domain model.

The keyword description given to the system was a collection of terms relevant to human performance and cognition.

Page 65: PhD thesis defense of Christopher Thomas


Domain Description: Connect Concepts


Page 66: PhD thesis defense of Christopher Thomas

Expert Evaluation of Facts in the Model

[Figure: fraction (y-axis) by score 1 to 9 (x-axis); series: Fraction in bin, Cumulative incorrect, Cumulative correct, Cumulative interesting]

1-2: Information that is overall incorrect

3-4: Information that is somewhat correct

5-6: Correct general information

7-9: Correct information not commonly known

Page 67: PhD thesis defense of Christopher Thomas


Extractor Confidence vs. Correctness

• Analysis shows that highest quality extractions have the highest confidence, but also incorrectly extracted facts have high confidence

High-quality patterns as well as some noise-patterns have high indicative power.

Page 68: PhD thesis defense of Christopher Thomas


Extractor Confidence vs. Correctness

• Many facts deemed interesting were extracted based on highly specialized patterns in the long tail of the frequency distribution.

• Noisy patterns also tend to occupy this space

Page 69: PhD thesis defense of Christopher Thomas

Sources of Errors

• Extracted relationship too specific or formally incorrect but metaphorically correct.
  – <Interpeduncular_Cistern disease_has_associated_anatomic_site Cerebral_peduncle> is incorrect
    • Interpeduncular Cistern is not a disease. However, it does have the associated anatomic site Cerebral peduncle.
• Incorrect directionality
  – <Pituitary_Gland sends_output_to Supraoptic_nucleus> should be <Supraoptic_nucleus sends_output_to Pituitary_Gland>
    • Direction in text often expressed in the context rather than the immediate pattern

Page 70: PhD thesis defense of Christopher Thomas

Validation

• Extracted statements need to be validated to be considered knowledge
  – Explicit validation, e.g. thumbs up/down
  – Implicit validation, e.g. by analyzing click streams

Page 71: PhD thesis defense of Christopher Thomas

Explicit Validation

• Certainty of reference
  – I.e. we know exactly which statement was validated
• Validator credentials can be obtained
  – E.g. a small community of experts may evaluate
• Extra work
  – Explicit validation is a task that is consciously performed

Page 72: PhD thesis defense of Christopher Thomas

Implicit Validation

• Find indications of correctness or incorrectness based on the way the users interact with the presented information
  – Every action taken on a piece of information is recorded and analyzed
  – The cumulative behavior of the users gives an indication of which propositions are correct or interesting

Page 73: PhD thesis defense of Christopher Thomas

Implicit Validation

• Examples for implicit community-validation
  – Games with a purpose (L. von Ahn)
  – Google search rankings
• Scooner semantic browser
  – Browse literature along facts in a model
  – Browsing trails suggest correct extraction

Page 74: PhD thesis defense of Christopher Thomas

Implicit Validation

• A fact is browsed very often by different users.
  – The fact is interesting to many users.
  – The fact is surprising and interesting, but may be incorrect.
• A user follows a trail of multiple fact-triples through a variety of documents.
  – The facts that were browsed have a high probability of being correct and support is added to the triples.
  – If the trail was longer than suggested by a small-world phenomenon, initial triples may have been incorrect, but led to interesting ones. For this reason, only the last k triples of the trail should garner support or the support should increase for the last k triples in the trail.
  – The last triple in the trail may have been incorrect and led to browsing results that caused the user to stop browsing. For this reason, the last triple of the trail should be treated with caution.
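The trail heuristics above can be illustrated with a short Python sketch. It is one possible reading of the slide, not Scooner's actual scoring: the parameters `k` and `last_weight`, their default values and the function name are hypothetical.

```python
from collections import Counter


def add_trail_support(support, trail, k=3, last_weight=0.5):
    """Credit the last k triples of a browsing trail, with less weight on the final one.

    support:     Counter mapping fact triple -> accumulated support
    trail:       list of fact triples in the order the user browsed them
    k:           how many trailing triples receive support (assumed value)
    last_weight: reduced credit for the final triple, which may have ended the session
    """
    if not trail:
        return support
    tail = trail[-k:]
    for triple in tail[:-1]:
        support[triple] += 1.0
    support[tail[-1]] += last_weight
    return support


# A made-up trail of four fact triples.
t1, t2, t3, t4 = [("s%d" % i, "rel", "o%d" % i) for i in range(1, 5)]
support = add_trail_support(Counter(), [t1, t2, t3, t4], k=3)
print(support)
```

Early triples outside the last k receive no credit, reflecting the point that they may have been incorrect yet still led the user to interesting facts.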

Page 75: PhD thesis defense of Christopher Thomas


Validation “through use”

Enter search terms

Choose entity of interest

Browse extracted facts

Choose relevant literature that

supports the fact

Page 76: PhD thesis defense of Christopher Thomas


Validation “through use”

Fact trails are recorded

Find another interesting fact

Page 77: PhD thesis defense of Christopher Thomas

Validation “through use”

Path suggests that at least the first 2 triples are factually correct

Page 78: PhD thesis defense of Christopher Thomas

Browsed Facts Examples

Page 79: PhD thesis defense of Christopher Thomas

Related work

• Evaluation and Use
– E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’06, page 19, 2006.
– A. Das, M. Datar, A. Garg, and S. Rajaram. Google News Personalization: Scalable Online Collaborative Filtering. In Proceedings of the 16th international conference on World Wide Web, page 280. ACM, 2007.

Page 80: PhD thesis defense of Christopher Thomas

Summary Knowledge Acquisition

• The model actually reflects what the user is interested in at the point of creation
  Willingness to help validate facts
– Applications allow for implicit and explicit evaluation
• Validated statements can be merged with existing knowledge
  Automated acquisition completed
  Individual-driven KA improved overall system

• R. Kavuluru, C. Thomas et al. An Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused Bioscience Domains. IHI 2012.
• Amit Sheth, Christopher Thomas, Pankaj Mehra, 'Continuous Semantics to Analyze Real-Time Data', IEEE IC, Nov./Dec. 2010.
• C. Thomas et al. Improving Linked Open Data through On-Demand Model Creation. Web Science Conference, 2010.
• C. Thomas et al. Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction. WIC 2008.

Page 81: PhD thesis defense of Christopher Thomas

Future Directions

• Active Learning to improve classification
– Easy in tightly connected system (e.g. NELL)
– Feedback mechanism for loosely connected systems

• Improve depth of classification
– Augment Domain Description with learned concept hierarchies from text (e.g. Navigli)

• Knowledge management for background knowledge
– Belief updates
– Model evolution

Page 82: PhD thesis defense of Christopher Thomas

Contributions

Conceptual Knowledge: Ontologies, LoD

Taxonomy extraction [WI1, WebSci, WebSem1]
Event modeling [IEEE-IC]
Relationship/Fact/Event extraction [IHI, WebSem1, IEEE-IC, WebSci]

Information Quality [WI2]
Social processes for content creation [CHB]

Textual Information: Wikipedia, Web

Knowledge merging/Ontology alignment [AAAI, WebSem2, SWSWPC]

Social processes for knowledge validation [IHI, WebSci, CHB]

Knowledge Representation [IJSWIS, CR, FLSW]
Ontology design [WWW, FOIS]

Page 83: PhD thesis defense of Christopher Thomas

Journal/Conference Publications

[WebSem1] C. Thomas, P. Mehra, A. Sheth, W. Wang, G. Weikum. Automatic domain model creation using pattern-based fact extraction. Submitted to Journal of Web Semantics.

[IHI] R. Kavuluru, C. Thomas, A. Sheth, V. Chan, W. Wang, A. Smith, A. Sato and A. Walters. An Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused Bioscience Domains. IHI 2012 - 2nd ACM SIGHIT International Health Informatics Symposium, January 28-30, 2012.

[IEEE-IC] Amit Sheth, Christopher Thomas, Pankaj Mehra, 'Continuous Semantics to Analyze Real-Time Data', IEEE Internet Computing, vol. 14, no. 6, pp. 84-89, Nov./Dec. 2010, doi:10.1109/MIC.2010.137

[WebSci] C. Thomas, W. Wang, P. Mehra and A. Sheth. What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation. Web Science Conference, 2010.

[WI1] C. Thomas, P. Mehra, R. Brooks, and A. Sheth. Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction. Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on, 1:496–502, 2008.

Page 84: PhD thesis defense of Christopher Thomas

Journal/Conference Publications

[WI2] C. Thomas and A. Sheth. Semantic Convergence of Wikipedia Articles. In Proceedings of the 2007 IEEE/WIC International Conference on Web Intelligence, pages 600–606, Washington, DC, USA, November 2007. IEEE Computer Society.

[WWW] S. S. Sahoo, C. Thomas, A. Sheth, W. S. York, and S. Tartir. Knowledge Modeling and its Application in Life Sciences: A Tale of two Ontologies. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pages 317–326, New York, NY, USA, 2006. ACM Press.

[FOIS] C. Thomas, A. Sheth, and W. York. Modular Ontology Design Using Canonical Building Blocks in the Biochemistry Domain. In Proceeding of the 2006 conference on Formal Ontology in Information Systems: Proceedings of the Fourth International Conference (FOIS 2006), pages 115–127, Amsterdam (NL), 2006. IOS Press.

[AAAI] P. Doshi and C. Thomas. Inexact matching of ontology graphs using expectation-maximization. In AAAI’06: proceedings of the 21st national conference on Artificial intelligence, pages 1277–1282. AAAI Press, 2006.

Page 85: PhD thesis defense of Christopher Thomas

Publications

[CHB] C. Thomas and A. Sheth. Web Wisdom - An Essay on How Web 2.0 and Semantic Web can foster a Global Knowledge Society. Computers in Human Behavior, Elsevier.

[WebSem2] P. Doshi, R. Kolli, and C. Thomas. Inexact matching of ontology graphs using expectation-maximization. Web Semantics: Science, Services and Agents on the World Wide Web, 7(2):90–106, 2009.

[IJWGS] V. Kashyap, C. Ramakrishnan, C. Thomas, and A. Sheth. Taxaminer: an experimentation framework for automated taxonomy bootstrapping. International Journal of Web and Grid Services, 1(2):240–266, 2005.

[IJSWIS] A. P. Sheth, C. Ramakrishnan, and C. Thomas. Semantics for the semantic web: The implicit, the formal and the powerful. Int. J. Semantic Web Inf. Syst., 1(1):1–18, 2005.

[CR] S. Sahoo, C. Thomas, A. Sheth, C. Henson, and W. York. GLYDE - an expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340(18):2802–2807, 2005.

Page 86: PhD thesis defense of Christopher Thomas

Other Publications

Workshop Publications

[SWLS] A. Sheth, W. York, C. Thomas, M. Nagarajan, J. Miller, K. Kochut, S. Sahoo, and X. Yi. Semantic Web technology in support of Bioinformatics for Glycan Expression. In W3C Workshop on Semantic Web for Life Sciences, pages 27–28, 2004.

[SWSWPC] N. Oldham, C. Thomas, A. Sheth, and K. Verma. METEOR-S Web Service Annotation Framework with Machine Learning Classification. Semantic Web Services and Web Process Composition, pages 137–146, 2005, Springer.

Book Chapters

[FLSW] C. Thomas and A. Sheth. On the expressiveness of the languages for the semantic web - making a case for a little more. Fuzzy Logic and the Semantic Web, pages 3–20, 2006.

Patent

[PAT] P. Mehra, R. Brooks and C. Thomas. ONTOLOGY CREATION BY REFERENCE TO A KNOWLEDGE CORPUS. Pub.No. US 2010/0280989 A1

Page 87: PhD thesis defense of Christopher Thomas

• Research
– KR
– Domain model extraction / IE

• Collaborations
– Complex Carbohydrate Research Center at UGA
– HP Labs Palo Alto
– Human Performance Directorate, AFRL

• Proposals
– HP Incubation & Innovation grant for Doozer++
– AFRL grant largely based on Doozer++
– NSF proposal submitted with “very good” reviews

• Tools and Ontologies
– GlycO
– GlycoViz
– Doozer++
– Scooner

Page 88: PhD thesis defense of Christopher Thomas

Thank you!

Gerhard Weikum

Shaojun Wang

Pascal Hitzler

Pankaj Mehra

Amit Sheth

Thanks to all Kno.e.sis Center Members, Past and Present

Page 89: PhD thesis defense of Christopher Thomas

Thank you
