
Knowledge Enabled Information and Services Science

Knowledge Acquisition on the Web

Growing the amount of available knowledge from within

Christopher Thomas


Overview

• Knowledge Representation
– GlycO: Complex Carbohydrates domain ontology
• Information Extraction
– Taxonomy creation (Doozer/Taxonom.com)
– Fact Extraction (Doozer++)
• Validation

Circle of knowledge on the Web

Goal: Harness the Wisdom of the Crowds to automatically model a domain, verify the model, and give the verified knowledge back to the community

What is knowledge?

How do we turn propositions/beliefs into knowledge?

How do we acquire knowledge?

Background Knowledge

[15] Christopher Thomas and Amit Sheth, “On the Expressiveness of the Languages for the Semantic Web–Making a Case for ‘A Little More,’” in Fuzzy Logic and the Semantic Web, Elie Sanchez (Ed.), Elsevier, 2006.

[11] Amit Sheth, Cartic Ramakrishnan, and Christopher Thomas, “Semantics for The Semantic Web: the Implicit, the Formal and the Powerful,” International Journal on Semantic Web & Information Systems, 1 (no. 1), 2005, pp. 1–18.

Different Angles

• Social construction
– Large-scale creation of knowledge
vs.
– Small communities define their domains
• Normative vs. Descriptive = Top-Down vs. Bottom-Up
• Formal vs. Informal = Machine-readable vs. human-readable

Community-created knowledge

• Descriptive
• Bottom-up
• Formally less rigid
• May contain false information
• If a statement in the world is in conflict with the ontology, both may be wrong or both may be right
• Good for broad, shallow domains
• Good for human processing and IR tasks

Wikipedia and Linked Open Data

• Created by large communities
• Constantly growing
• Domains within the linked data are not always easily discernible
• Contain few axioms and restrictions
– Of little value for logic-based evaluation

Formal - Modeling deep domains

• Prescriptive / Normative
• Top-down
• Contains “true knowledge”
• If a statement in the world is in conflict with the ontology, the statement is false
• Good for scientific domains
• Good for computational reasoning/inference
• Usually created by small communities of experts
• Usually static, little change is expected

Example: GlycO

• Created in collaboration with the Complex Carbohydrate Research Center at the University of Georgia on an NCRR grant.

• Deep modeling of glycan structures and metabolic pathways

[6] Christopher Thomas, Amit P. Sheth, and William S. York, “Modular Ontology Design Using Canonical Building Blocks in the Biochemistry Domain,” in Formal Ontology in Information Systems (FOIS 2006).

[5] Satya S. Sahoo, Christopher Thomas, Amit P. Sheth, William York, and Samir Tartir, “Knowledge Modeling and Its Application in Life Sciences: A Tale of Two Ontologies,” 15th International World Wide Web Conference (WWW2006).


N-Glycosylation metabolic pathway

GNT-I attaches GlcNAc at position 2

UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2

GNT-V attaches GlcNAc at position 6

UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021

N-acetyl-glucosaminyl_transferase_V

N-glycan_beta_GlcNAc_9

N-glycan_alpha_man_4

N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15, 2003: 235-251

b-D-Manp-(1-6)+
              |
              b-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-D-GlcNAc
              |
b-D-Manp-(1-3)+

Glycan Structures for the ontology

• Import structures from heterogeneous databases

• Possible connections modeled in the form of GlycoTree

• Match structures to archetypes

Interplay of extraction and evaluation

• Errors in the source databases are propagated into the newer databases derived from them, so comparing multiple sources fails as a means of error correction

• Even less than 2% incorrect information makes a database useless for automatic validation of hypotheses

• The ontology contains rules on how carbohydrate structures are known to be composed

• By mapping information in databases to the ontology and analyzing how successful the mapping was, we can identify possible errors.

Database Verification using GlycO

N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15, 2003: 235-251

b-D-Manp-(1-6)+
              |
              a-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-D-GlcNAc
              |
b-D-Manp-(1-3)+

a-D-Manp-(1-4) is not part of the identified canonical structure for N-Glycans, hence it is likely that the database entry is incorrect
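The check described here can be sketched as a lookup of each residue of a database entry against the residues the canonical structure allows. This is only a minimal illustration: the flat `CANONICAL_RESIDUES` set and the `find_suspect_residues` helper are hypothetical stand-ins for the much richer GlycoTree rules in GlycO, and the allowed set simply lists the residues of the first entry shown earlier.

```python
# Hypothetical whitelist of residue/linkage labels. In GlycO this role is
# played by the canonical GlycoTree structure, not by a flat set.
CANONICAL_RESIDUES = {
    "b-D-Manp-(1-6)",
    "b-D-Manp-(1-4)",
    "b-D-GlcpNAc-(1-4)",
    "D-GlcNAc",
    "b-D-Manp-(1-3)",
}


def find_suspect_residues(entry_residues, canonical=CANONICAL_RESIDUES):
    """Return the residues of a database entry that the canonical set does not allow."""
    return [residue for residue in entry_residues if residue not in canonical]


# Residues of the database entry shown on this slide.
entry = [
    "b-D-Manp-(1-6)",
    "a-D-Manp-(1-4)",
    "b-D-GlcpNAc-(1-4)",
    "D-GlcNAc",
    "b-D-Manp-(1-3)",
]
suspects = find_suspect_residues(entry)
print(suspects)  # ['a-D-Manp-(1-4)']
```

A non-empty result does not prove the entry wrong; it marks it as a candidate for review, which matches the "likely incorrect" wording above.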

Pathway Steps - Reaction

Evidence for this reaction from three experiments

Pathway visualization tool by M. Eavenson and M. Janik, LSDIS Lab, Univ. of Georgia

Summary - GlycO

• The level of accuracy and detail found in ontologies such as GlycO could most likely not be acquired automatically

• Only a small community of experts has the depth of knowledge to model such scientific ontologies

Summary - GlycO

• However, the automatic population shows that a highly restrictive, expert-created rule set allows for automation or involvement of larger communities.

• Frame-based population of knowledge
• The formal knowledge encoded in the ontology serves to acquire new knowledge
• The circle is completed

Summary Background Knowledge

• Large amounts of information and knowledge are available

• Some machine-readable by default
• Others need specific algorithms to extract information
• The more available information we can use, the better the extraction of new information will be.

What is knowledge?

How do we turn propositions into knowledge?

Part 2

Model Creation

[3] Christopher Thomas, Pankaj Mehra, Roger Brooks and Amit Sheth. Growing Fields of Interest -Using an Expand and Reduce Strategy for Domain Model Extraction. Web Intelligence 2008, pp. 496-502

[2] Christopher Thomas, Wenbo Wang, Delroy Cameron, Pablo Mendes, Pankaj Mehra and Amit Sheth, What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation, WebScience 2010

[1] Christopher Thomas, Pankaj Mehra, Wenbo Wang, Amit Sheth, Gerhard Weikum and Victor Chan, Automatic Domain Model Creation Using Pattern-Based Fact Extraction, Knoesis Center technical report.

Knowledge Acquisition through

First create a domain hierarchy

Example: a hierarchy for the domain of Human Performance and Cognition

Connect with learned facts

Example: strongly connected component

Excerpt: strongly connected component

Expert evaluation of facts in the ontology

1-2: Information that is overall incorrect

3-4: Information that is somewhat correct

5-6: Correct general information

7-9: Correct information not commonly known

Technical Details

Domain hierarchy creation

• Input terms e.g. related to Human Performance and Cognition

• Hierarchy is automatically carved from articles and categories on Wikipedia

Step 1

Overview - conceptual

• Expand and Reduce approach
– Start with “high recall” methods
  • Exploration: full-text search
  • Exploitation: node similarity method
  • Category growth
– End with “high precision” methods
  • Apply restrictions on the concepts found
  • Remove unwanted terms and categories

Graph-based expansion

Expand - conceptually

Full text search on Article texts

Delete results with low confidence score

Collecting Instances

Creating a Hierarchy

Step 2: Pattern-Based Relationship Extraction

Extracting meaningful relationships by macro-reading free text

Extracting from Plain text or hypertext

• Informal, human-readable presentation of information

• Vast amounts of information available
– Web
– Scientific publications
– Encyclopediae

• Need sophisticated algorithms to extract information

Pattern-based Fact Extraction

• Learn textual patterns that express known relationship types

• Search the text corpus for occurrences of known entities (e.g. from domain hierarchy)

• Semi-open
– Types are known and limited
– Types are automatically expanded when LOD grows
• Vector-Space Model
• Probabilistic representation

Training

• Relationship data in the UMLS Metathesaurus or the Wikipedia Infobox data provide a large set of facts in RDF triple format
– Limited set of relationships that can be arranged in a schema
– Semi-open
  • Types are known and limited
  • Types are automatically expanded when LOD grows

Training procedure

• Iterate through all facts (S->P->O triples)
• Find evidence for the fact in a corpus
– Wikipedia, WWW, PubMed or any other collection
– If triple subject and triple object occur in close proximity in text, add the pattern in between to the learned patterns

• Combined evidence from many different patterns increases the certainty of a relationship between the entities
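The training loop above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Doozer++ implementation: the `learn_patterns` function, the `max_gap` character window as a notion of "close proximity", and the toy sentences are all hypothetical, and the sketch only matches literal subject and object strings (it would not catch adjectival forms such as "Australian").

```python
import re
from collections import Counter


def learn_patterns(triples, sentences, max_gap=60):
    """Collect the text between subject and object for every fact found in a sentence.

    Returns a Counter keyed by (predicate, pattern), where the pattern replaces
    the subject with X and the object with Y.
    """
    learned = Counter()
    for subject, predicate, obj in triples:
        # Non-greedy gap of 1..max_gap characters between subject and object.
        regex = re.compile(
            re.escape(subject) + r"(.{1,%d}?)" % max_gap + re.escape(obj)
        )
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                learned[(predicate, "X" + match.group(1) + "Y")] += 1
    return learned


triples = [("Canberra", "capital_of", "Australia")]
sentences = [
    "Canberra, capital of Australia, lies inland.",
    "Canberra, capital of the Commonwealth of Australia, lies inland.",
    "Sydney is a city in Australia.",
]
patterns = learn_patterns(triples, sentences)
for (predicate, pattern), count in patterns.items():
    print(predicate, "|", pattern, "|", count)
```

Counting per (predicate, pattern) pair is what later allows evidence from several patterns to be combined into one confidence value for a relationship.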

Overview – initial computations

[Figure: pipeline diagram with the labels Fact Collection, Text Corpus, Matrix Computations (CP2P, R2P), Modifications * (Entropy, SVD/LSI, Pertinence), CP2Pmod, *R2Pmod]

Training procedure cont’d

Fact: Canberra :: Australia

Text occurrences:

Canberra, the Australian capital city

Canberra, capital of the Commonwealth of Australia

Canberra, the Australian capital

Learned patterns:

<Subject>, the <Object> capital city

<Subject>, capital of the Commonwealth of <Object>

<Subject>, the <Object> capital

Relationship Patterns

Relation | X, the Y capital city | X, capital of the Commonwealth of Y | X, the Y capital
Capital_of | 1 | 1 | 1

Relation | X, the Y capital city | X, capital of Y | X, the Y capital
Capital_of | 1 | 1 | 1

Extracted Synonyms

Relation | X, the Y capital * | X, capital of Y
Capital_of | 2 | 1

Generalize

Pattern counts:

Relation | X, the Y capital * | X, capital of Y | X, * * Y | X, predecessor of Y
Capital_of | 2 | 2 | 2 | 0
predecessor | 0 | 0 | 2 | 2

Normalized per pattern:

Relation | X, the Y capital * | X, capital of Y | X, * * Y | X, predecessor of Y
Capital_of | 1.0 | 1.0 | 0.5 | 0
predecessor | 0 | 0 | 0.5 | 1.0
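The step from the count matrix to the second matrix is consistent with dividing every count by its column total, so that each pattern's column sums to 1. A minimal sketch, with `normalize_columns` as a hypothetical helper name and the counts copied from the slide:

```python
def normalize_columns(counts):
    """Divide each count by its column (pattern) total.

    counts maps a relation name to a list of per-pattern counts; all lists
    must have the same length. Columns that are all zero stay zero.
    """
    relations = list(counts)
    n_patterns = len(counts[relations[0]])
    totals = [sum(counts[rel][j] for rel in relations) for j in range(n_patterns)]
    return {
        rel: [
            counts[rel][j] / totals[j] if totals[j] else 0.0
            for j in range(n_patterns)
        ]
        for rel in relations
    }


# Columns: "X, the Y capital *", "X, capital of Y", "X, * * Y", "X, predecessor of Y"
counts = {
    "Capital_of": [2, 2, 2, 0],
    "predecessor": [0, 0, 2, 2],
}
probs = normalize_columns(counts)
print(probs["Capital_of"])   # [1.0, 1.0, 0.5, 0.0]
print(probs["predecessor"])  # [0.0, 0.0, 0.5, 1.0]
```

The generic pattern "X, * * Y" ends up with 0.5 for both relations, which reflects that it does not discriminate between them.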

Resolve Relationships

Relation | X, the Y capital * | X, capital of Y | X, * * Y | X, predecessor of Y
Capital_of | 1.0 | 1.0 | 0.5 | 0
predecessor | 0 | 0 | 0.5 | 1.0

Pattern weights:

0.5 X, the Y capital *
0.25 X, capital of Y
0.25 X, * * Y
0 X, predecessor of Y

Pattern weights × matrix:

Capital_of | predecessor
0.875 | 0.125
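The resulting scores are the weighted sum of each relation's row with the pattern weights, for example 0.5·1.0 + 0.25·1.0 + 0.25·0.5 + 0·0 = 0.875 for Capital_of. A minimal sketch using the numbers from the slide; the function name `score_relations` is hypothetical:

```python
def score_relations(pattern_weights, matrix):
    """Return, per relation, the dot product of the pattern weights with its row."""
    return {
        relation: sum(w * p for w, p in zip(pattern_weights, row))
        for relation, row in matrix.items()
    }


# Columns: "X, the Y capital *", "X, capital of Y", "X, * * Y", "X, predecessor of Y"
matrix = {
    "Capital_of": [1.0, 1.0, 0.5, 0.0],
    "predecessor": [0.0, 0.0, 0.5, 1.0],
}
pattern_weights = [0.5, 0.25, 0.25, 0.0]

scores = score_relations(pattern_weights, matrix)
print(scores)  # {'Capital_of': 0.875, 'predecessor': 0.125}
```

Ranking the relations by this score gives the Rank 1, Rank 2, ... ordering used in the output tables.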

Advanced Computations

[Figure: pipeline diagram with the labels Fact Collection, Text Corpus, Matrix Computations (CP2P, R2P), Modifications * (Entropy, SVD/LSI, Pertinence), CP2Pmod, *R2Pmod]

• LSI to determine relationship similarities
– Reduces sparsity in the matrix and makes relationship rows more comparable
– Allows better use of pertinence computation
• Entropy
– Increase weights for more unique patterns
• Pertinence
– Smoothing of pattern occurrence frequencies
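The slides do not give the entropy formula. One common way to "increase weights for more unique patterns" is a normalized-entropy weight, sketched below as an assumption rather than as the method actually used: a pattern seen with only one relation gets weight 1, and a pattern spread evenly over all relations gets weight 0. The name `entropy_weight` is hypothetical.

```python
import math


def entropy_weight(column):
    """Return 1 - H(column) / log(n) for one pattern's counts over n relations."""
    total = sum(column)
    n = len(column)
    if total == 0 or n < 2:
        return 0.0  # no evidence, or nothing to discriminate between
    probabilities = [count / total for count in column if count > 0]
    entropy = -sum(p * math.log(p) for p in probabilities)
    return 1.0 - entropy / math.log(n)


# Counts per relation (Capital_of, predecessor) for two patterns.
specific = entropy_weight([2, 0])  # e.g. "X, capital of Y"
generic = entropy_weight([2, 2])   # e.g. "X, * * Y"
print(specific, generic)
```

Multiplying each pattern column by such a weight before scoring would reduce the influence of generic patterns like "X, * * Y".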

Example Output (DBPedia)

Subject :: Object | Extracted Rank 1 (Rel;Confidence) | Rank 2 | Rank 3
Howard Pawley :: Gary Filmon | successor;0.799 | after;0.768 | office;0.686
Species Deceases :: Midnight Oil | producer;0.761 | artist;0.719 | genre;0.467
The Crystal City :: Orson Scott Card | artist;0.625 | author;0.617 | writer;0.583
Horatio Allen :: William Maxwell | predecessor;0.629 | before;0.475 |
Basdeo Panday :: Trinidad & Tobago | deathPlace;0.658 | birthplace;0.658 | nationality;0.330
Beccles railway station :: Suffolk | district;0.772 | borough;0.770 | friend;0.749

Pertinence for Relations

• Looking at fact extraction as a classification of concept pairs into classes of relations

• Class boundaries are not clear-cut
• E.g. has_physical_part and has_part
• Don’t punish the occurrence of the same pattern with relationship types that are similar

Relation | X, the Y capital * | X, capital of Y | X, * * Y | X, located in Y
Capital_of | 2 | 2 | 2 | 2
Located_in | 0 | 0 | 2 | 4

Relation | X, the Y capital * | X, capital of Y | X, * * Y | X, located in Y
Capital_of | 1.0 | 1.0 | 0.2 | 0.5
Located_in | 0 | 0 | 0.2 | 0.9

Pattern weights:

0.4 X, the Y capital *
0.1 X, capital of Y
0.3 X, * * Y
0.2 X, located in Y

Pattern weights × matrix:

Capital_of | Located_in
0.66 | 0.24

Evaluation of the fact extraction - DBPedia

[Plot; axis label: Confidence Threshold]

Strict evaluation: Only the 1st-ranked extracted relation is compared to the gold standard. Averaged over relation types.

60% training set, 40% testing, DBPedia Infobox fact corpus, Wikipedia text corpus
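The strict evaluation can be sketched as follows. The sketch rests on assumptions not stated on the slide: extractions whose top confidence falls below the threshold are skipped, and "averaged over relation types" is read as a macro-average of per-type precision, grouped by the predicted relation. The function name and the toy concept pairs are hypothetical.

```python
from collections import defaultdict


def strict_macro_precision(predictions, gold, threshold):
    """Macro-averaged precision of the top-ranked relation per concept pair.

    predictions: {pair: [(relation, confidence), ...]}, best-ranked first
    gold:        {pair: relation}
    """
    attempted = defaultdict(int)
    correct = defaultdict(int)
    for pair, ranked in predictions.items():
        if not ranked or pair not in gold:
            continue
        relation, confidence = ranked[0]  # strict: only rank 1 counts
        if confidence < threshold:
            continue
        attempted[relation] += 1
        if relation == gold[pair]:
            correct[relation] += 1
    if not attempted:
        return 0.0
    return sum(correct[r] / attempted[r] for r in attempted) / len(attempted)


# Toy data, not taken from the evaluation.
predictions = {
    ("A", "B"): [("capital_of", 0.9), ("located_in", 0.6)],
    ("C", "D"): [("capital_of", 0.7)],
    ("E", "F"): [("located_in", 0.8)],
    ("G", "H"): [("located_in", 0.3)],
}
gold = {
    ("A", "B"): "capital_of",
    ("C", "D"): "located_in",
    ("E", "F"): "located_in",
    ("G", "H"): "capital_of",
}
print(strict_macro_precision(predictions, gold, 0.5))   # 0.75
print(strict_macro_precision(predictions, gold, 0.75))  # 1.0
```

Sweeping `threshold` over a range of values yields one precision value per confidence threshold.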

Evaluation of the fact extraction - UMLS

[Plot; axis label: Confidence Threshold]

Strict evaluation: Only the 1st-ranked extracted relation is compared to the gold standard. Averaged over relation types.

60% training set, 40% testing, UMLS fact corpus, MedLine text corpus

Manual Evaluation strategy (DBPedia)

Score | Subject :: Object | Suggested relationship | Extracted Rank 1 (Rel;Confidence) | Rank 2 | Rank 3
1 | Howard Pawley :: Gary Filmon | after | successor;0.799 | after;0.768 | office;0.686
0.5 | Mulan :: Tarzan | after | nextSingle;0.603 | followedBy;0.533 | after;0.416
 | Species Deceases :: Midnight Oil | artist | producer;0.761 | artist;0.719 | genre;0.467
1 | The Crystal City :: Orson Scott Card | author | artist;0.625 | author;0.617 | writer;0.583
1 | Horatio Allen :: William Maxwell | before | predecessor;0.629 | before;0.475 |
1 | Basdeo Panday :: Trinidad & Tobago | birthplace | deathPlace;0.658 | birthplace;0.658 | nationality;0.330
1 | Bob Nystrom :: Stockholm | birthplace | cityOfBirth;0.677 | birthplace;0.513 |
1 | Beccles railway station :: Suffolk | borough | district;0.772 | borough;0.770 | friend;0.749

Manual Evaluation strategy (UMLS)

poisoning, fluoride :: teeth [finding_site_of] finding_site_of 1
polyneuritis, endemic :: vitamin b 1 [associated_with] has_form 0
polyp of cervix nos (disorder) :: 768 polyps [associated_with] associated_with 1
polyp of cervix nos (disorder) :: neck of uterus [location_of] finding_site_of 1
polyp of colon :: benign neoplasms [related_to] associated_with 0.5
brain :: brain contusion [has_location] associated_morphology_of 0.25
brain :: brain ischemia [has_finding_site] location_of 0.5
polyp of colon :: gastrointestinal tract, nos [is_primary_anatomic_site_of_disease] location_of 0.5
polyvesicular vitelline tumor :: gamete structure (cell structure) [is_normal_cell_origin_of_disease] is_normal_cell_origin_of_disease 1
proptosis :: apert syndrome [has_manifestation] has_manifestation 1

Manually evaluated precision for different confidence values

Manually evaluated precision, confidence > 0.5 (on UMLS – MedLine corpus)

UMLS - Pert - Ent

Summary Model Creation

• Using background knowledge in the form of a fact corpus and a text corpus, we can suggest new facts/propositions

• Possible to try all combinations of known concepts (e.g. Read-the-Web project), but huge validation backlog

• Letting users drive the model creation focuses the creation on the parts that are of common interest

→ Willingness to help validate facts

What is knowledge?

How do we turn propositions/beliefs into knowledge?

Part 3

Evaluation and Use

Christopher Thomas, Wenbo Wang, Delroy Cameron, Pablo Mendes, Pankaj Mehra and Amit Sheth, What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation, to appear in WebScience 2010

Current Work

Explicit evaluation

• “Evaluate for evaluation’s sake”
– Domain experts rank the value of a proposition
– Committees of experts and/or laymen vote on the correctness of propositions

Explicit evaluation in the Semantic Browser

• The user can vote on facts
• Some facts are presented randomly
• Most facts are presented after the user (by browsing) showed interest in
– The full triple
– Subject/Object of the triple

Implicit evaluation

• Evaluation that does not explicitly involve a vote on the extracted information

• Use the Wisdom of the Crowds
• Users show support for a proposition by performing an action
• Every action taken on a piece of information is recorded and analyzed
• The cumulative behavior of the users gives an indication of which propositions are correct or interesting

Implicit evaluation in the Semantic Browser

• The user simply searches and browses
• The search history and the click-stream provide information about whether a page transition using an extracted triple was successful

• Assumption: on average, a successful trail-browsing session includes valid triples

• Problem: requires extensive use
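One way to turn the click-stream into an implicit vote is to count, per triple, how often it was followed in a session that ended successfully. The sketch below is an illustration only: the slides do not define what makes a session successful, so the boolean `successful` flag, the `min_uses` cutoff, and the function name `triple_support` are all hypothetical.

```python
from collections import defaultdict


def triple_support(sessions, min_uses=2):
    """Return {triple: share of the sessions using it that were successful}.

    sessions: iterable of (triples_followed, successful) tuples.
    Triples used in fewer than min_uses sessions are left out, since a
    handful of observations says little about their validity.
    """
    uses = defaultdict(int)
    successes = defaultdict(int)
    for triples_followed, successful in sessions:
        for triple in set(triples_followed):  # count each triple once per session
            uses[triple] += 1
            if successful:
                successes[triple] += 1
    return {t: successes[t] / uses[t] for t in uses if uses[t] >= min_uses}


# Toy triples and sessions, for illustration only.
t1 = ("Canberra", "capital_of", "Australia")
t2 = ("Canberra", "predecessor", "Australia")
sessions = [
    ([t1], True),
    ([t1, t2], False),
    ([t1], True),
    ([t2], False),
]
support = triple_support(sessions)
print(support[t1], support[t2])
```

The `min_uses` cutoff mirrors the problem noted above: the estimate only becomes meaningful once a triple has been used often enough.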

Implicit evaluation in the Semantic Browser

[Figure labels: 1st triple, 2nd triple, Triples]

Conclusion

• Creating domain models gives a way of selectively adding knowledge to a system

• We showed that it is possible to automatically create such models with high accuracy

• The models immediately impact users → willingness to help evaluate

• Evaluation becomes an integral part of the knowledge lifecycle

Thank you
