Knowledge Enabled Information and Services Science
Knowledge Acquisition on the Web
Growing the amount of available knowledge from within
Christopher Thomas
Overview
• Knowledge Representation
  – GlycO: Complex Carbohydrates domain ontology
• Information Extraction
  – Taxonomy creation (Doozer/Taxonom.com)
  – Fact Extraction (Doozer++)
• Validation
Circle of knowledge on the Web
Goal: Harness the Wisdom of the Crowds to automatically model a domain, verify the model, and give the verified knowledge back to the community
Circle of knowledge on the Web
What is knowledge?
How do we turn propositions/beliefs into knowledge?
How do we acquire knowledge?
Background Knowledge
[15] Christopher Thomas and Amit Sheth, “On the Expressiveness of the Languages for the Semantic Web: Making a Case for ‘A Little More,’” in Fuzzy Logic and the Semantic Web, Eli Sanchez (Ed.), Elsevier, 2006.
[11] Amit Sheth, Cartic Ramakrishnan, and Christopher Thomas, “Semantics for The Semantic Web: the Implicit, the Formal and the Powerful,” International Journal on Semantic Web & Information Systems, 1 (no. 1), 2005, pp. 1–18.
Different Angles
• Social construction
  – Large-scale creation of knowledge
    vs.
  – Small communities define their domains
• Normative vs. Descriptive = Top-Down vs. Bottom-Up
• Formal vs. Informal = Machine-readable vs. human-readable
Community-created knowledge
• Descriptive
• Bottom-up
• Formally less rigid
• May contain false information
• If a statement in the world is in conflict with the Ontology, both may be wrong or both may be right
• Good for broad, shallow domains
• Good for human processing and IR tasks
Wikipedia and Linked Open Data
• Created by large communities
• Constantly growing
• Domains within the linked data are not always easily discernible
• Contain few axioms and restrictions
  – Little value to evaluation using logics
Formal - Modeling deep domains
• Prescriptive / Normative
• Top-down
• Contains “true knowledge”
• If a statement in the world is in conflict with the Ontology, the statement is false
• Good for scientific domains
• Good for computational reasoning/inference
• Usually created by small communities of experts
• Usually static, little change is expected
Example: GlycO
• Created in collaboration with the Complex Carbohydrate Research Center at the University of Georgia on an NCRR grant.
• Deep modeling of glycan structures and metabolic pathways
[6] Christopher Thomas, Amit P. Sheth, and William S. York, “Modular Ontology Design Using Canonical Building Blocks in the Biochemistry Domain,” in Formal Ontology in Information Systems (FOIS 2006).
[5] Satya S. Sahoo, Christopher Thomas, Amit P. Sheth, William York, and Samir Tartir, “Knowledge Modeling and Its Application in Life Sciences: A Tale of Two Ontologies,” 15th International World Wide Web Conference (WWW2006).
GlycO
N-Glycosylation metabolic pathway
GNT-I attaches GlcNAc at position 2
UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2
GNT-V attaches GlcNAc at position 6
UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021
Ontology terms shown: N-acetyl-glucosaminyl_transferase_V, N-glycan_beta_GlcNAc_9, N-glycan_alpha_man_4
Glycan Structures for the ontology
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15, 2003: 235-251
b-D-Manp-(1-6)+
               |
               b-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-D-GlcNAc
               |
b-D-Manp-(1-3)+
• Import structures from heterogeneous databases
• Possible connections modeled in the form of GlycoTree
• Match structures to archetypes
Interplay of extraction and evaluation
• Errors in the source databases are propagated through various new databases, so comparing multiple sources fails as a means of error correction
• Less than 2% incorrect information is enough to make a database useless for automatic validation of hypotheses
• The ontology contains rules on how carbohydrate structures are known to be composed
• By mapping information in databases to the ontology and analyzing how successful the mapping was, we can identify possible errors.
Database Verification using GlycO
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15, 2003: 235-251
b-D-Manp-(1-6)+
               |
               a-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-D-GlcNAc
               |
b-D-Manp-(1-3)+
a-D-Manp-(1-4) is not part of the identified canonical structure for N-Glycans, hence it is likely that the database entry is incorrect.
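The check described above can be sketched in a few lines. This is only an illustration, assuming a hypothetical set of allowed residue/linkage pairs (CANONICAL) and the linear string notation used on the slide; GlycO itself encodes these rules in the ontology, not in a Python set.

```python
import re

# Hypothetical stand-in for the canonical N-glycan core rules; a real check
# would map the database entry onto the GlycO ontology instead.
CANONICAL = {
    ("b-D-Manp", "1-6"),
    ("b-D-Manp", "1-4"),
    ("b-D-Manp", "1-3"),
    ("b-D-GlcpNAc", "1-4"),
}

# Matches a residue followed by its linkage, e.g. "a-D-Manp-(1-4)".
RESIDUE = re.compile(r"([ab]-D-[A-Za-z]+)-\((\d-\d)\)")

def suspicious_residues(structure: str) -> list[tuple[str, str]]:
    """Return residue/linkage pairs that do not map onto the canonical set."""
    return [pair for pair in RESIDUE.findall(structure) if pair not in CANONICAL]

db_entry = "b-D-Manp-(1-6)+ | a-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-D-GlcNAc | b-D-Manp-(1-3)+"
print(suspicious_residues(db_entry))  # [('a-D-Manp', '1-4')]
```

An empty result means every residue could be mapped; a non-empty result flags the entry for review, which mirrors the "analyze how successful the mapping was" idea.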
Pathway Steps - Reaction
Evidence for this reaction from three experiments
Pathway visualization tool by M. Eavenson and M. Janik, LSDIS Lab, Univ. of Georgia
Summary - GlycO
• The level of accuracy and detail found in ontologies such as GlycO could most likely not be acquired automatically
• Only a small community of experts has the depth of knowledge to model such scientific ontologies
Summary - GlycO
• However, the automatic population shows that a highly restrictive, expert-created rule set allows for automation or involvement of larger communities.
• Frame-based population of knowledge
• The formal knowledge encoded in the ontology serves to acquire new knowledge
• The circle is completed
Summary Background Knowledge
• Large amounts of information and knowledge are available
• Some machine readable by default
• Others need specific algorithms to extract information
• The more available information we can use, the better the extraction of new information will be.
Circle of knowledge on the Web
What is knowledge?
How do we turn propositions into knowledge?
Part 2
How do we acquire knowledge?
Knowledge Acquisition through Model Creation
[3] Christopher Thomas, Pankaj Mehra, Roger Brooks and Amit Sheth, Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction, Web Intelligence 2008, pp. 496-502.
[2] Christopher Thomas, Wenbo Wang, Delroy Cameron, Pablo Mendes, Pankaj Mehra and Amit Sheth, What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation, WebScience 2010.
[1] Christopher Thomas, Pankaj Mehra, Wenbo Wang, Amit Sheth, Gerhard Weikum and Victor Chana, Automatic Domain Model Creation Using Pattern-Based Fact Extraction, Knoesis Center technical report.
First create a domain hierarchy
Example: a hierarchy for the domain of Human Performance and Cognition
Connect with learned facts
Example: strongly connected component
Excerpt: strongly connected component
Expert evaluation of facts in the ontology
1-2: Information that is overall incorrect
3-4: Information that is somewhat correct
5-6: Correct general information
7-9: Correct information not commonly known
Technical Details
Step 1: Domain hierarchy creation
• Input terms, e.g. related to Human Performance and Cognition
• Hierarchy is automatically carved from articles and categories on Wikipedia
Overview - conceptual
• Expand and Reduce approach
  – Start with “high recall” methods
    • Exploration: full text search
    • Exploitation: Node Similarity Method
    • Category growth
  – End with “high precision” methods
    • Apply restrictions on the concepts found
    • Remove unwanted terms and categories
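The expand-then-reduce idea can be illustrated on a toy link graph. The sketch below is not the Doozer implementation: the graph (LINKS), the hop-based expansion, and the "linked to at least min_links other candidates" restriction are all assumptions chosen to keep the example small.

```python
# Toy link graph (hypothetical data): article -> linked articles.
LINKS = {
    "Cognition": {"Memory", "Attention", "Psychology"},
    "Memory": {"Cognition", "Attention", "Hippocampus"},
    "Attention": {"Cognition", "Memory"},
    "Psychology": {"Cognition", "Sigmund Freud"},
    "Hippocampus": {"Memory", "Seahorse"},
    "Sigmund Freud": {"Psychology", "Vienna"},
    "Seahorse": {"Hippocampus"},
    "Vienna": {"Sigmund Freud"},
}

def expand(seeds, hops):
    """High-recall step: collect everything within `hops` links of the seeds."""
    found, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {n for node in frontier for n in LINKS.get(node, ())} - found
        found |= frontier
    return found

def reduce_candidates(candidates, seeds, min_links):
    """High-precision step: keep seeds plus candidates that are linked to at
    least `min_links` other candidates."""
    kept = set(seeds)
    for node in candidates:
        if len(LINKS.get(node, set()) & candidates) >= min_links:
            kept.add(node)
    return kept

candidates = expand({"Cognition"}, hops=2)
domain = reduce_candidates(candidates, {"Cognition"}, min_links=2)
print(sorted(domain))  # ['Attention', 'Cognition', 'Memory', 'Psychology']
```

Loosely connected terms such as "Hippocampus" are reached during expansion but dropped again by the restriction, which is the intended division of labour between the two phases.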
Graph-based expansion
Expand - conceptually
Full text search on Article texts
Delete results with low confidence score
Collecting Instances
Creating a Hierarchy
Step 2: Pattern-Based Relationship Extraction
Extracting meaningful relationships by macro-reading free text
Extracting from Plain text or hypertext
• Informal, human-readable presentation of information
• Vast amounts of information available
  – Web
  – Scientific publications
  – Encyclopediae
• Need sophisticated algorithms to extract information
Pattern-based Fact Extraction
• Learn textual patterns that express known relationship types
• Search the text corpus for occurrences of known entities (e.g. from domain hierarchy)
• Semi-open
  – Types are known and limited
  – Types are automatically expanded when LOD grows
• Vector-Space Model
• Probabilistic representation
Training
• Relationship data in the UMLS Metathesaurus or the Wikipedia Infobox data provide a large set of facts in RDF triple format
  – Limited set of relationships that can be arranged in a schema
  – Semi-open
    • Types are known and limited
    • Types are automatically expanded when LOD grows
Training procedure
• Iterate through all facts (S->P->O triples)
• Find evidence for the fact in a corpus
  – Wikipedia, WWW, PubMed or any other collection
  – If triple subject and triple object occur in close proximity in text, add the pattern in-between to the learned patterns
• Combined evidence from many different patterns increases the certainty of a relationship between the entities
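The training loop above can be sketched as follows. This is a simplified illustration, not the actual system: it only captures text between subject and object, matches exact surface strings, and uses a hypothetical max_gap parameter as the "close proximity" window. The second sentence is made up for the example.

```python
import re
from collections import Counter

def learn_patterns(facts, sentences, max_gap=60):
    """Count the text found between subject and object of each known fact."""
    patterns = Counter()
    for subject, predicate, obj in facts:
        # Non-greedy gap of at most `max_gap` characters between the entities.
        regex = re.compile(
            re.escape(subject) + r"(.{1,%d}?)" % max_gap + re.escape(obj)
        )
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                pattern = "<Subject>" + match.group(1) + "<Object>"
                patterns[(predicate, pattern)] += 1
    return patterns

facts = [("Canberra", "capital_of", "Australia")]
sentences = [
    "Canberra, capital of the Commonwealth of Australia",
    "Canberra is the capital of Australia",  # hypothetical corpus sentence
]
learned = learn_patterns(facts, sentences)
for (predicate, pattern), count in learned.items():
    print(predicate, "|", pattern, "|", count)
```

Each (relationship, pattern) count becomes one cell of the relationship-to-pattern matrix used in the later slides.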
Overview – initial computations
[Diagram labels: Fact Collection; Text Corpus; Matrix Computations (CP2P, R2P); Modifications (Entropy, SVD/LSI, Pertinence); CP2Pmod; R2Pmod]
Training procedure cont’d
Fact: Canberra::Australia
Text occurrences:
  Canberra, the Australian capital city
  Canberra, capital of the Commonwealth of Australia
  Canberra, the Australian capital
Learned patterns (count):
  <Subject>, the <Object> capital city (1)
  <Subject>, capital of the Commonwealth of <Object> (1)
  <Subject>, the <Object> capital (1)
Relationship Patterns
Learned patterns:
Relationship | X, the Y capital city | X, capital of the Commonwealth of Y | X, the Y capital
Capital_of | 1 | 1 | 1

After applying extracted synonyms:
Relationship | X, the Y capital city | X, capital of Y | X, the Y capital
Capital_of | 1 | 1 | 1

After generalizing:
Relationship | X, the Y capital * | X, capital of Y
Capital_of | 2 | 1
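One way to read the generalization step is that a pattern which merely extends another pattern by trailing tokens is folded into the shorter one, marked with a wildcard. The sketch below implements only that reading; the merge rule is an assumption made for illustration and may differ from the rule used in the actual system.

```python
from collections import Counter

def generalize(counts):
    """Fold patterns that differ only by trailing tokens into '<prefix> *'."""
    merged = Counter()
    for pattern, n in counts.items():
        tokens = pattern.split()
        # Other patterns that are a token prefix of this pattern.
        prefixes = [q for q in counts
                    if q != pattern and tokens[:len(q.split())] == q.split()]
        # Other patterns that extend this pattern with extra trailing tokens.
        extended = [q for q in counts
                    if q != pattern and q.split()[:len(tokens)] == tokens]
        if prefixes:
            merged[min(prefixes, key=len) + " *"] += n
        elif extended:
            merged[pattern + " *"] += n
        else:
            merged[pattern] += n
    return merged

capital_of = {
    "X, the Y capital city": 1,
    "X, capital of Y": 1,
    "X, the Y capital": 1,
}
generalized = generalize(capital_of)
print(dict(generalized))  # {'X, the Y capital *': 2, 'X, capital of Y': 1}
```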
Relationship Patterns
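Pattern counts:
Relationship | X, the Y capital * | X, capital of Y | X, * * Y | X, predecessor of Y
Capital_of | 2 | 2 | 2 | 0
predecessor | 0 | 0 | 2 | 2

Normalized per pattern (column):
Relationship | X, the Y capital * | X, capital of Y | X, * * Y | X, predecessor of Y
Capital_of | 1.0 | 1.0 | 0.5 | 0
predecessor | 0 | 0 | 0.5 | 1.0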
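The second matrix follows from the first by dividing every count by its column total, so each cell reads as the share of a pattern's occurrences that belong to a relationship. A minimal sketch, using the numbers from the slide (the variable and function names are illustrative):

```python
PATTERNS = ["X, the Y capital *", "X, capital of Y", "X, * * Y", "X, predecessor of Y"]
COUNTS = {
    "Capital_of": [2, 2, 2, 0],
    "predecessor": [0, 0, 2, 2],
}

def normalize_by_pattern(counts):
    """Divide each count by the total of its pattern column."""
    totals = [sum(column) for column in zip(*counts.values())]
    return {
        relation: [c / t if t else 0.0 for c, t in zip(row, totals)]
        for relation, row in counts.items()
    }

R2P = normalize_by_pattern(COUNTS)
print(R2P["Capital_of"])   # [1.0, 1.0, 0.5, 0.0]
print(R2P["predecessor"])  # [0.0, 0.0, 0.5, 1.0]
```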
Resolve Relationships
Relationship | X, the Y capital * | X, capital of Y | X, * * Y | X, predecessor of Y
Capital_of | 1.0 | 1.0 | 0.5 | 0
predecessor | 0 | 0 | 0.5 | 1.0

×

0.5 X, the Y capital *
0.25 X, capital of Y
0.25 X, * * Y
0 X, predecessor of Y

=

Capital_of | predecessor
0.875 | 0.125
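The result on this slide is a matrix-vector product: each relationship row is multiplied with the pattern weights observed for a new entity pair. A small sketch that reproduces the numbers (names are illustrative):

```python
# Rows: relationships; columns: the four patterns, in slide order.
R2P = {
    "Capital_of": [1.0, 1.0, 0.5, 0.0],
    "predecessor": [0.0, 0.0, 0.5, 1.0],
}
# Share of each pattern among the text occurrences of the new entity pair.
observed = [0.5, 0.25, 0.25, 0.0]

def resolve(r2p, pattern_weights):
    """Score each relationship as the dot product of its row with the weights."""
    return {
        relation: sum(w * p for w, p in zip(row, pattern_weights))
        for relation, row in r2p.items()
    }

scores = resolve(R2P, observed)
print(scores)  # {'Capital_of': 0.875, 'predecessor': 0.125}
```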
Advanced Computations
• LSI to determine relationship similarities
  – Reduces sparsity in the matrix and makes relationship rows more comparable
  – Allows better use of pertinence computation
• Entropy: increase weights for more unique patterns
• Pertinence: smoothing of pattern occurrence frequencies
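The slides do not give the entropy formula. As an illustration only, the sketch below uses a common normalized-entropy weighting: a pattern that occurs with a single relationship gets weight 1, a pattern spread evenly over all relationships gets weight 0. Treat the formula and the function name as assumptions.

```python
import math

def entropy_weights(counts):
    """Return one weight per pattern column: 1 - H(column) / log(#relations)."""
    n_relations = len(counts)
    weights = []
    for column in zip(*counts.values()):
        total = sum(column)
        probs = [c / total for c in column if c > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        weights.append(1.0 - entropy / math.log(n_relations) if n_relations > 1 else 1.0)
    return weights

COUNTS = {
    "Capital_of": [2, 2, 2, 0],
    "predecessor": [0, 0, 2, 2],
}
weights = entropy_weights(COUNTS)
print(weights)
```

With these counts the generic pattern "X, * * Y" (third column) is the only one shared equally by both relationships, so it receives the lowest weight.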
Example Output (DBPedia)
Subject :: Object | Extracted Rank 1 (Rel;Confidence) | Rank 2 | Rank 3
Howard Pawley :: Gary Filmon | successor;0.799 | after;0.768 | office;0.686
Species Deceases :: Midnight Oil | producer;0.761 | artist;0.719 | genre;0.467
The Crystal City :: Orson Scott Card | artist;0.625 | author;0.617 | writer;0.583
Horatio Allen :: William Maxwell | predecessor;0.629 | before;0.475 |
Basdeo Panday :: Trinidad & Tobago | deathPlace;0.658 | birthplace;0.658 | nationality;0.330
Beccles railway station :: Suffolk | district;0.772 | borough;0.770 | friend;0.749
Pertinence for Relations
• Looking at fact extraction as a classification of concept pairs into classes of relations
• Class boundaries are not clear cut
• E.g. has_physical_part and has_part
• Don’t punish the occurrence of the same pattern with relationship types that are similar
Relationship Patterns
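Pattern counts:
Relationship | X, the Y capital * | X, capital of Y | X, * * Y | X, located in Y
Capital_of | 2 | 2 | 2 | 2
Located_in | 0 | 0 | 2 | 4

Resulting weights:
Relationship | X, the Y capital * | X, capital of Y | X, * * Y | X, located in Y
Capital_of | 1.0 | 1.0 | 0.2 | 0.5
Located_in | 0 | 0 | 0.2 | 0.9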
Resolve Relationships
Relationship | X, the Y capital * | X, capital of Y | X, * * Y | X, located in Y
Capital_of | 1.0 | 1.0 | 0.2 | 0.5
Located_in | 0 | 0 | 0.2 | 0.9

×

0.4 X, the Y capital *
0.1 X, capital of Y
0.3 X, * * Y
0.2 X, located in Y

=

Capital_of | Located_in
0.66 | 0.24
Evaluation of the fact extraction - DBPedia
[Plot: precision and recall plotted against the confidence threshold]
Strict evaluation: only the 1st ranked extracted relation is compared to the gold standard. Averaged over relation types.
60% training set, 40% testing, DBPedia Infobox fact corpus, Wikipedia text corpus
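The strict evaluation can be sketched as below. The data are made up, and the sketch computes a single overall precision and recall; it does not perform the per-relation-type averaging mentioned on the slide.

```python
def precision_recall(predictions, gold, threshold):
    """predictions: pair -> ranked list of (relation, confidence).
    gold: pair -> correct relation. Only the rank-1 relation is compared."""
    extracted = correct = 0
    for pair, ranked in predictions.items():
        if not ranked:
            continue
        relation, confidence = ranked[0]
        if confidence >= threshold:
            extracted += 1
            correct += relation == gold.get(pair)
    precision = correct / extracted if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical held-out facts and extractions.
gold = {("A", "B"): "capital_of", ("C", "D"): "capital_of", ("E", "F"): "capital_of"}
predictions = {
    ("A", "B"): [("capital_of", 0.9), ("located_in", 0.4)],
    ("C", "D"): [("located_in", 0.6)],
    ("E", "F"): [("capital_of", 0.3)],
}
for threshold in (0.5, 0.8):
    print(threshold, precision_recall(predictions, gold, threshold))
```

Raising the threshold from 0.5 to 0.8 drops the wrong rank-1 extraction for ("C", "D"), so precision rises while recall stays the same in this toy example.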
Evaluation of the fact extraction - UMLS
[Plot: precision and recall plotted against the confidence threshold]
Strict evaluation: only the 1st ranked extracted relation is compared to the gold standard. Averaged over relation types.
60% training set, 40% testing, UMLS fact corpus, MedLine text corpus
Manual Evaluation strategy (DBPedia)
Score | Subject :: Object | Suggested relationship | Extracted Rank 1 (Rel;Confidence) | Rank 2 | Rank 3
1 | Howard Pawley :: Gary Filmon | after | successor;0.799 | after;0.768 | office;0.686
0.5 | Mulan :: Tarzan | after | nextSingle;0.603 | followedBy;0.533 | after;0.416
1 | Species Deceases :: Midnight Oil | artist | producer;0.761 | artist;0.719 | genre;0.467
1 | The Crystal City :: Orson Scott Card | author | artist;0.625 | author;0.617 | writer;0.583
1 | Horatio Allen :: William Maxwell | before | predecessor;0.629 | before;0.475 |
1 | Basdeo Panday :: Trinidad & Tobago | birthplace | deathPlace;0.658 | birthplace;0.658 | nationality;0.330
1 | Bob Nystrom :: Stockholm | birthplace | cityOfBirth;0.677 | birthplace;0.513 |
1 | Beccles railway station :: Suffolk | borough | district;0.772 | borough;0.770 | friend;0.749
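One plausible way to turn such graded scores into a precision figure is to average the manual scores of all facts whose rank-1 confidence passes a threshold. The slides do not state the aggregation used, so the averaging rule and the use of the rank-1 confidence are assumptions; the (score, confidence) pairs are taken from the table above.

```python
# (manual score, rank-1 confidence) for the eight example rows.
ROWS = [
    (1.0, 0.799),  # Howard Pawley :: Gary Filmon
    (0.5, 0.603),  # Mulan :: Tarzan
    (1.0, 0.761),  # Species Deceases :: Midnight Oil
    (1.0, 0.625),  # The Crystal City :: Orson Scott Card
    (1.0, 0.629),  # Horatio Allen :: William Maxwell
    (1.0, 0.658),  # Basdeo Panday :: Trinidad & Tobago
    (1.0, 0.677),  # Bob Nystrom :: Stockholm
    (1.0, 0.772),  # Beccles railway station :: Suffolk
]

def graded_precision(rows, min_confidence):
    """Mean manual score of the facts at or above the confidence threshold."""
    kept = [score for score, confidence in rows if confidence >= min_confidence]
    return sum(kept) / len(kept) if kept else None

print(graded_precision(ROWS, 0.5))  # 0.9375
print(graded_precision(ROWS, 0.7))  # 1.0
```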
Manual Evaluation strategy (UMLS)
poisoning, fluoride :: teeth [finding_site_of] finding_site_of 1
polyneuritis, endemic :: vitamin b 1 [associated_with] has_form 0
polyp of cervix nos (disorder) :: 768 polyps [associated_with] associated_with 1
polyp of cervix nos (disorder) :: neck of uterus [location_of] finding_site_of 1
polyp of colon :: benign neoplasms [related_to] associated_with 0.5
brain :: brain contusion [has_location] associated_morphology_of 0.25
brain :: brain ischemia [has_finding_site] location_of 0.5
polyp of colon :: gastrointestinal tract, nos [is_primary_anatomic_site_of_disease] location_of 0.5
polyvesicular vitelline tumor :: gamete structure (cell structure) [is_normal_cell_origin_of_disease] is_normal_cell_origin_of_disease 1
proptosis :: apert syndrome [has_manifestation] has_manifestation 1
Manually evaluated precision for different confidence values
Manually evaluated precision, confidence > 0.5 (on UMLS – MedLine corpus)
[Chart: manually evaluated precision on a 0 to 1 axis; series label: UMLS - Pert - Ent]
Summary Model Creation
• Using background knowledge in the form of a fact corpus and a text corpus, we can suggest new facts/propositions
• It is possible to try all combinations of known concepts (e.g. Read-the-Web project), but this creates a huge validation backlog
• Letting users drive the model creation focuses the creation on the parts that are of common interest
  → Willingness to help validate facts
Circle of knowledge on the Web
What is knowledge?
How do we turn propositions/beliefs into knowledge?
How do we acquire knowledge?
Part 3
Evaluation and Use
Christopher Thomas, Wenbo Wang, Delroy Cameron, Pablo Mendes, Pankaj Mehra and Amit Sheth, What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation, to appear in WebScience 2010
Current Work
Explicit evaluation
• “Evaluate for evaluation’s sake”
  – Domain experts rank the value of a proposition
  – Committees of experts and/or laymen vote on the correctness of propositions
Explicit evaluation in the Semantic Browser
• The user can vote on facts
• Some facts are presented randomly
• Most facts are presented after the user (by browsing) showed interest in
  – The full triple
  – Subject/Object of the triple
Implicit evaluation
• Evaluation that does not explicitly involve a vote on the extracted information
• Use the Wisdom of the Crowds
• Users show support for a proposition by performing an action
• Every action taken on a piece of information is recorded and analyzed
• The cumulative behavior of the users gives an indication of which propositions are correct or interesting
Implicit evaluation in the Semantic Browser
• The user simply searches and browses
• The search history and the click-stream provide information about whether a page transition using an extracted triple was successful
• Assumption: on average, a successful trail-browsing session includes valid triples
• Problem: requires extensive use
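A minimal sketch of how click-stream data might be aggregated into implicit support for triples. The session format, the boolean success flag, and the success-rate measure are all assumptions for illustration; the slides do not specify how a session is judged successful.

```python
from collections import defaultdict

def triple_support(sessions):
    """sessions: list of (triples_used, was_successful).
    Returns, per triple, the share of sessions using it that were successful."""
    used = defaultdict(int)
    succeeded = defaultdict(int)
    for triples, successful in sessions:
        for triple in set(triples):
            used[triple] += 1
            succeeded[triple] += successful
    return {triple: succeeded[triple] / used[triple] for triple in used}

# Hypothetical browsing sessions over two extracted triples.
t1 = ("Canberra", "capital_of", "Australia")
t2 = ("Canberra", "predecessor", "Australia")
sessions = [([t1], True), ([t1, t2], False), ([t1], True)]
support = triple_support(sessions)
print(support[t1], support[t2])
```

Because the signal is only meaningful on average, a triple needs to appear in many sessions before its score says much, which is the "requires extensive use" problem noted above.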
Implicit evaluation in the Semantic Browser
[Figure labels: 1st triple; 2nd triple; Triples]
Conclusion
• Creating domain models gives a way of selectively adding knowledge to a system
• We showed that it is possible to automatically create such models with high accuracy
• The models immediately impact users
  → Willingness to help evaluate
• Evaluation becomes an integral part of the knowledge lifecycle
?
Thank you