Bridging Worlds Conference 2008, Singapore. Day Two, Track Three. Speaker 1: Christopher Baker
Christopher J. O. Baker
Institute for InfoComm Research, A*STAR, Singapore
Ontology-centric Knowledge Navigation
... of the scientific literature
Motivation
• Scientists typically need to integrate a spectrum of information to successfully complete a task.
• On average, a scientist or knowledge worker spends 1 day per week searching for, integrating and analyzing information, 50% of which is in unstructured digital formats.
• Access to information structured according to explicit knowledge representations or taxonomies is a fundamental concern of all scientists.
• Moving beyond keyword search requires tools that extend lexical matching to the semantic, conceptual and contextual levels of information. This entails an infrastructure for indexing text segments according to domain-specific metadata.
In the future ...
• Users will be involved in the design of information systems.
• Publishers will charge users for value-added search (who will build such search systems?).
• Users will search across semantically integrated data sources and data types (how to facilitate system creation / adoption?).
• Knowledge-driven systems will be rapidly built and deployed with the engagement of domain experts in a knowledge engineering team.
Literature-driven, Ontology-centric Knowledge Integration and Navigation
Figure (pipeline diagram, labels only): Text Mining, Ontology Population, Ontology, Reasoning, Visual Query.
Content delivery using expressive semantics: 500 documents, blogs, newsfeeds to browse; 50 sentences to read.
W3C Semantic Web Technologies
• URI / LSID
• Ontologies
• Reasoners
• Query Languages
• Web Services
• Service Registries
• Agents
• Multi Agent Systems
• Workflow Engines
• GRID / Semantic GRID
• Text Mining
• Service Oriented Architecture
Controlled Vocabularies / Ontologies
Figure (spectrum diagram, labels only): Catalog/ID; Terms/Glossary/Controlled vocabularies; Thesauri ("narrower term"); Informal is-a, part-of; Formal is-a, part-of; Formal instance; Frames (properties); Value restrictions; General logical constraints.
Capture knowledge: The meaning of important vocabulary (classes, properties/relations and instance data in a domain model). Common domain terminology
Basis for interoperability between information systems.
Make the content in information sources explicit.
Index and query model to a repository of information.
Lipid Ontology
Lipid Hierarchy
Concept Definitions
DL Axioms Graph fragment
> Implementation: OWL-DL
> DL Expressivity: ALCHIQ
> Uses LIPIDMAPS systematic nomenclature
> 560 Named classes
> 352 Lipid subclasses
> 71 Object properties (incl. inverses)
> 4 Datatype properties
> Lipid instance: LIPIDMAPS systematic name
> Depth: 8 levels
Domain Knowledge vs information system metadata
Ontologies Online
Ontology-centric knowledge architecture
• Content Delivery Platform - Automated
  Document delivery from online databases
  Tools for conversion to text-minable text
• Text Mining - Customized and Automated
  Regular Expressions, Named Entities, Relations
• Knowledge Engineering - Ontology Creation
  Domain Modeling / Customized Rapid Prototyping
• Ontology Population - Automated Instantiation
  Sentences as instances / Co-occurrence and named relations (Rules)
Ontology-centric Knowledge Integration
Content Acquisition
Domain-specific raw text
Domain Ontology vs Mixed Metadata: a literature specification
Ontology Population Workflow
• Ontology-based information retrieval - applies NLP to link documents to existing ontologies
• Ontology-driven NLP - NLP that actively uses ontological resources for NLP tasks
• Ontological NLP - ontologies are used as a knowledge base for NLP tasks, and the results of NLP analyses are also exported into an ontology that can then support subsequent semantic queries using description logic reasoners and A-box reasoning
• Ontology-based NLP - the results of NLP are exported to another ontology, using external resources for text processing
Witte et al. 2007
Text Mining
• Class Instance Generation from full text
  – Named entity recognition (gazetteer based)
  – Dictionary-based matching of text tokens to domain-specific vocabularies, i.e. LipidBank, Lipidmaps, KEGG, IUPAC, curated Swissprot terms and the disease ontology of CGM
  – Normalization and grounding to canonical names
• Relation Detection - Role Assertions
  – Co-occurrence and rule-based relation detection of binary pairs from which knowledgebase instances are generated. Primary set of binary interactions mined from text: Lipid-Protein, Lipid-Disease, Protein-Disease
  – Domain-specific library of curated biological relations
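The gazetteer lookup and grounding step can be illustrated with a minimal sketch. This is not the pipeline used in the talk: the `GAZETTEER` entries, the `tag_entities` function and the example terms are all hypothetical, and a real gazetteer would be compiled from the vocabularies listed on the slide.

```python
import re

# Hypothetical gazetteer (an assumption for illustration only):
# lowercased surface form -> (entity type, canonical name).
GAZETTEER = {
    "sphingosine 1-phosphate": ("Lipid", "sphingosine 1-phosphate"),
    "s1p": ("Lipid", "sphingosine 1-phosphate"),
    "ceramide": ("Lipid", "ceramide"),
    "protein kinase c": ("Protein", "protein kinase C"),
    "pkc": ("Protein", "protein kinase C"),
    "alzheimer's disease": ("Disease", "Alzheimer's disease"),
}


def tag_entities(sentence):
    """Return (start, end, entity_type, canonical_name) hits, sorted by position."""
    lowered = sentence.lower()
    hits = []
    # Try longer terms first so multi-word names win over shorter overlaps.
    for term in sorted(GAZETTEER, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, lowered):
            overlaps = any(match.start() < end and start < match.end()
                           for start, end, _, _ in hits)
            if overlaps:
                continue
            entity_type, canonical = GAZETTEER[term]
            # Grounding: every synonym maps to the same canonical name.
            hits.append((match.start(), match.end(), entity_type, canonical))
    return sorted(hits)


if __name__ == "__main__":
    for hit in tag_entities("S1P binds PKC in Alzheimer's disease."):
        print(hit)
```

Mapping every synonym to one canonical name is what allows later steps to treat "S1P" and "sphingosine 1-phosphate" as the same instance.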
Knowledgebase Instantiation
1) Rule-based identification of sentences containing target keywords
2) Instantiation with the JENA API (http://jena.sourceforge.net/)
Target keywords found in sentences are instantiated to the corresponding ontology class:
• Lipid / Protein / Disease instances are instantiated to the respective ontology classes (as tagged by the gazetteer)
• Binary pairs are instantiated to the respective Object Properties as role assertions
• Sentences are instantiated to the respective Datatype properties
For each lipid identified in a sentence, the corresponding data are instantiated to the ontology from Lipid Data Warehouse records, requiring no further text processing:
• Lipid - LIPIDMAPS Systematic Name and its associated Lipid - IUPAC Name, Lipid - synonyms, Lipid - Database ID
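The three kinds of assertion listed above (class instances, role assertions, sentence data) can be pictured as subject-predicate-object triples. The sketch below uses plain Python tuples and does not call the JENA API named on the slide. The function `instantiate` and the property names `occursInSentence`, `hasText` and `interactsWithProtein` are hypothetical.

```python
def instantiate(sentence_id, sentence, entities, relation=None):
    """Build triples for one tagged sentence.

    entities: list of (entity_type, canonical_name) pairs from the tagger.
    relation: optional (subject, property, object) binary pair.
    All property names here are assumptions, not the ontology's real ones.
    """
    triples = set()
    for entity_type, name in entities:
        # Class instance: the named entity becomes an instance of its class.
        triples.add((name, "rdf:type", entity_type))
        triples.add((name, "occursInSentence", sentence_id))
    if relation is not None:
        # Role assertion: the binary pair is attached to an object property.
        triples.add(relation)
    # The sentence itself is kept as an instance with its text as a data value.
    triples.add((sentence_id, "rdf:type", "Sentence"))
    triples.add((sentence_id, "hasText", sentence))
    return triples


if __name__ == "__main__":
    result = instantiate(
        "s1",
        "S1P binds PKC.",
        [("Lipid", "sphingosine 1-phosphate"), ("Protein", "protein kinase C")],
        relation=("sphingosine 1-phosphate", "interactsWithProtein", "protein kinase C"),
    )
    for triple in sorted(result):
        print(triple)
```

Keeping the sentence as its own instance is what lets a query return the supporting sentence alongside the lipid-protein pair.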
Knowledgebase Instantiation
Figure (screenshot, labels only): Lipid Class, Lipid Instance, Protein Instance.
Rule Based Sentence Processing
<Lipid> AND <Protein> AND LipidProteinInteraction-TriggerWord, e.g. "interact", "bind", "mediate"
<Lipid> AND <Disease> AND LipidDiseaseInteraction-TriggerWord, e.g. "involve", "cause"
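A minimal sketch of how such rules might be applied to a tagged sentence follows. It assumes entities have already been tagged as (type, name) pairs, and it matches trigger words by a crude prefix test. The `TRIGGERS` table holds only the example words from the slide, and `detect_relations` is a hypothetical name.

```python
import re

# Trigger word stems per partner type; only the slide's examples are listed.
TRIGGERS = {
    "Protein": ("interact", "bind", "mediate"),
    "Disease": ("involve", "cause"),
}


def detect_relations(sentence, entities):
    """Return (lipid, relation_label, partner, trigger_word) tuples.

    A relation is emitted only when a lipid, a partner entity and a matching
    trigger word all occur in the same sentence.
    """
    words = re.findall(r"[a-z]+", sentence.lower())
    lipids = [name for etype, name in entities if etype == "Lipid"]
    relations = []
    for partner_type, stems in TRIGGERS.items():
        partners = [name for etype, name in entities if etype == partner_type]
        # Crude stem match: "binds" and "binding" both start with "bind".
        trigger = next((w for w in words if w.startswith(stems)), None)
        if trigger is None:
            continue
        label = f"Lipid{partner_type}Interaction"
        for lipid in lipids:
            for partner in partners:
                relations.append((lipid, label, partner, trigger))
    return relations


if __name__ == "__main__":
    tagged = [("Lipid", "sphingosine 1-phosphate"), ("Protein", "protein kinase C")]
    print(detect_relations("S1P binds PKC.", tagged))
```

Sentences where the two entities merely co-occur without a trigger word produce no relation, which is the filtering effect the rules are meant to have on plain co-occurrence.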
Knowledge Integration and Query
Figure (architecture diagram, labels only): User, Search Engine, docs tagged with relevant named entities, Ontology instantiation, Knowledge Navigation vehicle, Output for end user.
Baker CJ, Kanagasabai R, Ang WT, Veeramani A, Low HS, and Wenk MR. Towards ontology-driven navigation of the lipid bibliosphere. BMC Bioinformatics. 2008;9 Suppl 1:S5.
NLP tagging
Papers identified: 262
• 141 papers contributed to ontology instantiation
• 121 papers with no lipid-protein relations
186 lipid names
528 protein names
After normalisation and grounding:
• 92 Lipidmaps systematic names
• 52 IUPAC names, 412 exact synonyms, 6 broad synonyms, 319 protein names
• Cross link to 59 Lipidbank entries
Sentences:
• Co-occurrence before rules: 1356 sentences
• After rules: 683 interaction sentences
92 Lipidmaps names instantiated to 35 classes (2.6 lipids per class)
Instantiation time: 22 seconds
“Instantiated ontology”
Web content or full text papers
User input query
Search Engine
Knowlegator
Query Composition Panel
Ontology Content
Results Panel
Query Syntax
Query Engine Dialogue
Concept
Properties
Overview
Domain expert
Informatician
Find documents and sentences describing protein-lipid interactions and corresponding lipid synonyms.
Complex Query Generation
Pathway Discovery Algorithm
… paths across any object properties, or across user-defined object properties only, e.g. protein interacts with protein
Finds transitive paths across the graph: between source and target concepts. Can define path length and result size
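One way to picture this kind of search is a breadth-first enumeration of paths over instance triples, bounded by path length and result count. The sketch below is an illustration under stated assumptions, not the algorithm from the talk. The function `find_paths`, its parameters, and the example instance and property names are all hypothetical, and edges are followed in one direction only.

```python
from collections import deque


def find_paths(triples, source, target, max_length=3, max_results=10, allowed=None):
    """Enumerate simple directed paths from source to target.

    triples: iterable of (subject, property, object).
    allowed: optional set of property names; None means any property.
    Returns a list of paths, each a list of (subject, property, object) steps.
    """
    graph = {}
    for s, p, o in triples:
        if allowed is None or p in allowed:
            graph.setdefault(s, []).append((p, o))

    results = []
    queue = deque([(source, [])])  # (current node, steps taken so far)
    while queue and len(results) < max_results:
        node, path = queue.popleft()
        if node == target and path:
            results.append(path)
            continue
        if len(path) == max_length:
            continue
        visited = {source} | {o for _, _, o in path}
        for prop, nxt in graph.get(node, []):
            if nxt not in visited:  # keep paths simple (no repeated nodes)
                queue.append((nxt, path + [(node, prop, nxt)]))
    return results


if __name__ == "__main__":
    # Hypothetical instance data for illustration only.
    kb = [
        ("PKC", "interactsWith", "RAF1"),
        ("RAF1", "interactsWith", "MEK1"),
        ("PKC", "bindsLipid", "ceramide"),
    ]
    for path in find_paths(kb, "PKC", "MEK1"):
        print(path)
```

Breadth-first order returns the shortest paths first, so a small `max_results` keeps the most direct connections.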
Pathway Knowledge Discovery
... across multiple relations
Results with semantic labelling
Kanagasabai R, Low HS, Ang WT, Wenk MR, Baker CJO. Ontology-centric navigation of pathway information mined from text. Bio-Ontologies SIG: Knowledge in Biology, ISMB, July 2008.
2 concepts or keywords
Pathway Knowledge Discovery 2
Navigation of Cancer Pathways
1 search term (instance or concept) generates a list of natural language questions answerable by the ontology
and a direct link to answers
Ang WT, Kanagasabai R, Baker CJ. Knowledge Translation: Computing the query potential of bio-ontologies, Genome Informatics Workshop 2008 Submitted …..
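The idea of turning one search term into a list of answerable questions can be sketched with simple templates over property signatures. This is an assumption about one possible approach, not the method of the cited work. The function `questions_for`, the question template and the example property names are hypothetical.

```python
def questions_for(concept, properties):
    """Generate template questions for a concept.

    properties: list of (domain, property_name, range) signatures, assumed to
    be read from the ontology. A question is produced for every property in
    which the concept appears as domain or range.
    """
    questions = []
    for domain, name, rng in properties:
        if domain == concept:
            questions.append(f"Which {rng} is related to {concept} via '{name}'?")
        elif rng == concept:
            questions.append(f"Which {domain} is related to {concept} via '{name}'?")
    return questions


if __name__ == "__main__":
    # Hypothetical property signatures for illustration only.
    signatures = [
        ("Lipid", "interactsWith", "Protein"),
        ("Lipid", "involvedIn", "Disease"),
        ("Protein", "involvedIn", "Disease"),
    ]
    for question in questions_for("Lipid", signatures):
        print(question)
```

Because each question is derived from a property that exists in the ontology, each one corresponds to a query the knowledgebase can attempt to answer.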
Application Workflow
Semantic Technologies Architecture
Knowledge Services: Development
Phase 1 / Phase 2
Navigation Paradigms
NLP & Text Mining
Semantic Data Integration
Knowledge Worker involved in Discovery
Databases
Multi-user involvement
Ontology Engineering
Maintenance
Evolution
Quality
Ontology
Domain Expert
Semantics Engineer Ontology Engineer
Text Mining Engineer
Annotation Services
Acknowledgements
Semantic Technology Group
Christopher J. O. Baker
Kanagasabai Rajaraman
Menaka Rajapakse
Anitha Veeramani
Ang Wee Tiong
Alexander Garcia (Alumnus)
Collaborators
Markus R Wenk, NUS
Low Hong-Sang, NUS
Choo Kar Heng, I2R
Shoba Ranganathan, NUS
Suisheng Tan, I2R