Rule-based Knowledge Aggregation for Large-Scale Protein Sequence Analysis
of Influenza A Viruses
Sixth International Conference on Bioinformatics (InCoB2007) Hong Kong, 28th August 2007
Olivo Miotto
Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore
Tan Tin Wee
Yong Loo Lin School of Medicine, National University of Singapore
Vladimir Brusic
Cancer Vaccine Center, Dana-Farber Cancer Inst.
Page 2
Outline
Knowledge Aggregation in large-scale analysis
Semantic Technologies for Knowledge Aggregation
Task: Annotating the Influenza Dataset
XML-based structural rules
Rule-based knowledge restructuring
Discussion and Conclusions
Page 4
Knowledge Aggregation: Scaling up Bioinformatics
Bioinformatic analysis is currently limited in scope
  Usually single domain (single aspect)
  Mostly small datasets (single genes, or few sequences)
"Horizontal" scalability: connecting domains
  Multiple database sources, diversely purposed data
  Systemic and semantic heterogeneity
  Discovery by relationship analysis
"Vertical" scalability: analyzing large datasets
  Many thousands of records
  Diversity of geography, tissue types, host, etc.
  Discovery by comparative analysis
Currently, dataset preparation is manual
Page 5
Horizontal Scalability
BioHaystack: Semantic Web Browser (IBM + MIT)
Quan, D (2004): BioHaystack: Gateway to the Biological Semantic Web
www.w3.org/2004/Talks/0520-em-swa/WWW-2004-BioHaystack-W3C-track.ppt
Page 6
Vertical Scalability
Mutual Information Analysis
[Figure: metadata selection, followed by identification of characteristic sites]
Page 7
Obstacles to Scalability
Heterogeneity of Biological Databases
  Systemic: access to data in different databases
  Syntactic: data formats, use of free text
  Structural: different table structures in different databases
  Semantic: data with different meaning and intent
Semantic heterogeneity is particularly insidious
  Data is rarely used in the way it was originally intended
Low level of end-user technical expertise
  Biologists, not computer scientists
  Excel spreadsheets, Web page "scraping"
  Does not scale up
Page 9
Knowledge Aggregation: Technology requirements
To enable large-scale Knowledge Aggregation we need a technology platform with
  Structural independence
  Structural adaptability
To support biological researchers we need a technology platform with
  Limited infrastructure needs
  Intuitiveness
  Easy interchange and transformation
Best current candidate: Semantic Technologies
Page 10
Semantic Technologies: XML
XML is a tried-and-tested self-descriptive encoding that supports any data application
Has a standard software platform for parsing and transforming data
<struct_refCategory>
  <struct_ref id="1">
    <db_name>UNP</db_name>
    <db_code>HEMA_IAZH3</db_code>
    <pdbx_db_accession>P11134</pdbx_db_accession>
    <entity_id>1</entity_id>
    <pdbx_seq_one_letter_code>
      GLFGAIAGFIENGWEGMIDGWYG
    </pdbx_seq_one_letter_code>
    <pdbx_align_begin>330</pdbx_align_begin>
  </struct_ref>
</struct_refCategory>
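As a minimal sketch of what "a standard software platform for parsing" can look like in practice, the fragment above can be read with Python's standard xml.etree.ElementTree module. The helper name struct_ref_to_dict is illustrative only and does not come from the slides.

```python
import xml.etree.ElementTree as ET

# The XML fragment shown on the slide.
FRAGMENT = """
<struct_refCategory>
  <struct_ref id="1">
    <db_name>UNP</db_name>
    <db_code>HEMA_IAZH3</db_code>
    <pdbx_db_accession>P11134</pdbx_db_accession>
    <entity_id>1</entity_id>
    <pdbx_seq_one_letter_code>
      GLFGAIAGFIENGWEGMIDGWYG
    </pdbx_seq_one_letter_code>
    <pdbx_align_begin>330</pdbx_align_begin>
  </struct_ref>
</struct_refCategory>
"""


def struct_ref_to_dict(elem):
    """Flatten one <struct_ref> element into a name/value dictionary."""
    record = {"id": elem.get("id")}
    for child in elem:
        # Element names act as self-describing field names.
        record[child.tag] = (child.text or "").strip()
    return record


root = ET.fromstring(FRAGMENT)
records = [struct_ref_to_dict(e) for e in root.findall("struct_ref")]
print(records[0]["db_code"], records[0]["pdbx_db_accession"])
```

No schema is needed to read the record: the tag names carry the field names, which is the sense in which the encoding is self-descriptive.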
Page 11
Semantic Technologies: RDF
RDF defines a very simple universal data structure encoded in XML

RDBMS table:
ID       Name         DOB         Street                Postcode  Spouse
S324567  Goh Ah Beng  25/12/1972  127, Orchard Road     243623    S658347
S885347  Tan Ah Lian  1/1/1975    88, Bukit Timah Road  536564

RDF:
Subject       Property  Value
S324567       name      Goh Ah Beng
S324567       spouse    S658347
S324567-home  postcode  243623
S324567-home  street    127, Orchard Road
S324567       address   S324567-home
S324567       dob       25/12/1972

Same structure for any kind of data!
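The table-to-triples mapping on this slide can be sketched in a few lines of Python. This is an illustration only: triples are plain tuples rather than real RDF objects, and the function name row_to_triples is an assumption, not something from the slides.

```python
def row_to_triples(row):
    """Turn one relational row into (subject, property, value) triples.

    The address columns are moved to a separate '<ID>-home' node,
    mirroring the RDF table on the slide.
    """
    subject = row["ID"]
    home = f"{subject}-home"
    triples = [
        (subject, "name", row["Name"]),
        (subject, "dob", row["DOB"]),
        (subject, "address", home),
        (home, "street", row["Street"]),
        (home, "postcode", row["Postcode"]),
    ]
    # An empty column simply produces no triple.
    if row.get("Spouse"):
        triples.append((subject, "spouse", row["Spouse"]))
    return triples


row = {
    "ID": "S324567",
    "Name": "Goh Ah Beng",
    "DOB": "25/12/1972",
    "Street": "127, Orchard Road",
    "Postcode": "243623",
    "Spouse": "S658347",
}
for triple in row_to_triples(row):
    print(triple)
```

Every fact, whatever table it came from, ends up in the same three-column shape, so adding a new property does not require changing any schema.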
Page 12
Semantic Technologies: Ontologies
Ontologies: vocabularies of concepts and properties that describe a field of knowledge
OWL technology allows users to define ontologies
  Shared ontologies allow interchange of data
Ontologies support REASONING by means of programs that
  Read RDF data, encoded using an ontology
  Apply rules that relate to the described properties
  Generate new knowledge from these rules
Page 14
Study goals
Analyze all influenza protein sequences available
  GenBank + GenPept = 92,343 documents
  Final dataset comprises 40,169 unique sequences
Various types of analysis, e.g.
  Identify amino acid mutation sites that characterize human-transmissible strains
  Compare the diversity of viral sequences over different periods of time and geographical areas
Several metadata fields required
  Protein name, Subtype, Isolate, Host, Country, Year
Manual curation is not an option!
Page 15
Inconsistencies in GenBank records
[Figure: example GenBank records labelled "Good", "Not so Good" and "Pretty Bad"]
Page 16
Experimental Approach
1. Retrieve all influenza A records from GenBank and GenPept in XML format, using the ABK platform
Miotto O, Tan TW, Brusic V (2005) LNCS 3578, 398-405.
2. Use XML structural rules to extract, merge and reconcile the metadata from the records
3. Use RDF encoding and an Ontology to encode and structure the resulting metadata
4. Use a Reasoner with Semantic Rules to restructure the metadata, and make inferences that improve the consistency
Page 18
Leveraging XML
XML offers great advantages for extracting heterogeneous metadata
  Wide availability
  Popular encoding for source databases
  Standard processing software
  Independence from source schemas
  Query language (XPath)
Some disadvantages
  Almost unreadable by humans
  Interpretation of semantics requires understanding the schema
Page 19
Page 20
ABK Structural Rules
Concise visualization of XML as name/value tree
Familiar presentation of metadata for biologists
Point-and-click selection of location and constraints
Automatic formation of XML Structural Rule
Hierarchical value reconciliation
Tabulated visualization and manual curation
RDF storage and output
Page 21
Structural Rules for Influenza Analysis
Applicable to GBXML (GenBank and GenPept)

Table columns: Property, Priority, XPath expression
Rules per property (priorities numbered from 1): proteinName 2, subtype 3, isolate 3, host 4, origin 6, year 5

XPath expressions used by the 23 rules:
/GBSeq/GBSeq_definition (1 rule)
/GBSeq/GBSeq_references/GBReference/GBReference_title (1 rule)
/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='NAME']/GBQualifier_value
  with NAME = strain (5 rules), isolate (5), organism (5), isolation_source (2), gene (1), note (1), specific_host (1), country (1)
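The idea of prioritized structural rules can be sketched as follows, using Python's standard xml.etree.ElementTree. This is not the ABK implementation: the rule assignments and priority order in RULES, the sample record, and the function name apply_rules are assumptions made for illustration. The paths are written relative to the GBSeq root because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Shared path pattern for GenBank qualifiers, relative to the <GBSeq> root.
QUAL = ("GBSeq_feature-table/GBFeature/GBFeature_quals/"
        "GBQualifier[GBQualifier_name='{}']/GBQualifier_value")

# Hypothetical rule set: which paths serve which property, and in what
# priority order, is assumed here and not taken from the slide.
RULES = {
    "proteinName": ["GBSeq_definition", QUAL.format("gene")],
    "isolate": [QUAL.format("isolate"), QUAL.format("strain")],
    "origin": [QUAL.format("country"), QUAL.format("isolation_source")],
}


def apply_rules(record, rules):
    """Return {property: (value, priority)} using the first rule that matches."""
    found = {}
    for prop, paths in rules.items():
        for priority, path in enumerate(paths, start=1):
            node = record.find(path)
            if node is not None and node.text and node.text.strip():
                found[prop] = (node.text.strip(), priority)
                break  # lower-priority rules are only fallbacks
    return found


# Hypothetical record, reduced to the elements the rules look at.
SAMPLE = """
<GBSeq>
  <GBSeq_definition>nonstructural protein 1</GBSeq_definition>
  <GBSeq_feature-table>
    <GBFeature>
      <GBFeature_quals>
        <GBQualifier>
          <GBQualifier_name>strain</GBQualifier_name>
          <GBQualifier_value>A/Duck/GD/1234/04</GBQualifier_value>
        </GBQualifier>
        <GBQualifier>
          <GBQualifier_name>country</GBQualifier_name>
          <GBQualifier_value>China</GBQualifier_value>
        </GBQualifier>
      </GBFeature_quals>
    </GBFeature>
  </GBSeq_feature-table>
</GBSeq>
"""

metadata = apply_rules(ET.fromstring(SAMPLE), RULES)
print(metadata)
```

In this sample the record has no isolate qualifier, so the isolate value falls through to the second rule (strain). Keeping the matching priority alongside the value makes it possible to measure how often each rule fires.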
Page 22
Database Performance
[Charts: Yield and Accuracy (0% to 100%) for Subtype, Isolate, Country, Host and Year, comparing GenBank and GenPept]
GenBank is more thoroughly annotated than GenPept
Page 23
Rule performance
[Charts: performance of each rule (0% to 100%), GenBank vs GenPept, in five panels: Subtype (rules 1-3), Isolate Name (rules 1-3), Host (rules 1-4), Origin (rules 1-6), Year (rules 1-5)]
Multiple rules often needed
Some properties are very fragmented
Page 25
Semantic Metadata Restructuring
Semantic Structure Gap
  GenBank semantics represents individual sequences
  A single isolate can comprise multiple sequences
  -> Sequences from the same isolate can present metadata discrepancies
Semantic Restructuring
  Restructure metadata to relate sequences from the same isolate
  Implemented using Jena2 (http://jena.sourceforge.net/)
  Native Jena rule-based reasoner
  Jena OWL reasoner validates inferences against ontology
Page 26
Semantic Restructuring
A. Semantics of GenBank
SequenceRecord record-234567
  genbankRef: Genbank:123456
  isolate: A/Duck/GD/1234/04
  origin: CHINA
  year: 2004
  proteinName: NS1
  dnaSequence: record-234567/nt (DnaSequence)

B. Restructured Semantics
IsolateRecord isolate-a/duck/gd/1234/04
  isolate: A/Duck/GD/1234/04
  origin: CHINA
  year: 2004
  hasSequenceRecord: record-234567
SequenceRecord record-234567
  genbankRef: Genbank:123456
  proteinName: NS1
  dnaSequence: record-234567/nt (DnaSequence)
Page 27
Restructuring Rules
[rule1: (?rec rdf:type vg:SequenceRecord)
(?rec vg:isolate ?isolateId)
normalizeIsolate(?isolateId, ?nIsoId)
uriConcat('urn:abk:isolate:', ?nIsoId, ?isolateUri)
->
(?isolateUri rdf:type vg:IsolateRecord)
(?isolateUri vg:hasSequenceRecord ?rec)
]
[rule2: (?isolateUri vg:hasSequenceRecord ?rec)
(?rec ?prop ?value)
oneOf(?prop, vg:isolate, vg:virusSubtype, vg:year,
vg:country, vg:hostOrganism)
->
(?isolateUri ?prop ?value)
]
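To make the effect of the two rules concrete, the sketch below re-expresses them in plain Python over (subject, property, value) tuples. It is not the Jena implementation. In particular, normalize_isolate is assumed to lowercase and trim the name (suggested by the isolate-a/duck/gd/1234/04 identifier in the restructuring diagram); the behaviour of the actual normalizeIsolate builtin is not given in the slides.

```python
# Properties that rule2 copies from a sequence record to its isolate record.
ISOLATE_PROPS = {"vg:isolate", "vg:virusSubtype", "vg:year",
                 "vg:country", "vg:hostOrganism"}


def normalize_isolate(name):
    # Assumption: normalization is lowercasing plus whitespace trimming.
    return name.strip().lower()


def restructure(triples):
    """Apply rule1 then rule2 and return the original plus inferred triples."""
    inferred = set(triples)
    seq_records = {s for s, p, v in triples
                   if p == "rdf:type" and v == "vg:SequenceRecord"}

    # rule1: create one IsolateRecord per normalized isolate name and
    # link it to every sequence record carrying that name.
    for s, p, v in triples:
        if p == "vg:isolate" and s in seq_records:
            iso_uri = "urn:abk:isolate:" + normalize_isolate(v)
            inferred.add((iso_uri, "rdf:type", "vg:IsolateRecord"))
            inferred.add((iso_uri, "vg:hasSequenceRecord", s))

    # rule2: copy isolate-level properties up to the isolate record.
    links = [(s, v) for s, p, v in inferred if p == "vg:hasSequenceRecord"]
    for iso_uri, rec in links:
        for s, p, v in triples:
            if s == rec and p in ISOLATE_PROPS:
                inferred.add((iso_uri, p, v))
    return inferred


# Hypothetical input: two sequence records from the same isolate.
data = {
    ("rec1", "rdf:type", "vg:SequenceRecord"),
    ("rec1", "vg:isolate", "A/Duck/GD/1234/04"),
    ("rec1", "vg:year", "2004"),
    ("rec2", "rdf:type", "vg:SequenceRecord"),
    ("rec2", "vg:isolate", "a/duck/gd/1234/04"),
    ("rec2", "vg:country", "CHINA"),
    ("rec2", "vg:proteinName", "HA"),
}
result = restructure(data)
```

After the call, both records hang off a single isolate node, which now carries the year from one record and the country from the other. Sequence-level properties such as vg:proteinName stay on the sequence records.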
Page 28
Semantic Validation identifies Inconsistencies
A. Before restructuring
SequenceRecord record-234567890
  isolate: A/Duck/GD/1234/04
  origin: CHINA
  proteinName: NA
SequenceRecord record-345678901
  isolate: A/Duck/GD/1234/04
  origin: JAPAN
  proteinName: HA

B. After restructuring
IsolateRecord isolate-a/duck/gd/1234/04
  isolate: A/Duck/GD/1234/04
  origin: CHINA, JAPAN  <- multiple values for a functional property
  hasSequenceRecord: record-234567890 (proteinName NA), record-345678901 (proteinName HA)
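The check illustrated on this slide can be approximated without an OWL reasoner: a functional property may hold at most one value per subject, so any subject that ends up with two distinct values is flagged. The sketch below is a simplified stand-in for the Jena OWL validation, and the function name functional_violations is an assumption.

```python
from collections import defaultdict


def functional_violations(triples, functional_props):
    """Return {(subject, property): [values]} where a functional
    property has more than one distinct value for the same subject."""
    values = defaultdict(set)
    for s, p, v in triples:
        if p in functional_props:
            values[(s, p)].add(v)
    return {key: sorted(vals) for key, vals in values.items() if len(vals) > 1}


# The restructured isolate from the slide, with conflicting origins.
isolate = "isolate-a/duck/gd/1234/04"
graph = [
    (isolate, "isolate", "A/Duck/GD/1234/04"),
    (isolate, "origin", "CHINA"),
    (isolate, "origin", "JAPAN"),
    (isolate, "hasSequenceRecord", "record-234567890"),
    (isolate, "hasSequenceRecord", "record-345678901"),
]
conflicts = functional_violations(graph, {"isolate", "origin", "year"})
print(conflicts)
```

Only origin is reported. hasSequenceRecord legitimately has several values because it is not declared functional, which is why the set of functional properties has to come from the ontology rather than from the data.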
Page 29
Isolate Restructuring
[Chart: Number of Isolates Identified (0 to 3500) by sequences per isolate (1 to 11, >11)]
[Chart: Sequences Identified with Isolates (0 to 20000) by sequences per isolate (1 to 11, >11)]
Full-genome studies are the main contributors
Page 30
Re-annotation Results
[Chart: Isolate Annotations (0% to 100%) for isolate, subtype, year, origin, host]
[Chart: Corrections of Sequence Annotation (0 to 1200) for isolate, subtype, year, origin, host; series: added, modified]
Huge manual curation savings
Page 32
Discussion - 1
Large-scale metadata recovery from public databases is difficult even for simple requirements
Relatively simple approaches such as structural rules can do most of the tedious work
  Accuracy can be further improved with machine learning
Semantic inferences can improve data quality
  Significant impact on manual curation task
Rules have more potential for intuitive end-user GUI than programming
  cf. email rules, firewall rules
Page 33
Discussion - 2
Semantic Technologies are suitable for bioinformatics metadata management today
  Limited infrastructure requirements
  Flexibility and extensibility of ontologies (Open World)
Enormous potential for analysis tool integration
  Build tools that are "semantically agnostic"
Reasoning is currently computationally expensive
  Our simple reasoning tasks exceeded the power of a current desktop when applied to tens of thousands of records
  Divide-and-conquer strategies were effective, but require manual work, and are not always applicable
  Reasoning services and computing grids can help scalability, but only if easy to access
Page 34
Acknowledgements and Thanks
Institute of Systems Science, NUS
  Funding support for this conference
Prof. J Thomas August, Johns Hopkins University
AT Heiny, NUS
Partial Grant Support:
  National Institute of Allergy and Infectious Diseases, NIH
    Grant No. 5 U19 AI56541, Contract No. HHSN2662-00400085C
  ImmunoGrid Project
    EC Contract FP6-2004-IST-4, No. 028069
Page 35
Metadata Extraction Ontology (fragment)
Sequence Record Class
Six Functional Properties