Rule-based Knowledge Aggregation for Large-Scale Protein Sequence Analysis of Influenza A Viruses

Sixth International Conference on Bioinformatics (InCoB2007), Hong Kong, 28th August 2007

Olivo Miotto

Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore

Tan Tin Wee

Yong Loo Lin School of Medicine, National University of Singapore

Vladimir Brusic

Cancer Vaccine Center, Dana-Farber Cancer Inst.

Page 2

Outline

Knowledge Aggregation in large-scale analysis

Semantic Technologies for Knowledge Aggregation

Task: Annotating the Influenza Dataset

XML-based structural rules

Rule-based knowledge restructuring

Discussion and Conclusions

Page 3

Outline

1. Knowledge Aggregation in large-scale analysis

Page 4

Knowledge Aggregation: Scaling up Bioinformatics

Bioinformatic analysis is currently limited in scope
  Usually single domain (single aspect)
  Mostly small datasets (single genes, or few sequences)

"Horizontal" scalability: connecting domains
  Multiple database sources, diversely purposed data
  Systemic and semantic heterogeneity
  Discovery by relationship analysis

"Vertical" scalability: analyzing large datasets
  Many thousands of records
  Diversity of geography, tissue types, host, etc.
  Discovery by comparative analysis

Currently, dataset preparation is manual

Page 5

Horizontal Scalability

BioHaystack: Semantic Web Browser (IBM + MIT)

Quan, D (2004): BioHaystack: Gateway to the Biological Semantic Web

www.w3.org/2004/Talks/0520-em-swa/WWW-2004-BioHaystack-W3C-track.ppt

Page 6

Vertical Scalability

Mutual Information Analysis

Metadata Selection

Identification of Characteristic Sites

Page 7

Obstacles to Scalability

Heterogeneity of Biological Databases
  Systemic: access to data in different databases
  Syntactic: data formats, use of free text
  Structural: different table structures in different databases
  Semantic: data with different meaning and intent

Semantic Heterogeneity is particularly insidious
  Data is rarely used in the way it was originally intended

Low level of end-user technical expertise
  Biologists, not computer scientists
  Excel spreadsheets, Web page “scraping”
  Does not scale up

Page 8

Outline

2. Semantic Technologies for Knowledge Aggregation

Page 9

Knowledge Aggregation: Technology requirements

To enable large-scale Knowledge Aggregation we need a technology platform with
  Structural independence
  Structural adaptability

To support biological researchers we need a technology platform with
  Limited infrastructure needs
  Intuitiveness
  Easy interchange and transformation

Best current candidate: Semantic Technologies

Page 10

Semantic Technologies: XML

XML is a tried-and-tested, self-descriptive encoding that supports any data application

Has a standard software platform for parsing and transforming data

<struct_refCategory>
  <struct_ref id="1">
    <db_name>UNP</db_name>
    <db_code>HEMA_IAZH3</db_code>
    <pdbx_db_accession>P11134</pdbx_db_accession>
    <entity_id>1</entity_id>
    <pdbx_seq_one_letter_code> GLFGAIAGFIENGWEGMIDGWYG </pdbx_seq_one_letter_code>
    <pdbx_align_begin>330</pdbx_align_begin>
  </struct_ref>
</struct_refCategory>
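
To illustrate the "standard software platform" point, here is a minimal sketch that reads the fragment above with Python's standard-library XML parser. The function name `struct_ref_fields` and the choice of Python are assumptions made for illustration; the slides do not say which parser was used.

```python
import xml.etree.ElementTree as ET

# The fragment shown on the slide, reproduced as a string.
XML_FRAGMENT = """
<struct_refCategory>
  <struct_ref id="1">
    <db_name>UNP</db_name>
    <db_code>HEMA_IAZH3</db_code>
    <pdbx_db_accession>P11134</pdbx_db_accession>
    <entity_id>1</entity_id>
    <pdbx_seq_one_letter_code> GLFGAIAGFIENGWEGMIDGWYG </pdbx_seq_one_letter_code>
    <pdbx_align_begin>330</pdbx_align_begin>
  </struct_ref>
</struct_refCategory>
"""


def struct_ref_fields(xml_text):
    """Return one dict per <struct_ref>, mapping child tag names to text."""
    root = ET.fromstring(xml_text)
    records = []
    for ref in root.findall("struct_ref"):
        fields = {"id": ref.get("id")}
        for child in ref:
            # Element names describe their own content, so no schema is needed here.
            fields[child.tag] = (child.text or "").strip()
        records.append(fields)
    return records


records = struct_ref_fields(XML_FRAGMENT)
print(records[0]["db_code"], records[0]["pdbx_db_accession"])
```

Because each value is labelled by its element name, the same few lines work for any record of this shape without knowing the source database schema.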

Page 11

Semantic Technologies: RDF

RDF defines a very simple universal data structure encoded in XML

RDBMS Table

ID        Name          DOB          Street                 Postcode   Spouse
S885347   Tan Ah Lian   1/1/1975     88, Bukit Timah Road   536564
S324567   Goh Ah Beng   25/12/1972   127, Orchard Road      243623     S658347

RDF

Subject        Property   Value
S324567        name       Goh Ah Beng
S324567        dob        25/12/1972
S324567        address    S324567-home
S324567-home   street     127, Orchard Road
S324567-home   postcode   243623
S324567        spouse     S658347

Same structure for any kind of data!
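
The table-to-triples mapping above can be sketched in a few lines of Python. This is only an illustration using plain tuples rather than an RDF library; the function name `person_to_triples` is hypothetical, and the `-home` suffix for the address node simply follows the example on the slide.

```python
def person_to_triples(row):
    """Flatten one table row into (subject, property, value) triples."""
    subject = row["ID"]
    home = f"{subject}-home"  # intermediate node for the address, as on the slide
    triples = [
        (subject, "name", row["Name"]),
        (subject, "dob", row["DOB"]),
        (subject, "address", home),
        (home, "street", row["Street"]),
        (home, "postcode", row["Postcode"]),
    ]
    # Empty cells simply produce no triple.
    if row.get("Spouse"):
        triples.append((subject, "spouse", row["Spouse"]))
    return triples


row = {
    "ID": "S324567",
    "Name": "Goh Ah Beng",
    "DOB": "25/12/1972",
    "Street": "127, Orchard Road",
    "Postcode": "243623",
    "Spouse": "S658347",
}
triples = person_to_triples(row)
for triple in triples:
    print(triple)
```

A table with different columns would yield different properties, but the three-column shape of the output stays the same.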

Page 12

Semantic Technologies: Ontologies

Ontologies: vocabularies of concepts and properties that describe a field of knowledge

OWL technology allows users to define ontologies
  Shared ontologies allow interchange of data

Ontologies support REASONING by means of programs that
  Read RDF data, encoded using an ontology
  Apply rules that relate to the described properties
  Generate new knowledge from these rules
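
The read / apply rules / generate loop can be sketched as a tiny forward-chaining routine over triples. This is a toy illustration, not an OWL reasoner; the rule that `spouse` is symmetric is an assumption chosen for the example and does not come from the slides.

```python
def infer(triples, rules):
    """Apply every rule repeatedly until no new triples are produced."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = set(rule(known)) - known
            if new:
                known |= new
                changed = True
    return known


def spouse_is_symmetric(triples):
    """Hypothetical rule: if A has spouse B, then B has spouse A."""
    for subject, prop, value in triples:
        if prop == "spouse":
            yield (value, "spouse", subject)


facts = {("S324567", "spouse", "S658347")}
result = infer(facts, [spouse_is_symmetric])
print(sorted(result))
```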

Page 13

Outline

3. Task: Annotating the Influenza Dataset

Page 14

Study goals

Analyze all influenza protein sequences available
  GenBank + GenPept = 92,343 documents
  Final dataset comprises 40,169 unique sequences

Various types of analysis, e.g.
  Identify amino acid mutation sites that characterize human-transmissible strains
  Compare the diversity of viral sequences over different periods of time and geographical areas

Several metadata fields required
  Protein name, Subtype, Isolate, Host, Country, Year

Manual Curation is not an Option!

Page 15

Inconsistencies in GenBank records

[Figure: example GenBank records, labelled "Good", "Not so Good" and "Pretty Bad"]

Page 16

Experimental Approach

1. Retrieve all influenza A records from GenBank and GenPept in XML format, using the ABK platform (Miotto O, Tan TW, Brusic V (2005) LNCS 3578, 398-405)

2. Use XML structural rules to extract, merge and reconcile the metadata from the records

3. Use RDF encoding and an Ontology to encode and structure the resulting metadata

4. Use a Reasoner with Semantic Rules to restructure the metadata, and make inferences that improve the consistency

Page 17

Outline

4. XML-based structural rules

Page 18

Leveraging on XML

XML offers great advantages for extracting heterogeneous metadata
  Wide availability
  Popular encoding for source databases
  Standard processing software
  Independence from source schemas
  Query Language (XPath)

Some disadvantages
  Almost unreadable by humans
  Interpretation of semantics requires understanding the schema

Page 19

Page 20

ABK Structural Rules

Concise visualization of XML as name/value tree

Familiar presentation of metadata for biologists

Point-and-click selection of location and constraints

Automatic formation of XML Structural Rule

Hierarchical value reconciliation

Tabulated visualization and manual curation

RDF storage and output

Page 21

Structural Rules for Influenza Analysis

Property      Rule priorities
proteinName   1-2
subtype       1-3
isolate       1-3
host          1-4
origin        1-6
year          1-5

XPath expressions used by the rules (the slide layout that paired each expression with a property and priority did not survive extraction; some expressions serve more than one rule):

/GBSeq/GBSeq_definition
/GBSeq/GBSeq_references/GBReference/GBReference_title
/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='gene']/GBQualifier_value
/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='organism']/GBQualifier_value
/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='isolate']/GBQualifier_value
/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='strain']/GBQualifier_value
/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='note']/GBQualifier_value
/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='isolation_source']/GBQualifier_value
/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='specific_host']/GBQualifier_value
/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='country']/GBQualifier_value

Applicable to GBXML (GenBank and GenPept)
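
A minimal sketch of how prioritized structural rules might be applied to one record: try each XPath in priority order and keep the first non-empty value. The priority order shown for `isolate` (isolate qualifier first, then strain) is a hypothetical choice for illustration, as are the names `first_match` and `ISOLATE_RULES` and the sample record. The paths are written relative to the `GBSeq` root because Python's `ElementTree` does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Template for the qualifier XPaths listed on the slide, relative to <GBSeq>.
QUALIFIER = (
    "GBSeq_feature-table/GBFeature/GBFeature_quals/"
    "GBQualifier[GBQualifier_name='{name}']/GBQualifier_value"
)

# Hypothetical priority order for the 'isolate' property (not taken from the slides).
ISOLATE_RULES = [
    QUALIFIER.format(name="isolate"),
    QUALIFIER.format(name="strain"),
]


def first_match(record, xpaths):
    """Return (priority, value) for the first rule that yields a value, else None."""
    for priority, xpath in enumerate(xpaths, start=1):
        for node in record.findall(xpath):
            value = (node.text or "").strip()
            if value:
                return priority, value
    return None


# Made-up record with a strain qualifier but no isolate qualifier.
RECORD = ET.fromstring("""
<GBSeq>
  <GBSeq_feature-table>
    <GBFeature>
      <GBFeature_quals>
        <GBQualifier>
          <GBQualifier_name>organism</GBQualifier_name>
          <GBQualifier_value>Influenza A virus</GBQualifier_value>
        </GBQualifier>
        <GBQualifier>
          <GBQualifier_name>strain</GBQualifier_name>
          <GBQualifier_value>A/Duck/GD/1234/04</GBQualifier_value>
        </GBQualifier>
      </GBFeature_quals>
    </GBFeature>
  </GBSeq_feature-table>
</GBSeq>
""")

match = first_match(RECORD, ISOLATE_RULES)
print(match)
```

Recording which priority fired, not just the value, makes it possible to measure how much each rule contributes.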

Page 22

Database Performance

[Figure: two bar charts, Yield and Accuracy (0% to 100%), for Subtype, Isolate, Country, Host and Year, comparing GenBank and GenPept]

GenBank is more thoroughly annotated than GenPept

Page 23

Rule performance

[Figure: per-rule bar charts (0% to 100%) for Subtype (rules 1-3), Isolate Name (rules 1-3), Host (rules 1-4), Origin (rules 1-6) and Year (rules 1-5), comparing GenBank and GenPept]

Multiple rules often needed

Some properties are very fragmented

Page 24

Outline

5. Rule-based knowledge restructuring

Page 25

Semantic Metadata Restructuring

Semantic Structure Gap
  GenBank semantics represents individual sequences
  A single isolate can comprise multiple sequences
  -> Sequences from the same isolate can present metadata discrepancies

Semantic Restructuring
  Restructure metadata to relate sequences from the same isolate

Implemented using Jena2 (http://jena.sourceforge.net/)

Native Jena rule-based reasoner

Jena OWL reasoner validates inferences against ontology

Page 26

Semantic Restructuring

A. Semantics of GenBank

record-234567 (SequenceRecord)
  isolate       A/Duck/GD/1234/04
  origin        CHINA
  year          2004
  genbankRef    Genbank:123456
  proteinName   NS1
  dnaSequence   record-234567/nt (DnaSequence)

B. Restructured Semantics

isolate-a/duck/gd/1234/04 (IsolateRecord)
  isolate             A/Duck/GD/1234/04
  origin              CHINA
  year                2004
  hasSequenceRecord   record-234567

record-234567 (SequenceRecord)
  genbankRef    Genbank:123456
  proteinName   NS1
  dnaSequence   record-234567/nt (DnaSequence)

Page 27

Restructuring Rules

[rule1: (?rec rdf:type vg:SequenceRecord)
        (?rec vg:isolate ?isolateId)
        normalizeIsolate(?isolateId, ?nIsoId)
        uriConcat('urn:abk:isolate:', ?nIsoId, ?isolateUri)
        ->
        (?isolateUri rdf:type vg:IsolateRecord)
        (?isolateUri vg:hasSequenceRecord ?rec)
]

[rule2: (?isolateUri vg:hasSequenceRecord ?rec)
        (?rec ?prop ?value)
        oneOf(?prop, vg:isolate, vg:virusSubtype, vg:year,
              vg:country, vg:hostOrganism)
        ->
        (?isolateUri ?prop ?value)
]

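
The effect of the two rules can be emulated in plain Python over (subject, property, value) triples. This is a sketch, not the Jena implementation: the body of `normalize_isolate` (remove whitespace, lowercase) is an assumption, since the slides only name the builtin, and the sample records are made up.

```python
# Properties that rule2 copies from a sequence record to its isolate record.
SHARED = {"isolate", "virusSubtype", "year", "country", "hostOrganism"}


def normalize_isolate(isolate_id):
    """Assumed normalization: drop whitespace and lowercase the isolate name."""
    return "".join(isolate_id.split()).lower()


def restructure(triples):
    """Emulate rule1 and rule2; returns the input plus the inferred triples."""
    result = set(triples)
    seq_records = {s for s, p, o in triples
                   if p == "rdf:type" and o == "SequenceRecord"}

    # rule1: create an IsolateRecord per normalized isolate name and link records to it.
    for s, p, o in triples:
        if p == "isolate" and s in seq_records:
            uri = "urn:abk:isolate:" + normalize_isolate(o)
            result.add((uri, "rdf:type", "IsolateRecord"))
            result.add((uri, "hasSequenceRecord", s))

    # rule2: copy the shared properties of each linked record up to the isolate.
    # One pass is enough because rule2's output never triggers rule1.
    links = [(s, o) for s, p, o in result if p == "hasSequenceRecord"]
    for isolate_uri, rec in links:
        for s, p, o in triples:
            if s == rec and p in SHARED:
                result.add((isolate_uri, p, o))
    return result


data = {
    ("record-1", "rdf:type", "SequenceRecord"),
    ("record-1", "isolate", "A/Duck/GD/1234/04"),
    ("record-1", "year", "2004"),
    ("record-2", "rdf:type", "SequenceRecord"),
    ("record-2", "isolate", "a/duck/GD/1234/04"),
    ("record-2", "country", "CHINA"),
}
out = restructure(data)
isolate_uri = "urn:abk:isolate:a/duck/gd/1234/04"
print(sorted(t for t in out if t[0] == isolate_uri))
```

In this example both records end up under one isolate despite the different capitalization, and the isolate collects `year` from one record and `country` from the other.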

Page 28

Semantic Validation identifies Inconsistencies

A

record-234567890 (SequenceRecord)
  isolate       A/Duck/GD/1234/04
  origin        CHINA
  proteinName   NA

record-345678901 (SequenceRecord)
  isolate       A/Duck/GD/1234/04
  origin        JAPAN
  proteinName   HA

B

Isolate-a/duck/gd/1234/04 (IsolateRecord)
  isolate             A/Duck/GD/1234/04
  origin              CHINA
  origin              JAPAN
  hasSequenceRecord   record-234567890 (SequenceRecord, proteinName NA)
  hasSequenceRecord   record-345678901 (SequenceRecord, proteinName HA)

Multiple Values For Functional Property
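
The inconsistency flagged above can be detected with a simple check: a functional property may hold only one value per subject, so any subject with two distinct values is reported. This sketch stands in for the ontology-based validation and is not an OWL reasoner; the function name `functional_conflicts` and the set of functional properties passed in are assumptions for the example.

```python
from collections import defaultdict


def functional_conflicts(triples, functional_properties):
    """Return {(subject, property): values} where a functional property has >1 value."""
    values = defaultdict(set)
    for subject, prop, value in triples:
        if prop in functional_properties:
            values[(subject, prop)].add(value)
    return {key: vals for key, vals in values.items() if len(vals) > 1}


isolate = "isolate-a/duck/gd/1234/04"
restructured = [
    (isolate, "isolate", "A/Duck/GD/1234/04"),
    (isolate, "origin", "CHINA"),
    (isolate, "origin", "JAPAN"),
    (isolate, "hasSequenceRecord", "record-234567890"),
    (isolate, "hasSequenceRecord", "record-345678901"),
]
conflicts = functional_conflicts(restructured, {"isolate", "origin", "year"})
print(conflicts)
```

`hasSequenceRecord` is deliberately left out of the functional set, since an isolate is expected to have several sequence records.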

Page 29

Isolate Restructuring

[Figure: two histograms by number of sequences per isolate (1 to 11, >11): Number of Isolates Identified (axis 0 to 3500) and Sequences Identified with Isolates (axis 0 to 20000)]

Full Genome studies are main contributors

Page 30

Re-annotation Results

[Figure: two bar charts for isolate, subtype, year, origin and host: Isolate Annotations (0% to 100%) and Corrections of Sequence Annotation (0 to 1200, split into added and modified)]

Huge Manual Curation savings

Page 31

Outline

6. Discussion and Conclusions

Page 32

Discussion - 1

Large-scale metadata recovery from public databases is difficult even for simple requirements

Relatively simple approaches such as structural rules can do most of the tedious work
  Accuracy can be further improved with machine learning

Semantic inferences can improve data quality
  Significant impact on manual curation task

Rules have more potential for intuitive end-user GUI than programming
  cf. email rules, firewall rules

Page 33

Discussion - 2

Semantic Technologies are suitable for bioinformatics metadata management today
  Limited infrastructure requirements
  Flexibility and extensibility of ontologies (Open World)

Enormous potential for analysis tool integration
  Build tools that are "semantically agnostic"

Reasoning is currently computationally expensive
  Our simple reasoning tasks exceeded the power of a current desktop when applied to tens of thousands of records
  Divide-and-conquer strategies were effective, but require manual work, and are not always applicable
  Reasoning services and computing grids can help scalability, but only if easy to access

Page 34

Acknowledgements and Thanks

Institute of Systems Science, NUS
  Funding support for this conference

Prof. J Thomas August, Johns Hopkins University

AT Heiny, NUS

Partial Grant Support:

National Institute of Allergy and Infectious Diseases, NIH
  Grant No. 5 U19 AI56541, Contract No. HHSN2662-00400085C

ImmunoGrid Project
  EC Contract FP6-2004-IST-4, No. 028069

Page 35

Metadata Extraction Ontology (fragment)

Sequence Record Class

Six Functional Properties
