Contents of this Talk

[Used as intro to Genome Databases Seminar, 2002]

Overview of bioinformatics

Motivations for genome databases

Analogy of virus reverse-eng to genome analysis

Questions to ask of a genome DB

Overview of Genome Databases

Peter D. Karp, Ph.D.

SRI International

www-db.stanford.edu/dbseminar/seminar.html

Talk Overview

Definition of bioinformatics

Motivations for genome databases

Computer virus analogy

Issues in building genome databases

Definition of Bioinformatics

Computational techniques for management and analysis of biological data and knowledge

Methods for disseminating, archiving, interpreting, and mining scientific information

Computational theories of biology

Genome Databases is a subfield of bioinformatics

Motivations for Bioinformatics

Growth in molecular-biology knowledge (literature)

Genomics

1. Study of genomes through DNA sequencing

2. Industrial Biology

Example Genomics Datatypes

Genome sequences

  DOE Joint Genome Institute

    511M bases in Dec 2001

    11.97G bases since Mar 1999

Gene and protein expression data

Protein-protein interaction data

Protein 3-D structures

Genome Databases

Experimental data

  Archive experimental datasets

  Retrieving past experimental results should be faster than repeating the experiment

  Capture alternative analyses

  Lots of data, simpler semantics

Computational symbolic theories

  Complex theories become too large to be grasped by a single mind

  The database is the theory

  Biology is very much concerned with qualitative relationships

  Less data, more complex semantics

Bioinformatics

Distinct intellectual field at the intersection of CS and molecular biology

Distinct field because researchers in the field should know CS, biology, and bioinformatics

Spectrum from CS research to biology service

Rich source of challenging CS problems

Large, noisy, complex data-sets and knowledge-sets

Biologists and funding agencies demand working solutions

Bioinformatics Research

algorithms + data structures = programs

algorithms + databases = discoveries

Combine sophisticated algorithms with the right content:

  Properly structured

  Carefully curated

  Relevant data fields

  Proper amount of data

Goals of Systems Biology

Catalog the molecular parts lists of cells

Understand the function(s) of each part

Understand how those parts interact to produce the behavior of a cell or organism

Understand the evolution of those molecular parts

Analogy: Genome Analysis and Virus Analysis

Given: Virus binary executable file for known machine architecture

Reverse engineer the program

  Procedures

  Call graph

  Specifications for I/O behavior of the program and all procedures

Capture and publish an annotated analysis of the virus

Comparative analysis of related viruses

Genome Analysis

Example: M. tuberculosis genome

Given: 4.4Mbp of DNA (genome)

Infer:

  Molecular parts list of Mtb

  A model of the biochemical machinery of Mtb cell

DNA is a blueprint for the program of life

Start

4.4Mbyte binary program

4.4Mbp DNA sequence

Step 1

Distinguish code from data segments

Find procedure boundaries

Distinguish coding from non-coding regions (gene finding)
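
The gene-finding half of Step 1 can be illustrated with a deliberately naive open-reading-frame (ORF) scan. This is a minimal sketch under stated assumptions, not the method used by any system named in this talk: it scans the forward strand only, treats ATG as the sole start codon, and uses a hypothetical `min_codons` threshold. Real gene finders rely on statistical models and examine both strands.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) index pairs of forward-strand ORFs, end exclusive.

    An ORF here is an ATG followed in-frame by the next stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":              # open a candidate ORF
                    start = i
            elif codon in STOP_CODONS:          # close it at the first stop
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Toy sequence, invented for illustration.
print(find_orfs("CCATGAAATTTGGGTAACC"))
```

The analogy with binary analysis holds even at this level: the scan separates stretches that could be "code" from everything else using only syntactic signals, and says nothing yet about what the code does.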

Step 2

Predict semantics of procedures

Predict gene functions

Step 3

Predict procedure call graph

Predict biochemical and gene networks

Step 4

Predict conditions under which procedures are invoked

Predict expression of network fragments

Step 5

Infer complete program specification

Formulate dynamic cellular simulation

Step 6

Internet publishing of structured program annotation with explanations, references, commentary

Internet publishing of structured genome annotation with explanations, references, commentary

Step 7

Comparative analysis of viruses

Evolutionary relationships among viruses

Comparative analysis of genomes

Evolutionary relationships among genomes

Step 8

Identify measures to disable virus or prevent its spread

Identify target proteins for anti-microbial drug discovery

Database of Viruses

Create a database that stores

  Binaries for all viruses

  All annotation of virus programs by different investigators

  Comparative analyses

Support

  Remote API access

  Click-at-a-time browsing

Reference on Major Genome Databases

Nucleic Acids Research Database Issue

http://nar.oupjournals.org/content/vol30/issue1/

112 databases

Questions to Ask of a New Genome Database

What are Database Goals and Requirements?

How many users?

What expertise do users have?

What problems will database be used to solve?

What is its Organizing Principle?

Different DBs partition the space of genome information in different dimensions

Experimental methods (Genbank, PDB)

Organism (EcoCyc, Flybase)

What is its Level of Interpretation?

Laboratory data

Primary literature (Genbank)

Review (SwissProt, MetaCyc)

Does DB model disagreement?

What are its Semantics and Content?

What entities and relationships does it model?

How does its content overlap with similar DBs?

How many entities of each type are present?

Sparseness of attributes and statistics on attribute values
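
Attribute sparseness can be measured mechanically once records are available in bulk. A minimal sketch, assuming records arrive as Python dictionaries and that `None` or an empty string means a missing value; the gene records and field names are invented for illustration.

```python
from collections import Counter


def attribute_sparseness(records):
    """Map each attribute name to the fraction of records lacking a value."""
    names = set()
    for rec in records:
        names.update(rec)                      # collect every attribute seen
    missing = Counter()
    for rec in records:
        for name in names:
            if rec.get(name) in (None, ""):    # absent, None, or empty string
                missing[name] += 1
    total = len(records)
    return {name: missing[name] / total for name in sorted(names)}


# Hypothetical records, invented for illustration only.
genes = [
    {"name": "geneA", "product": "kinase", "ec_number": None},
    {"name": "geneB", "product": ""},
    {"name": "geneC", "product": "transporter", "ec_number": "1.1.1.1"},
    {"name": "geneD"},
]
print(attribute_sparseness(genes))
```

A high fraction for an attribute suggests that queries depending on it will silently miss most of the database, which is worth knowing before relying on it.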

What are Sources of its Data?

Potential information sources

  Laboratory instruments

  Scientific literature

    Manual entry

    Natural-language text mining

  Direct submission from the scientific community

    Genbank

Modification policy

  DB staff only

  Submission of new entries by scientific community

  Update access by scientific community

What DBMS is Employed?

None

Relational

Object oriented

Frame knowledge representation system

Distribution / User Access

Multiple distribution forms enhance access

Browsing access with visualization tools

API

Portability

What Validation Approaches are Employed?

None

Declarative consistency constraints

Programmatic consistency checking

Internal vs external consistency checking

What types of systematic errors might DB contain?
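
The difference between declarative constraints and programmatic checking can be sketched with Python's built-in `sqlite3` module. The schema, the table and column names, and the rule that a gene's end position must exceed its start are all assumptions made for illustration; they are not taken from any database named in this talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")       # SQLite enforces FKs only when asked

# Declarative constraints: stated once in the schema, enforced by the DBMS.
conn.executescript("""
CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE gene (
    id          INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(id),
    name        TEXT NOT NULL,
    start_pos   INTEGER NOT NULL CHECK (start_pos >= 1),
    end_pos     INTEGER NOT NULL,
    CHECK (end_pos > start_pos)
);
""")
conn.execute("INSERT INTO organism VALUES (1, 'Example organism')")
conn.execute("INSERT INTO gene VALUES (1, 1, 'geneA', 100, 400)")
conn.execute("INSERT INTO gene VALUES (2, 1, 'geneB', 350, 900)")


def rejected(sql):
    """Return True if the DBMS refuses the statement on a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


bad_coords = rejected("INSERT INTO gene VALUES (3, 1, 'geneC', 500, 200)")
bad_organism = rejected("INSERT INTO gene VALUES (4, 99, 'geneD', 1, 50)")

# Programmatic check: a rule run by a program, here flagging overlapping
# genes for curator review rather than rejecting them outright.
overlaps = conn.execute("""
    SELECT a.name, b.name
      FROM gene a JOIN gene b
        ON a.organism_id = b.organism_id AND a.id < b.id
       AND a.start_pos <= b.end_pos AND b.start_pos <= a.end_pos
""").fetchall()
print(bad_coords, bad_organism, overlaps)
```

Both checks here are internal: they compare the database only against itself. External consistency checking would compare its contents against another source, which this sketch does not attempt.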

Database Documentation

Schema and its semantics

Format

API

Data acquisition techniques

Validation techniques

Size of different classes

Coverage of subject matter

Sparseness of attributes

Error rates

Relationship of Database Field to Bioinformatics

Scientists generally ignorant of basic DB principles

  Complex queries vs click-at-a-time access

  Data model

  Defined semantics for DB fields

  Controlled vocabularies

  Regular syntax for flatfiles

  Automated consistency checking

Most biologists take one programming class

Evolution of typical genome database

Finer points of DB research off their radar screen

Handful of DB researchers work in bioinformatics
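
Three of the principles listed above (regular syntax for flatfiles, controlled vocabularies, and automated consistency checking) fit in a few dozen lines. The sketch below is illustrative only: the `FIELD - value` line syntax, the `//` record terminator, the `EVIDENCE` field, and its vocabulary are assumptions, not the format of any particular database.

```python
import re

# Assumed syntax: one "FIELD - value" pair per line; "//" closes a record.
FIELD_LINE = re.compile(r"^([A-Z][A-Z-]*) - (.+)$")

# Hypothetical controlled vocabulary for an EVIDENCE field.
EVIDENCE_TERMS = {"experimental", "computational", "author-statement"}


def parse_flatfile(text):
    """Return a list of records; each maps a field name to a list of values."""
    records, current = [], {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line == "//":                       # record terminator
            records.append(current)
            current = {}
            continue
        match = FIELD_LINE.match(line)
        if match is None:                      # syntax violations fail loudly
            raise ValueError(f"line {lineno}: expected 'FIELD - value', got {line!r}")
        current.setdefault(match.group(1), []).append(match.group(2))
    if current:
        raise ValueError("final record is missing its '//' terminator")
    return records


def vocabulary_errors(records):
    """Return (record ID, term) pairs for EVIDENCE values outside the vocabulary."""
    errors = []
    for record in records:
        record_id = record.get("ID", ["?"])[0]
        for term in record.get("EVIDENCE", []):
            if term not in EVIDENCE_TERMS:
                errors.append((record_id, term))
    return errors


# Sample file contents, invented for illustration.
sample = """\
ID - G001
NAME - geneA
EVIDENCE - experimental
//
ID - G002
EVIDENCE - probably
//
"""
records = parse_flatfile(sample)
print(vocabulary_errors(records))
```

Because the syntax is regular, a malformed line is rejected at load time instead of surfacing later as a silently wrong query result.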

Database Field

For many years, the majority of bioinformatics DBs did not employ a DBMS

Flatfiles were the rule

  Scientists want to see the data directly

  Commercial DBMSs too expensive, too complex

  DBAs too expensive

Most scientists do not understand

  Differences between BA, MS, PhD in CS

  CS research vs applications

  Implications for project planning, funding, bioinformatics research

Recommendation

Teaching scientists programming is not enough

Teaching scientists how to build a DBMS is irrelevant

Teach scientists basic aspects of databases and symbolic computing

  Database requirements analysis

  Data models, schema design

  Knowledge representation, ontologies

  Formal grammars

  Complex queries

  Database interoperability