Contents of this Talk
[Used as intro to Genome Databases Seminar, 2002]
Overview of bioinformatics
Motivations for genome databases
Analogy of virus reverse-engineering to genome analysis
Questions to ask of a genome DB
Overview of Genome Databases
Peter D. Karp, Ph.D.
SRI International
www-db.stanford.edu/dbseminar/seminar.html
Talk Overview
Definition of bioinformatics
Motivations for genome databases
Computer virus analogy
Issues in building genome databases
Definition of Bioinformatics
Computational techniques for management and analysis of biological data and knowledge
Methods for disseminating, archiving, interpreting, and mining scientific information
Computational theories of biology
Genome databases are a subfield of bioinformatics
Motivations for Bioinformatics
Growth in molecular-biology knowledge (literature)
Genomics
1. Study of genomes through DNA sequencing
2. Industrial Biology
Example Genomics Datatypes
Genome sequences
  DOE Joint Genome Institute
    511M bases in Dec 2001
    11.97G bases since Mar 1999
Gene and protein expression data
Protein-protein interaction data
Protein 3-D structures
Genome Databases
Experimental data
  Archive experimental datasets
  Retrieving past experimental results should be faster than repeating the experiment
  Capture alternative analyses
  Lots of data, simpler semantics
Computational symbolic theories
  Complex theories become too large to be grasped by a single mind
  The database is the theory
  Biology is very much concerned with qualitative relationships
  Less data, more complex semantics
Bioinformatics
Distinct intellectual field at the intersection of CS and molecular biology
Distinct field because researchers in the field should know CS, biology, and bioinformatics
Spectrum from CS research to biology service
Rich source of challenging CS problems
Large, noisy, complex data-sets and knowledge-sets
Biologists and funding agencies demand working solutions
Bioinformatics Research
algorithms + data structures = programs
algorithms + databases = discoveries
Combine sophisticated algorithms with the right content:
  Properly structured
  Carefully curated
  Relevant data fields
  Proper amount of data
Goals of Systems Biology
Catalog the molecular parts lists of cells
Understand the function(s) of each part
Understand how those parts interact to produce the behavior of a cell or organism
Understand the evolution of those molecular parts
Analogy: Genome Analysis and Virus Analysis
Given: Virus binary executable file for known machine architecture
Reverse engineer the program
  Procedures
  Call graph
  Specifications for I/O behavior of the program and all procedures
Capture and publish an annotated analysis of the virus
Comparative analysis of related viruses
Genome Analysis
Example: M. tuberculosis genome
Given: 4.4Mbp of DNA (genome)
Infer:
  Molecular parts list of Mtb
  A model of the biochemical machinery of the Mtb cell
DNA is a blueprint for the program of life
Step 1
Distinguish code from data segments
Find procedure boundaries
Distinguish coding from non-coding regions: gene finding
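As a rough illustration of what "distinguishing coding from non-coding regions" can mean computationally, the sketch below scans the forward strand of a DNA string for open reading frames (a start codon followed in-frame by a stop codon). This is a toy, not the method used by any genome project: practical gene finders rely on statistical models and also scan the reverse strand. The function name `find_orfs` and the `min_codons` parameter are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon;
    `end` is exclusive and includes the stop codon. ORFs with fewer
    than `min_codons` codons before the stop are discarded.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                     # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    # Made-up sequence, for illustration only.
    print(find_orfs("CCATGAAATTTGGGTAACC"))
```

In the virus analogy, the ATG/stop pair plays the role of a procedure entry point and return.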
Step 4
Predict conditions under which procedures are invoked
Predict expression of network fragments
[Figure: diagram with node labels A, B, C, D, Q, R, S]
Step 6
Internet publishing of structured program annotation with explanations, references, commentary
Internet publishing of structured genome annotation with explanations, references, commentary
Step 7
Comparative analysis of viruses
Evolutionary relationships among viruses
Comparative analysis of genomes
Evolutionary relationships among genomes
Step 8
Identify measures to disable virus or prevent its spread
Identify target proteins for anti-microbial drug discovery
[Figure: diagram with node labels A, B, C, D, Q, R, S]
Database of Viruses
Create a database that stores
  Binaries for all viruses
  All annotation of virus programs by different investigators
  Comparative analyses
Support
  Remote API access
  Click-at-a-time browsing
Reference on Major Genome Databases
Nucleic Acids Research Database Issue
http://nar.oupjournals.org/content/vol30/issue1/
112 databases
What are Database Goals and Requirements?
How many users?
What expertise do users have?
What problems will database be used to solve?
What is its Organizing Principle?
Different DBs partition the space of genome information in different dimensions
Experimental methods (Genbank, PDB)
Organism (EcoCyc, Flybase)
What is its Level of Interpretation?
Laboratory data
Primary literature (Genbank)
Review (SwissProt, MetaCyc)
Does DB model disagreement?
What are its Semantics and Content?
What entities and relationships does it model?
How does its content overlap with similar DBs?
How many entities of each type are present?
Sparseness of attributes and statistics on attribute values
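One way to make "sparseness of attributes" concrete is to report, for each attribute, the fraction of records in which it is missing. The sketch below assumes records are plain dictionaries. The record contents, attribute names, and the function `attribute_sparseness` are hypothetical and not drawn from any real database.

```python
def attribute_sparseness(records, attributes):
    """Return {attribute: fraction of records where it is missing}.

    An attribute counts as missing when the key is absent or its
    value is None. Returns an empty dict for an empty record list.
    """
    total = len(records)
    if total == 0:
        return {}
    return {
        attr: sum(1 for rec in records if rec.get(attr) is None) / total
        for attr in attributes
    }


if __name__ == "__main__":
    # Hypothetical gene records, for illustration only.
    genes = [
        {"name": "geneA", "function": "kinase", "ec_number": "x.x.x.x"},
        {"name": "geneB", "function": None},
        {"name": "geneC", "function": "transporter"},
    ]
    for attr, frac in attribute_sparseness(
        genes, ["name", "function", "ec_number"]
    ).items():
        print(f"{attr}: {frac:.0%} missing")
```

A database that publishes such numbers lets users judge in advance whether a query over a given attribute will cover most entities or only a small fraction of them.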
What are Sources of its Data?
Potential information sources
  Laboratory instruments
  Scientific literature
    Manual entry
    Natural-language text mining
  Direct submission from the scientific community
    Genbank
Modification policy
  DB staff only
  Submission of new entries by scientific community
  Update access by scientific community
Distribution / User Access
Multiple distribution forms enhance access
Browsing access with visualization tools
API
Portability
What Validation Approaches are Employed?
None
Declarative consistency constraints
Programmatic consistency checking
Internal vs external consistency checking
What types of systematic errors might DB contain?
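The difference between declarative constraints and programmatic checking can be sketched with Python's standard `sqlite3` module. In this minimal example, `CHECK`, `NOT NULL`, and `UNIQUE` clauses are declared in the schema and enforced by the DBMS on every insert, while a separate function performs a cross-row check (overlapping gene coordinates) that is run on demand. The `gene` table, its columns, and the helper names are assumptions made for illustration, and overlapping genes are not necessarily errors in a real genome. The check simply flags candidates for review.

```python
import sqlite3


def make_db():
    """Create an in-memory DB whose schema carries declarative constraints."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE gene (
            id        INTEGER PRIMARY KEY,
            name      TEXT    NOT NULL UNIQUE,
            start_pos INTEGER NOT NULL CHECK (start_pos >= 1),
            end_pos   INTEGER NOT NULL,
            CHECK (end_pos > start_pos)
        )
    """)
    return conn


def try_insert(conn, name, start, end):
    """Insert a gene; return False if a declared constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO gene (name, start_pos, end_pos) VALUES (?, ?, ?)",
                (name, start, end),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def overlapping_genes(conn):
    """Programmatic check: list pairs of genes whose coordinates overlap."""
    return conn.execute("""
        SELECT a.name, b.name
        FROM gene a JOIN gene b
          ON a.id < b.id
         AND a.start_pos <= b.end_pos
         AND b.start_pos <= a.end_pos
        ORDER BY a.name, b.name
    """).fetchall()


if __name__ == "__main__":
    db = make_db()
    print(try_insert(db, "g1", 1, 300))    # accepted
    print(try_insert(db, "g2", 500, 400))  # rejected: end_pos <= start_pos
    print(try_insert(db, "g3", 250, 600))  # accepted
    print(overlapping_genes(db))
```

Both checks here are internal: they test the database against itself. An external check would compare entries against an independent source.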
Database Documentation
Schema and its semantics
Format
API
Data acquisition techniques
Validation techniques
Size of different classes
Coverage of subject matter
Sparseness of attributes
Error rates
Relationship of Database Field to Bioinformatics
Scientists generally ignorant of basic DB principles
  Complex queries vs click-at-a-time access
  Data model
  Defined semantics for DB fields
  Controlled vocabularies
  Regular syntax for flatfiles
  Automated consistency checking
Most biologists take one programming class
Evolution of typical genome database
Finer points of DB research off their radar screen
Handful of DB researchers work in bioinformatics
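To illustrate the contrast between complex queries and click-at-a-time access, the sketch below answers one question ("which genes of a given organism participate in a given pathway?") in two ways: as a single declarative join, and as a loop that fetches one record per request, the way a user browsing page by page would. The schema, the data, and all names are hypothetical. The example uses Python's standard `sqlite3` module.

```python
import sqlite3


def build_demo_db():
    """Build a tiny in-memory database with made-up genes and pathways."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT, organism TEXT);
        CREATE TABLE enzyme (gene_id INTEGER REFERENCES gene(id), pathway TEXT);
        INSERT INTO gene VALUES (1, 'geneA', 'org1');
        INSERT INTO gene VALUES (2, 'geneB', 'org1');
        INSERT INTO gene VALUES (3, 'geneC', 'org2');
        INSERT INTO enzyme VALUES (1, 'pathwayX');
        INSERT INTO enzyme VALUES (2, 'pathwayY');
        INSERT INTO enzyme VALUES (3, 'pathwayX');
    """)
    return conn


def genes_in_pathway_query(conn, organism, pathway):
    """Complex query: one join, evaluated inside the DBMS."""
    rows = conn.execute(
        "SELECT g.name FROM gene g JOIN enzyme e ON e.gene_id = g.id "
        "WHERE g.organism = ? AND e.pathway = ? ORDER BY g.name",
        (organism, pathway),
    )
    return [row[0] for row in rows]


def genes_in_pathway_browsing(conn, organism, pathway):
    """Click-at-a-time: visit every gene record, one lookup per 'click'."""
    hits = []
    gene_ids = [r[0] for r in conn.execute("SELECT id FROM gene ORDER BY name")]
    for gene_id in gene_ids:
        name, org = conn.execute(
            "SELECT name, organism FROM gene WHERE id = ?", (gene_id,)
        ).fetchone()
        if org != organism:
            continue
        pathways = [
            r[0]
            for r in conn.execute(
                "SELECT pathway FROM enzyme WHERE gene_id = ?", (gene_id,)
            )
        ]
        if pathway in pathways:
            hits.append(name)
    return hits


if __name__ == "__main__":
    db = build_demo_db()
    print(genes_in_pathway_query(db, "org1", "pathwayX"))
    print(genes_in_pathway_browsing(db, "org1", "pathwayX"))
```

Both functions return the same answer, but the browsing version issues a number of requests proportional to the number of genes, which is why a database that offers only click-at-a-time access is hard to use for genome-scale questions.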
Database Field
For many years, the majority of bioinformatics DBs did not employ a DBMS
Flatfiles were the rule
  Scientists want to see the data directly
  Commercial DBMSs too expensive, too complex
  DBAs too expensive
Most scientists do not understand
  Differences between BA, MS, PhD in CS
  CS research vs applications
  Implications for project planning, funding, bioinformatics research
Recommendation
Teaching scientists programming is not enough
Teaching scientists how to build a DBMS is irrelevant
Teach scientists basic aspects of databases and symbolic computing
  Database requirements analysis
  Data models, schema design
  Knowledge representation, ontologies
  Formal grammars
  Complex queries
  Database interoperability