
April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu


Page 1: The Future of Bioinformatics

April 12, 2004
Michael Conrad Memorial Lecture

Philip E. Bourne
The University of California San Diego
pbourne@ucsd.edu
http://www.sdsc.edu/pb

Page 2:

Many of Michael’s contributions are now being more fully realized in the fields of bioinformatics and systems biology. We will explore current and future trends in these fields to further appreciate Michael’s vision.

Page 3: We Have Come a Long Way…


Page 5:

It will be through the increasing merger of computer science, computational science, information science and the life sciences that Michael’s foresights will be fully appreciated.

Large amounts of complex data put these disciplines on the same page, and the book of bioinformatics can be written. It is therefore appropriate that today we spend time looking at the immediate future of bioinformatics.

Page 6: Today’s Outline

We will address the following questions from two perspectives – data complexity and biological complexity:

• How did bioinformatics get here?
• What are the challenges today?
• What will the short and long term future hold?

Apology – many illustrations are drawn from our own work in structural bioinformatics

Page 7: Disclaimer - Plotting Change

[Figure: plot of ANYTHING against TIME, with a marker labeled "You Are Here"]

“The thing about change is that things will be different afterwards.” - Alan McMahon

Page 8: Rules of Prediction

• Looking back, everything appears to have developed faster than it did in reality
• Looking forward, everything will develop faster than you predict
• Hence, we are all very poor at predicting beyond the next 5 years – examples:
    • The Next Fifty Years: Science in the First Half of the Twenty-first Century, John Brockman (Editor)
    • CACM Volume 40, Issue 2 (February 1997)

"This is like deja vu all over again."

Page 9: Can I even do 5 years?

Page 10:

Bourne, Bioinformatics Editorial 1999 15(9):715

“Over the next 5 years there will be an estimated 10 major structural genomics efforts each yielding 200 structures per year. While these efforts will deplete regular structure determination efforts, improvements in technology and a general expansion of the field will continue to yield 50 structures per week worldwide outside of the structural genomics initiatives.”

Net result: 35,000 structures by 2005

There were 11,000 structures at the time of this prediction

"You can observe a lot just by watching."
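The arithmetic behind the "35,000 by 2005" figure can be checked with a short sketch. The rates come straight from the quoted editorial; the 52-week year, the exact 5-year horizon, and the assumption that all ten efforts run at full rate from the start are simplifications introduced here, not statements from the slide.

```python
# Back-of-envelope check of the 1999 projection quoted on this slide.
START_STRUCTURES = 11_000      # structures at the time of the prediction
SG_EFFORTS = 10                # structural genomics efforts
SG_RATE_PER_YEAR = 200         # structures per effort per year
OTHER_RATE_PER_WEEK = 50       # structures per week outside those efforts
WEEKS_PER_YEAR = 52            # assumption: full 52-week years
YEARS = 5                      # assumption: exactly five years of growth


def projected_total(years: int = YEARS) -> int:
    """Return the projected number of structures after `years` years."""
    per_year = SG_EFFORTS * SG_RATE_PER_YEAR + OTHER_RATE_PER_WEEK * WEEKS_PER_YEAR
    return START_STRUCTURES + years * per_year


print(projected_total())  # 34000
```

Under these assumptions the projection lands at 34,000, which is consistent with the rounded figure of 35,000 on the slide.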

Page 11: PDB Growth Curve

Approx. 25,000 structures today
In 2003 approx. 5,000 structures were deposited

Page 12: History

Predictions Can Be Good

Page 13: So Let Us Review the History of Bioinformatics Thus Far – General Observations

• A scientific endeavor driven out of a paradigm shift in which biology became a data driven science – today macromolecular structure data will be used to illustrate this paradigm shift
• A relatively new term for a scientific endeavor that has been around much longer
• Medical informatics preceded it, and defined some of the foundations
• A scientific endeavor that has gained from fundamental developments in computer and information science (e.g., algorithms, ontologies, Bayesian networks, simulation, neural networks, text mining) and which in turn defines new problem domains for computer science
• Systems biology may overtake it

"Do you mean now?" - when asked for the time.

Page 14: A More Specific Chronology – Pre 1970

Bioinformatics (2003) 19 2176-2190

• 1945 Biochemical Pathways – Horowitz
• 1953 Structure of DNA – W&C
• 1969 Genetic Variation

• 1953 Game Theory – von Neumann and Morgenstern
• 1959 Grammars – Chomsky
• 1962 Information Theory – Shannon & Weaver
• 1966 Cellular automata – von Neumann

• 1962 Molecular Homology – Florkin
• 1965 Evolutionary Patterns – Pauling
• 1966 Molecular Modeling – Levinthal
• 1967 Phylogenetic Trees – Fitch
• 1969 Properties – Ptitsyn
• 1970 Dynamic Programming – N&W
• 1970 Adaptability – Conrad

Page 15: A More Specific Chronology – 1970’s: Problem Definition

• Improved Sequence Alignments – Sankoff
• Structural Patterns and Properties – Richards
• Smith Waterman Algorithm
• Exon/Introns – Gilbert
• Structure Prediction – Levitt; Chou and Fasman; Scheraga
• Public Resources – Dayhoff, PDB
• Information processing in molecular systems – Conrad

Page 16: A More Specific Chronology – 1980’s: Computational Biology Emerges

• Domains recognized – Rashin
• Tree of Life Emerges
• FASTA – Lipman & Pearson
• Profiles – Gribskov
• Reductionism begins – Thornton, Sander
• Neural nets – Hopfield
• Molecular computing – Conrad
• Nanotechnology – Drexler
• Clustering – Shepard
• Relational Databases
• Networks – EMBLnet, BIONET

Page 17: A More Specific Chronology – 1990-: Bioinformatics and Biotechnology Emerge

• Human Genome Project
• Internet/Web

Conrad, M., Adaptability theory as a guide for interfacing computers and human society, Systems Research 10, 3-23 (1993).

Page 18: 2004 – Overview of the Current Challenges

[Figure labels: Genomes; Gene Products; Structure & Function; Pathways & Physiology]

~ Scientific Challenges - Deciphering the genome, mapping the genotype-phenotype relationships, dissecting organismic function, engineering organisms with altered functionality, figuring out complex traits and polymorphism, understanding physiology.

~ Algorithmic Challenges - comparisons of whole and partial genomes, metrics for similarity and homology, metabolic reconstruction, dissecting pathways, and whole cell modeling.

~ Computational Challenges - creating the informatics infrastructure, information integration, annotation, curation and dissemination of databases, development of parallel computational methods.

Page 19: Sociological Challenge

[Figure: Bioinformatics journal, submissions per year, 1997-2003 (axis 0 to 1400)]

[Figure: Bioinformatics journal, impact factor per year, 1997-2003 (axis 0 to 5)]

Data from Bioinformatics

Growth outweighs readership, particularly among biologists

Page 20: Bioinformatics - A Vice Chancellor’s View

[Figure: biological complexity against technology, by year (90, 95, 00, 05). (C) Copyright Phil Bourne 1998. Recoverable labels follow.]

Flow: Biological Experiment, Data, Information, Knowledge, Discovery

Steps: Collect, Characterize, Compare, Model, Infer

Complexity: Sequence, Structure, Assembly, Sub-cellular, Cellular, Organ, Higher-life

Technology: Computing Power, Sequencing Technology, Data (1, 10, 100, 1000, 100000)

Milestones: Human Genome Project, E. coli Genome, C. elegans Genome, Yeast Genome, 1 Small Genome/Mo., ESTs, Gene Chips, Virus Structure, Ribosome, Model Metabolic Pathway of E. coli, Genetic Circuits, Neuronal Modeling, Cardiac Modeling, Brain Mapping, Human Genome

# People/Web Site: 10^6, 10^2, 1

Page 21: A Data Centric View of the Future

• Data complexity
• High throughput data collection
• Database vs literature
• Bioinformatics as data driver
• Data representation
• Data integration

"If you come to a fork in the road, take it."

Page 22: Numbers and Complexity

[Figure: (a) myoglobin (b) hemoglobin (c) lysozyme (d) transfer RNA (e) antibodies (f) viruses (g) actin (h) the nucleosome (i) myosin (j) ribosome. Courtesy of David Goodsell, TSRI]

Complexity is increasing

Page 23: High Throughput - The Structural Genomics Pipeline (X-ray Crystallography)

Basic Steps

• Target Selection
• Crystallomics: Isolation, Expression, Purification, Crystallization
• Data Collection
• Structure Solution
• Structure Refinement
• Functional Annotation
• Publish

Bioinformatics Throughout the Process

• Bioinformatics: distant homologs, domain recognition
• Automation; Bioinformatics: empirical rules
• Automation; better sources
• Software integration; decision support
• MAD phasing; automated fitting
• Bioinformatics: alignments, protein-protein interactions, protein-ligand interactions, motif recognition
• No?

Page 24: An Aside on the Future of Publishing

Full Description Captured as the Paper/Database is Written/Deposited

Does away with ...

“… the p53 core domain structure consists of a β sandwich that serves as a scaffold for two large loops and a loop-sheet-helix motif ...” - Science Vol. 265, p346

1TSR: corresponding structure from the PDB

Oops!
β sandwich? Where?
Large loop? Which one??
Loop-sheet-helix???

Page 25: BioEditor - A DTD Driven Domain Specific Editor

http://bioeditor.sdsc.edu

Bioinformatics 2003 19(7) 897-898

Page 26: The Data - Bioinformatics Cycle

Result – Computation and Experiment Become More Synergistic

[Figure: cycle between Data and Bioinformatics]

• Turn Data into Knowledge
• Turn Knowledge into New Data Requirements

Page 27: Deuterium Exchange Mass Spec to Predict Structure

Woods, Baker et al.

[Figure: workflow combining DXMS and COREX. Labels: Target Protein; Sequence; Structure Templates (CASP; X-ray or NMR; Homology; Threading; ab initio; others); Amino Acid Stability; Profile Match Method; Best Structure(s)]

Page 28: Biological Representation

• The Gene Ontology changes everything
    • Molecular function
    • Biochemical process
    • Cellular location
    • DAG – machine usable
• The number of papers referencing the gene ontology has increased dramatically in the last year
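To illustrate what "DAG – machine usable" buys in practice, here is a minimal sketch of a directed acyclic graph of terms in which one term can have several parents, and a program collects every ancestor of a term. The term names and the `PARENTS` table are toy values invented for this example; they are not taken from the Gene Ontology files.

```python
from typing import Dict, List, Set

# Toy ontology (hypothetical terms): each term lists its direct parents.
# A term may have more than one parent, which makes this a DAG, not a tree.
PARENTS: Dict[str, List[str]] = {
    "function": [],
    "catalysis": ["function"],
    "binding": ["function"],
    "phosphotransfer": ["catalysis"],
    "nucleotide binding": ["binding"],
    "toy kinase": ["phosphotransfer", "nucleotide binding"],
}


def ancestors(term: str, parents: Dict[str, List[str]] = PARENTS) -> Set[str]:
    """Return every term reachable by following parent links from `term`."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("toy kinase")))
```

Because the relationships are explicit, a query for anything annotated under "binding" would also find the "toy kinase" term without any text matching.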

Page 29: Biological Data Representation – Future

• Tools to construct ontologies from free text?
• Ontologies for details of function, protein-protein interaction, protocols, complete pathway information

Page 30: Data Integration

Web Services – the holy grail of interoperability?

Page 31: Web Services

• It's not CORBA – biologists can do it
• You no longer have to remember where you left it – i.e. registries
• Platform independent
• Driver to force data providers to define and publish a detailed API
• Compelling - introduces the prospect of global workflow

Page 32: Perl Web Services Client Example

A small Perl program to access all Pubmed abstracts containing the word ‘ferritin’

    use strict;
    use warnings;
    use SOAP::Lite;

    # Call the remote query method; the search term comes from the command line.
    # The server URL is the placeholder shown on the slide.
    my $ids_ref = SOAP::Lite
        -> uri('http://server.location.edu/pdbWebServices')
        -> proxy('http://server.location.edu/pdbWebServices')
        -> pubmedAbstractQuery($ARGV[0])
        -> result;

    # Dereference the returned array reference and print the identifiers.
    my @ids = @{$ids_ref};
    print "@ids\n";

Mycomputer(1)% web_service.pl ferritin

1AEW 1AQO 1BCF 1BFR 1BG7 1DPS 1EUM 1FHA 1JGC 1JI5 1JIG 1MFR 1QGH 1RCC 1RCD 1RCE 1RCG 1RCI 1RYT 2FHA

Page 33: The Future - A Biological Complexity Perspective

Page 34:

[Figure: biological scales from atoms to organisms, with a discipline, example unit and technology for each. Recoverable labels follow.]

SCIENTIFIC RESEARCH & DISCOVERY: Organisms, Organs, Cells, Macromolecules, Biopolymers, Atoms & Molecules

REPRESENTATIVE DISCIPLINE: Anatomy, Physiology, Cell Biology, Proteomics, Genomics, Medicinal Chemistry

EXAMPLE UNITS: Heart, Neuron, Structure, Sequence, Protease Inhibitor

REPRESENTATIVE TECHNOLOGY: MRI, Electron Microscopy, Migratory Sensors, Ventricular Modeling, X-ray Crystallography, Protein Docking

Other labels: Technologies, Training, Infrastructure, Simulation, Data

Page 35: Exploring Biological Complexity Requires:

• We do NOT neglect the details
• Synergy between theory and experiment which highlights the need for better algorithms and quality control

But….

• We have existing and emerging technologies to measure complex systems
• Provides the opportunity to address some of biology’s fundamental questions

Page 36: Structure is a Useful Tool to Study Biological Complexity as Nature has Provided a Helping Hand…

• An average protein is 350 amino acids in length; with 20 amino acids there are 20^350 possible proteins – way more than all the atoms in the universe
• In actuality there may be only 2-5x10^6 proteins
• There are likely between 1-5000 unique folds
• Fold is far more conserved than sequence and permits us to look back farther in evolutionary time than sequence
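The size of the sequence space quoted above is easy to verify with exact integer arithmetic. The value used here for the number of atoms in the universe (about 10^80) is a commonly quoted order-of-magnitude estimate and is an assumption of this sketch, not a figure from the slide.

```python
import math

AMINO_ACIDS = 20
LENGTH = 350
ATOMS_IN_UNIVERSE_LOG10 = 80  # assumption: commonly quoted rough estimate

# Python integers are arbitrary precision, so 20**350 is computed exactly.
sequences = AMINO_ACIDS ** LENGTH
log10_sequences = LENGTH * math.log10(AMINO_ACIDS)

print(len(str(sequences)))                        # 456 decimal digits
print(round(log10_sequences, 1))                  # 455.4
print(log10_sequences > ATOMS_IN_UNIVERSE_LOG10)  # True
```

So 20^350 is roughly 10^455, which exceeds the assumed atom count by hundreds of orders of magnitude, while the slide's estimate of actual proteins is only a few million.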

Page 37:

But.. much detail remains and our current methodologies fall short..

Consider structure comparison and alignment of the diverse protein kinases

Page 38: An Example of a Structural Superfamily: The Protein Kinase-Like Superfamily

Superfamily: not all members are eukaryotic or protein kinases; some homologues discovered in bacteria phosphorylate antibiotics, others phosphorylate lipids

[Figure: Typical Kinase Core (c-Src, PDB ID: 2SRC)]

SCOP grouping for kinases

1) Class: Alpha+Beta

2) Fold: Protein Kinase Catalytic Core

3) Superfamily: Protein Kinase Catalytic Core

4) Families:

a) Ser/Thr Kinases

b) Tyr Kinases

c) Atypical Kinases

d) Antibiotic Kinases

e) Lipid Kinases

Page 39: Evolution of the Kinase Superfamily: Comparison of Three Superfamily Members

•A: Casein kinase 1 (PDB ID: 1CSN)

•B: Aminoglycoside kinase (PDB ID: 1J7L)

•C: Phosphatidylinositol 3-kinase (PDB ID: 1E8X).

•D: The previous three structures with only their shared region superposed (1CSN: light blue, 1J7L: red, 1E8X: yellow).

•The three kinases share a minimal core required for ATP binding and phosphotransfer.

Page 40:

An accurate alignment would allow us to look back farther in evolutionary time than sequence alone. Alignment algorithms need to simulate what humans can do and beyond

Page 41: An Example of Manual vs. Automated with Combinatorial Extension (CE)

•The manual alignment can be used to better understand the limitations of our automated method

•Alignment of helix C of two tyrosine kinases

•Insulin Receptor Kinase (pdb id 1IR3)

•c-Src (pdb id 2SRC)

•Can be aligned with 40% ident, 3.0Å RMSD

•In Src, C-helix is displaced and rotated outward

•Rotation pushes n-terminal end of helix out very far from n-terminal end of IRK

•CE gaps a part of this (yellow), splitting helix, aligning part of IRK helix C with loop leading to helix C in Src

Orange: IRK, Blue: c-Src, Yellow: CE gap region
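The two numbers quoted for this alignment, percent identity and RMSD, can be sketched as follows. This is a minimal illustration and not the CE algorithm: it assumes the two coordinate sets are already superposed and paired one to one, and it defines identity over aligned, non-gap columns (other conventions for the denominator exist). The function names and the sample data are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root-mean-square deviation between paired, already superposed points."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(a, b)
    )
    return math.sqrt(total / len(a))


def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent of aligned (non-gap) columns holding identical residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Made-up data: two points displaced by 3 Å along z, and a five-column alignment.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 0, 3)]))  # 3.0
print(percent_identity("AC-DE", "ACGDF"))                    # 75.0
```

A gap introduced in the middle of a helix, as described for CE above, changes which residue pairs enter both calculations, which is why the manual alignment is a useful reference.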

Page 42: Improving CEfam: Multiple Alignments with CE

•Example with strands 1 and 2 of kinase superfamily

•A: original

•B: optimal parameters

•C: manual

•Parameters also improved results with other protein superfamilies in visual analysis

•Just as sequence alignments are benchmarked against structure alignments, structure alignments should be benchmarked to manual results

•Improvement in optimization is now being folded into the next generation of CE

Page 43: Quality Control

Consider an example

The definition of domains from 3-D structure

Page 44: The 3D Domain Assignment Problem

Domain is a fundamental structural, functional and evolutionary unit of protein:

Compact

Stable

Have hydrophobic core

Fold independently

Perform specific function

Can be re-shuffled and put together in different combinations

Evolution works on the level of domain

Page 45:

Exact assignment of domains remains a difficult and unresolved problem.

There is no complete agreement among experts on domain assignment given a protein structure.

Expert methods agree on 80% of all existing manual assignments, the remaining 20% represent “difficult” cases

Expert assignment #1

Expert assignment #2

Expert assignment #3

Page 46: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

April 12, 2004April 12, 2004 Michael Conrad Memorial LectureMichael Conrad Memorial Lecture

Manual and automatic consensusagree

328 chains (77.3% of chains with consensus)

Automatic consensus only46 chains (10.9% of chains

with consensus)Manual consensus only 47 chains (11.1% of chains with consensus)

Automatic consensus and manual consensus disagree 3 chains (0.7% of chains with consensus)

Chains with manual consensus: 375 (80% of entire dataset)

Chains with automatic consensus: 374 (80% of entire dataset)

Chains with consensus (automatic or manual) : 424 (90.6% of entire dataset)

Manual vs. automatic consensuses: do they overlap?

Veretnik et al. 2004 JMB in press
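
The shares quoted on this slide can be rechecked from the four chain counts. The sketch below is only an arithmetic check, not the analysis from the cited paper; the dictionary keys, the variable names and the helper `percent_of` are made up for illustration. The recomputed shares agree with the slide to within about 0.1 percentage point, the difference being down to rounding.

```python
# Chain counts as quoted on the slide (labels are illustrative, not from the paper).
counts = {
    "manual and automatic agree": 328,
    "automatic only": 46,
    "manual only": 47,
    "manual and automatic disagree": 3,
}

# Percentages as printed on the slide, kept for comparison.
slide_percent = {
    "manual and automatic agree": 77.3,
    "automatic only": 10.9,
    "manual only": 11.1,
    "manual and automatic disagree": 0.7,
}

# Total number of chains that have any consensus (manual or automatic).
total_with_consensus = sum(counts.values())


def percent_of(part: int, whole: int) -> float:
    """Return part as a percentage of whole (unrounded)."""
    return 100.0 * part / whole


shares = {label: percent_of(n, total_with_consensus) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} chains "
          f"({share:.2f}% recomputed, {slide_percent[label]}% on slide)")
```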

Page 47: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

1cjaa (actin-fragmin kinase, slime mold): an unusual kinase [complex interface]

Assignments shown: 1 domain; 1 domain + unassigned; 4 domains

Methods shown: DALI; CATH; SCOP, PDP, DomainParser

typical kinase

Page 48: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

Exemplar Bioinformatics Problems: The Next 5 Years…

1. Full genome comparisons

2. Rapid assessment of polymorphic variations

3. Complete construction of orthologous and paralogous groups

4. Structure resolution of large assemblies/complexes

5. Dynamical simulation of realistic systems

6. Rapid structural/topological clustering of proteins

7. Protein folding

Page 49: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

Exemplar Bioinformatics Problems: The Next 5 Years

8. Computer simulation of membrane insertion

9. Simulation of cellular pathways/ sensitivity analysis of pathways stoichiometry and kinetics

10. Comparison of complex networks and pathways

11. Deciphering the metabolome

12. Integration and interpretation of data at different biological scales – genomic to population

13. Identification of biomarkers for use in diagnostic medicine

Page 50: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

These problems will be dealt with by a new generation of scientists comfortable at both the bench and the computer. Until then, bioinformaticians need to work hard to overcome the “high noon” problem.

Page 51: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

High Noon – A Working Definition

12:00

The cost:benefit ratio of entry to bioinformatics tools and resources is too high for the majority of biologists

Thus, those who could gain and contribute most from the services provided are not users

Page 52: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

One Approach - MBT

Java toolkit for developing custom molecular visualization applications

High-quality interactive rendering of: sequence, structure, function

http://mbt.sdsc.edu

Page 53: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

MBT Architecture

Page 54: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

Future - The Structure Should be the User Interface

Ligand - What other entries contain this?

Chain - What other entries have chains with >90% sequence identity?

Residue - What is the environment of this residue?
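
The chain query above depends on a sequence identity threshold. As a rough illustration only, the sketch below computes percent identity for two sequences that are assumed to be already aligned to equal length, with "-" marking gaps. The function name `percent_identity` and the choice to divide by the number of gap-free aligned columns are assumptions made here; several denominator conventions are in use, and nothing in the slides says which one the structure resources apply.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over gap-free columns of two pre-aligned sequences.

    Assumes both strings come from the same pairwise alignment, so they
    have equal length and use '-' for gaps (an illustrative convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns where neither sequence has a gap
    matches = 0   # compared columns with identical residues
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue
        compared += 1
        if res_a == res_b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def exceeds_threshold(aligned_a: str, aligned_b: str, threshold: float = 90.0) -> bool:
    """True when identity is strictly above the threshold, as in the '>90%' query."""
    return percent_identity(aligned_a, aligned_b) > threshold
```

A pair differing at one of ten aligned positions sits exactly at 90% and so would not pass a strict ">90%" filter, which is why the comparison is written as strictly greater than.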

Page 55: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

Beyond 5 Years…

Transitional medicine

Personalized medicine

Merger of medical-, chem- and bio- informatics

Societies that reflect this

Training in cooperative in silico and experimental research

Centers that reflect that training, i.e. different to NCBI or EBI

"Think! How the hell are you gonna think and hit at the same time?"

Page 56: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

Beyond 5 Years

Simulations used in the clinic setting

Smart {genome} cards

A ubiquitous life sciences Web that permits views from populations to atoms

"I knew I was going to take the wrong train, so I left early."

Page 57: April 12, 2004 Michael Conrad Memorial Lecture The Future of Bioinformatics Philip E. Bourne The University of California San Diego pbourne@ucsd.edu

Acknowledgements

To all those who have chosen bioinformatics as a career and make the field so rich

Particularly those who do so for lesser rewards – the data providers and annotators

My group for the fun we had discussing this topic

http://rinkworks.com/said/yogiberra.shtml

"I didn't really say everything I said."