Functional Classification of PSI Proteins to Support High Throughput Biochemical Characterization: Classes of Reciprocal Sequence Homologs (CRSH)

Samuel Handelman, Nelson Tong, Jon D. Luff, David P. Lee, André Lazar, Paul Smith, Prasanna Gogate, Rohan Mallelwar and John Hunt




Bacterial physiology in the post-genome era

• Exponential growth in sequence information.

• Structural information is more difficult to obtain. Evolution is key to leveraging what we do know.

• Direct functional information is scarcer still: evolution and comparative studies are even more critical.

genome images from BacMap (UAlberta) and VirtualLaboratory; protein structure images from NESG (Columbia/Rutgers).


Even today, most proteins are of unknown biochemical function

[Figure: pie charts contrasting “known” proteins with the remainder, for two organisms.]

• E. coli (~4,200 proteins): 53% annotated “hypothetical”, “putative”, “uncharacterized” or “unknown” (01/23/08).

• H. sapiens (~27,000 proteins): 54% neither identical nor similar to any experimentally validated protein.*

*Genome Information Integration Project and H-Invitational 2 (2007) Nucleic Acids Research 36:D793-799

• Closing this gap lays the groundwork for systems biology.


CRSH Goal: Group Functionally Equivalent Homologs.

CRSH Approach:

• Homology clusters contain multiple distinct protein functions.

• Identify sub-clusters such that all members have equivalent function (in bacteria only).


Topic Overview

• CRSH: what they are, why they’re useful

• CRSH Web Interface; merits of mapping TargetDB to protein functional groups

• Using CRSH and Gene Neighborhood to predict stable tertiary interactions.


Classes of Reciprocal Sequence Homologs (CRSHs)

Input: predicted proteins from 474 fully sequenced bacterial genomes.

• Cluster based on BLAST scores; verify clusters on profile scores.

• Split into sub-clusters when multiple members come from a single organism (likely paralogs); verify sub-clusters on profile scores.

• Merge sub-clusters into classes if more similar than expected after accounting for inter-organism distances; verify final classes on profile scores.

Result: ~75,000 CRSHs, whose members likely share the same function.

Main application: gene neighborhood method. Calculate “co-localization” counts for all CRSH pairs (the number of times their genes are within 15 kb on chromosomes of fully diverged organisms).
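The co-localization count can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `count_colocalizations` name, the tuple layout of the gene records, and the use of gene start coordinates to measure distance are all assumptions. The restriction to fully diverged organisms and the wrap-around of circular chromosomes are left out, so each genome simply counts once per CRSH pair.

```python
from collections import defaultdict
from itertools import combinations

WINDOW_BP = 15_000  # "within 15 kb", as stated on the slide


def count_colocalizations(genes, window_bp=WINDOW_BP):
    """Count, for each CRSH pair, the genomes where their genes are co-localized.

    genes: iterable of (genome, chromosome, start_bp, crsh_id) tuples
           (hypothetical layout chosen for this sketch).
    Returns {(crsh_a, crsh_b): number_of_genomes} with crsh_a < crsh_b.
    """
    by_chromosome = defaultdict(list)
    for genome, chromosome, start, crsh in genes:
        by_chromosome[(genome, chromosome)].append((start, crsh))

    # Collect the co-localized pairs per genome so a genome is counted once per pair.
    pairs_per_genome = defaultdict(set)
    for (genome, _), entries in by_chromosome.items():
        entries.sort()  # ascending by start coordinate
        for (start_a, crsh_a), (start_b, crsh_b) in combinations(entries, 2):
            if crsh_a != crsh_b and start_b - start_a <= window_bp:
                pairs_per_genome[genome].add(tuple(sorted((crsh_a, crsh_b))))

    counts = defaultdict(int)
    for pairs in pairs_per_genome.values():
        for pair in pairs:
            counts[pair] += 1
    return dict(counts)
```

Counting each genome at most once per pair keeps a tandem duplication in a single organism from inflating the score; whether the original method does the same is not stated in the slides.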


Split into sub-clusters when multiple members come from a single organism

[Figure: homology graph of M. tuberculosis RV0859, E. coli PaaJ, A. tumefaciens ATU0502 and A. tumefaciens PcaF. Marked links indicate a pair of reciprocal closest homologs in their respective organisms. The cluster separates into beta-ketoadipyl CoA thiolases and acetyl-CoA acetyltransferases.]
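The “reciprocal closest homolog” test used in the figure can be sketched as below. This is an illustrative sketch under stated assumptions: the `best_hit` and `organism` dictionaries are hypothetical inputs (for example, built from top BLAST hits), and the function name is not from the original work.

```python
def reciprocal_closest_homologs(best_hit, organism):
    """Return pairs of proteins that are each other's closest homolog.

    best_hit: {(protein, target_organism): closest protein in target_organism}
    organism: {protein: organism the protein comes from}
    Both layouts are assumptions made for this sketch.
    """
    pairs = set()
    for (query, _target_org), hit in best_hit.items():
        # Look up the hit's closest homolog back in the query's own organism.
        back = best_hit.get((hit, organism[query]))
        if back == query:
            pairs.add(frozenset((query, hit)))
    return pairs
```

When two proteins from one organism both point to the same protein in another organism, only one of them can be its reciprocal partner. That asymmetry is what allows likely paralogs to be separated into different sub-clusters.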

Gene Neighborhood Preview (courtesy Marco Punta)

[Figure: CRSHs drawn as numbered octagons along the chromosomes of Genome 1, Genome 2 and Genome 3. Each octagon represents a CRSH; “co-localized” = within 15 kb.]

• Stronger neighborhood conservation => better function predictions.

• Insight into function of unknown proteins.


A Fixed Homology Threshold Fails to Reliably Segregate Functionally Equivalent Proteins

[Figure: Frequency distribution of mean %ID in CRSH. x-axis: mean %ID (0 to 100); y-axis: frequency (0 to 0.2).]

• Tremendous range in sequence conservation with more or less equivalent conservation of function.

[Figure: P(each gene neighbor is conserved), from 0.0 to 0.4, versus length-normalized BLAST bit-score (0.00 to 1.50). Series: orthologs; paralogs.]


Like Rost clusters, but for function

• Based on sequence information, you can conclude that two proteins have the same structure, even if you don’t know the structure.

• We’re working towards an analogous scheme for protein function, but each functional group needs its own cutoff.

• We propose to do this especially for proteins whose function we do not yet know.

[Figure: pairwise sequence identity (0 to 100) versus number of residues aligned. In one region, sequence identity implies structural similarity; the other is the “don't know” region. Graph courtesy Burkhard Rost.]


• We have developed a web interface for these CRSH, which is meant for use by experimentalists.

• Presently hosted in India (at http://61.8.141.68:8080/Columbia/); it will be hosted at the NESG (at www.orthology.org), where CRSH pages will be available for each entry in TargetDB.

• The CRSH pages that follow have been mapped to TargetDB, so that biologists working in the centers can access them directly.


• Within two months, we hope for a direct link from the PSI TargetDB gateway to the CRSHs.

• CRSHs already have links to BioCyc, a leading bacterial physiology database; links to other functional genomics databases are coming.

• A consensus domain architecture schematic will appear shortly.


• The applet on the left provides a graphical display of the phylogenetic distribution. In the near future, we’ll add the information from TargetDB to this applet and to the table below.

• Known complexes in BioCyc are targets for structural genomics efforts to solve multi-protein structures.

• The genetically co-localizing CRSH are promising secondary targets, as I will explain…


Gene Neighborhood Hypothesis Generation

With suggested applications in structural genomics and functional genomics

OR

Rational ideas have consequences for action; reason necessarily has a constructive function.


Known Stable Complexes Strongly Correlate with Gene Neighborhood

• For every pair of CRSH for which complex-membership data is available in BioCyc, we count the instances where the two CRSH appear in a putative operon together.

• These counts correlate strongly with well-established, well-studied, stable and definitive physical complexes (drawn in this case from BioCyc).

• These probabilities are overestimated due to the methods used.

[Figure: P(CRSH together in stable complex), from 0 to 1, versus co-localization counts (logarithmic bins, 0 to 100). Series: all hetero-complexes; heterodimers only.]
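A curve of this kind can be estimated by binning CRSH pairs on their co-localization count and taking the fraction of known complex members in each bin. The sketch below assumes power-of-two bins and a hypothetical `(count, in_complex)` input; the slides do not state the actual bin edges or how the authors computed the probabilities.

```python
from collections import defaultdict


def complex_probability_by_bin(pairs):
    """Estimate P(pair is in a stable complex) per logarithmic count bin.

    pairs: iterable of (colocalization_count, in_complex), where the count is
           a positive integer and in_complex is a bool (hypothetical input).
    Returns {bin_lower_edge: fraction of pairs in a complex}, with bins
    [1, 2), [2, 4), [4, 8), ... (power-of-two edges are an assumption).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for count, in_complex in pairs:
        lower = 1 << (count.bit_length() - 1)  # largest power of two <= count
        totals[lower] += 1
        hits[lower] += bool(in_complex)
    return {edge: hits[edge] / totals[edge] for edge in sorted(totals)}
```

Because only pairs with complex-membership annotations enter the calculation, an estimate like this inherits any bias in which proteins have been studied, which is one way such probabilities can come out too high.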


Gene Neighborhood has some Correlation with Small Molecule Interaction Partners

• For each CRSH, we extract from BioCyc a set of known small molecule interaction partners (ligands, substrates, products, etc.). We excluded very common partners (water, phosphate, ATP, etc.).

• Because proteins together in operons are often part of the same metabolic pathways or respond to similar chemical signals, it is reasonable to extrapolate small molecule interactions to the conserved gene neighbors.

• There is a definite correlation. This graph is preliminary – it is likely an underestimate.

[Figure: P(known interaction between CRSH member and small molecule), from 0 to 1, versus aggregate co-localization counts for CRSH/small molecule.]


• This view, which is still in beta, gives the known small-molecule interactions of all of the gene neighbors for a given CRSH, weighted to reflect the strength of gene neighborhood conservation.

• As well as providing a starting point for interaction screening, this can make the functional insights provided by the gene neighborhood method more accessible.
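One way to produce such a weighted view is to let each gene neighbor vote for its known small-molecule partners in proportion to its co-localization count. The weighting scheme, the function name and the input dictionaries below are assumptions for illustration; the slides say only that the view is weighted by the strength of gene neighborhood conservation.

```python
from collections import defaultdict


def rank_candidate_ligands(neighbors, known_ligands):
    """Rank small molecules for one CRSH using its conserved gene neighbors.

    neighbors: {neighbor_crsh: colocalization_count} (hypothetical input)
    known_ligands: {crsh: set of small-molecule names} (hypothetical input)
    A ligand's score is the share of total co-localization counts contributed
    by neighbors known to interact with it (an assumed weighting).
    """
    total = sum(neighbors.values())
    if total == 0:
        return []
    scores = defaultdict(float)
    for crsh, count in neighbors.items():
        for ligand in known_ligands.get(crsh, ()):
            scores[ligand] += count / total
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

The top of the ranked list is then a natural starting set for interaction screening against the CRSH of interest.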


Salvage Pipeline

• For structural genomics targets which have been cloned and are soluble, but which have failed to crystallize, we introduce a parallel pipeline to salvage them by adding “known” or predicted protein or small molecule binding partners.

• Bonus biology: whole greater than sum of parts.

[Figure: two pipeline branches, “Crystallize without Partner” and “Crystallize with Partner”.]


Concluding Remarks

• We are eager to add links to PSI resources to our CRSH pages – they are intended to facilitate collaboration between structural and functional genomics, in particular.

• Functional information can improve the impact of structural genomics efforts, and may provide new salvage pathways for difficult targets.


Thank you

John “The Jersey Eliminator” Hunt; Paul “Schmitty” Smith; Greg “Cassis” Boel; Sai “Full Nelson” Tong; Marco “The Shark” Punta; Burkhard “Wrecking Ball” Rost; Prasanna “Crackerjack” Gogate; Rohan “The Punisher” Mallelwar; Jon “JD” Luff; Liang “Red, White and Thunder” Tong; Howard “Hurricane” Shuman; Dana “Steel Toe” Pe’er; Harmen “H-Bomb” Bussemacher; Larry “The Tank” Chasin; Dre “Enter the Dragon” Lazar; David “Intravenous” Lee; Girish “Bone Breaker” Rao; Stephanie “Bronx” Wong; Diana “1-2-3” Flynn; George “El Pato Loco” Oldan; Allison “Grid Iron” Fay; Jordi “El Chupacabra” Banach; John “Steel” Dworkin; Etay “Aces” Ziv; Chris “Fireball” Wiggins; Gerwald “Sunshine” Jogl; Cal “Howitzer” Lobel; Yongzhao “Downtown” Shao; David “Finger of Death” Draper; Gae “Knuckles” Monteleone; Mike “The Red Baron” Baran; John “Mountain Man” Everett

The Hunt Lab, The NESG

American Heart Association, CF Foundation, NSF.


Consistency in CRSH sequence divergence levels between remote phyla

[Figure: two scatter plots; each dot is a CRSH. Panel 1: E. coli with S. elongatus a.a. %ID versus D. radiodurans with B. subtilis a.a. %ID (both axes 25 to 85), with binomial standard error. Panel 2: E. coli with S. elongatus length-normalized BLAST bit score versus D. radiodurans with B. subtilis length-normalized BLAST bit score (both axes 0.5 to 2.0).]


Deviation from Evolutionary Consensus in Protein Complexes

[Figure: frequency (0 to 0.4) of Spearman's rho (-1 to 1) on deviation from consensus distance. Series: interaction pairs from BioCyc; random pairs from the BioCyc interaction set, with two S.D. against hypothesis.]
