Identifying structural templates using alignments of designed sequences
Stefan M. Larson
Pande Group, Biophysics Program
December 2002
[email protected]
Structure prediction & sequence space
[Figure: example sequences of differing lengths (placeholder letters only).]
Multiple sequence alignments aid comparative protein modeling
• 1 in 3 sequences are recognizably related to at least one protein structure.
• A significant fraction of the remaining 2/3 have solved structural homologues, but they are not recognized through sequence similarity searching techniques.
• Marti-Renom et al. (2000)
• Multiple sequence alignments greatly improve the efficacy and accuracy of almost all phases of comparative modeling.
• Venclovas (2001)
Computational protein design
[Figure: design cycle with the labels Native structure, Iterative refinement, and New sequence.]
Large scale sequence generation
• Total sequences generated: 200,000
• Processors available: 4,000
• Total time of data collection: 80 days
• Total backbone variants: 26,400
• Total structures: 264
“Reverse BLAST” study:
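The totals above imply some per-structure ratios that the slide does not state. A minimal Python sketch of that arithmetic follows; the derived quantities and variable names are illustrative only, and the 200,000 figure may be rounded, so the results should be read as approximate.

```python
# Totals quoted on the slide.
total_sequences = 200_000
backbone_variants = 26_400
structures = 264

# Derived ratios: plain division, not figures reported by the authors.
variants_per_structure = backbone_variants / structures
sequences_per_variant = total_sequences / backbone_variants
sequences_per_structure = total_sequences / structures

print(f"backbone variants per structure: {variants_per_structure:.0f}")
print(f"sequences per backbone variant:  {sequences_per_variant:.1f}")
print(f"sequences per structure:         {sequences_per_structure:.0f}")
```

Under these totals, each structure has 100 backbone variants and each variant has roughly 7.6 designed sequences on average.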
“Reverse BLAST”: finding templates for comparative modeling
Larson SM, Garg A, Desjarlais JR, Pande VS. (2003) Proteins: Structure, Function, and Genetics
Experiment: Sequence quality
[Figure: schematic with the steps Design and BLAST (E < 0.01).]
Results: Sequence quality
[Figure: x-axis: designed sequence profile (ranked by E-value), 0 to 225; y-axes: E-value of best PDB hit (log scale, 1E-17 to 10) and average identity to native sequence (%), 0 to 30.]
Method: “Reverse BLAST”
[Figure: schematic with three panels labeled Designed Sequences, Hypothetical Proteins, and Structural Templates, with a BLAST step at E < 0.01.]
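One way to read this schematic: each designed sequence carries the identity of the structure it was designed for, so a BLAST hit below the E-value cutoff against a hypothetical protein nominates that structure as a template. The Python sketch below illustrates the filtering step only. It assumes BLAST tabular output (`-outfmt 6`, where column 11 is the E-value) and a hypothetical query naming scheme `<structure_id>_<n>`; the file format, the naming scheme, and the example hits are all assumptions for illustration, not details from the slides.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.01  # cutoff shown on the slide


def templates_from_hits(lines, cutoff=E_VALUE_CUTOFF):
    """Map each hit protein to the structures whose designed sequences matched it.

    `lines` are rows of BLAST tabular output (-outfmt 6). Query IDs are
    assumed to look like '<structure_id>_<n>' (a hypothetical convention).
    """
    templates = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        e_value = float(fields[10])
        if e_value < cutoff:
            structure_id = query_id.rsplit("_", 1)[0]
            templates[subject_id].add(structure_id)
    return dict(templates)


# Made-up example rows, for illustration only.
example_hits = [
    "1abc_17\tgeneA\t32.1\t84\t55\t2\t1\t84\t10\t93\t3e-05\t48.1",
    "1abc_42\tgeneA\t30.0\t80\t54\t2\t3\t82\t12\t91\t0.004\t41.2",
    "1abc_42\tgeneB\t25.0\t60\t44\t1\t5\t64\t20\t79\t0.5\t28.0",
    "2xyz_3\tgeneC\t35.5\t90\t56\t2\t1\t90\t4\t93\t1e-08\t60.3",
]

templates = templates_from_hits(example_hits)
for protein, structures in sorted(templates.items()):
    print(protein, sorted(structures))
```

Collecting structures in a set per protein means that several designed sequences for the same structure hitting the same protein count as one template assignment.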
Do the designed sequences help?
[Figure: Correctly identified structural templates. x-axis: E-value threshold (-log(E)), 2 to 10; y-axes: ratio of hits with sequence alignment to hits without, 0 to 5, and total unique hits, 0 to 160. Series: fold-increase in # of templates, fold-increase in # of genes, total hits.]
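The fold-increase plotted here is a ratio of hit counts with and without the designed-sequence alignments, evaluated at each E-value threshold. A minimal Python sketch of that calculation follows. It assumes the logarithm in -log(E) is base 10 and that hits are counted as unique subject IDs; both are assumptions, and the hit lists are made up for illustration.

```python
import math


def passes(e_value, min_neg_log_e):
    """True if -log10(E) meets the threshold; E = 0 always passes."""
    return e_value == 0 or -math.log10(e_value) >= min_neg_log_e


def unique_hits(hits, min_neg_log_e):
    """Count distinct subject IDs among (subject_id, e_value) pairs that pass."""
    return len({subject for subject, e in hits if passes(e, min_neg_log_e)})


def fold_increase(hits_with, hits_without, min_neg_log_e):
    """Ratio of unique hits with designed sequences to unique hits without.

    Returns None when the baseline has no hits, since the ratio is undefined.
    """
    baseline = unique_hits(hits_without, min_neg_log_e)
    if baseline == 0:
        return None
    return unique_hits(hits_with, min_neg_log_e) / baseline


# Made-up hit lists, for illustration only.
with_alignment = [("geneA", 3e-5), ("geneB", 2e-3), ("geneC", 5e-9)]
without_alignment = [("geneA", 4e-3)]

for threshold in range(2, 11):
    ratio = fold_increase(with_alignment, without_alignment, threshold)
    print(threshold, unique_hits(with_alignment, threshold), ratio)
```

Returning `None` rather than infinity for an empty baseline keeps undefined ratios from being plotted as if they were measurements.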
[Figure: bar chart. x-axis: genome searched; y-axis: number of structural templates identified, 0 to 35. Genomes, in the order shown: Pyrococcus horikoshii; Sulfolobus solfataricus; Thermoplasma acidophilum; Thermoplasma volcanium; Treponema pallidum; Helicobacter pylori 26695; Helicobacter pylori J99; Campylobacter jejuni; Mycobacterium tuberculosis CDC1551; Mycobacterium tuberculosis H37Rv; Rickettsia prowazekii; Chlamydophila pneumoniae AR39; Chlamydophila pneumoniae CWL029; Chlamydophila pneumoniae J138; Mycobacterium leprae; Chlamydia muridarum; Chlamydia trachomatis; Aquifex aeolicus; Mycoplasma genitalium; Mycoplasma pneumoniae; Mycoplasma pulmonis; Streptococcus pyogenes; Mesorhizobium loti; Methanococcus jannaschii; Borrelia burgdorferi; Deinococcus radiodurans; Ureaplasma urealyticum; Halobacterium sp; Caulobacter crescentus; Lactococcus lactis; Archaeoglobus fulgidus; Pyrococcus abyssi; Methanobacterium thermoautotrophicum; Neisseria meningitidis MC58; Neisseria meningitidis Z2491; Haemophilus influenzae; Xylella fastidiosa; Buchnera sp; Staphylococcus aureus Mu50; Staphylococcus aureus N315; Pasteurella multocida; Thermotoga maritima; Vibrio cholerae; Bacillus subtilis; Pseudomonas aeruginosa; Synechocystis PCC6803; Escherichia coli O157 H7 EDL933; Escherichia coli O157 H7; Escherichia coli K12.]
Remote homology detection
Optimizing structural diversity
[Figure: x-axis: RMSD of structural ensemble (Angstroms), 0 to 2; y-axes: percentage (%), 0 to 80, and sequence entropy, 0 to 6. Series: sequence entropy, prediction accuracy, prediction coverage, mean pairwise %ID, mean native %ID.]
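The diversity measures named in this plot can be computed from a set of aligned designed sequences. The Python sketch below shows one plausible set of definitions: Shannon entropy in bits averaged over alignment columns, and percent identity over ungapped, equal-length sequences. The slides do not define these quantities, so the formulas, function names, and toy sequences are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    total = len(column)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(column).values())


def mean_sequence_entropy(alignment):
    """Average column entropy over an ungapped, equal-length alignment."""
    entropies = [column_entropy(col) for col in zip(*alignment)]
    return sum(entropies) / len(entropies)


def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_identity(alignment):
    """Mean percent identity over all pairs of sequences in the set."""
    pairs = list(combinations(alignment, 2))
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


def mean_native_identity(native, alignment):
    """Mean percent identity of each designed sequence to the native one."""
    return sum(percent_identity(native, s) for s in alignment) / len(alignment)


# Toy data, for illustration only.
native = "ACDE"
designed = ["ACDE", "ACDF", "ACEF"]

print(round(mean_sequence_entropy(designed), 3))
print(round(mean_pairwise_identity(designed), 1))
print(round(mean_native_identity(native, designed), 1))
```

A more diverse set of designed sequences raises the entropy and lowers the mean pairwise identity, which is the trade-off the plot tracks against the RMSD of the structural ensemble.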
Future work
• Compare “reverse BLAST” to other remote homology detection approaches (3D-PSSM, HMMER, etc.).
• Retrodict CASP targets, especially those that were not successfully predicted by comparative modeling.
• Increase the coverage and accuracy of the designed sequence sets.
Collaborators
Stanford University
• Amit Garg
• Dr. Vijay Pande
Harvard University
• Jeremy England
Xencor, Inc.
• Dr. John Desjarlais