Identifying structural templates using alignments of designed sequences Stefan M. Larson Pande Group Biophysics Program December, 2002 [email protected]


Structure prediction & sequence space

[Figure: placeholder amino-acid strings of varying length, illustrating sequence space]

Multiple sequence alignments aid comparative protein modeling

• 1 in 3 sequences are recognizably related to at least one protein structure.

• A significant fraction of the remaining 2/3 have solved structural homologues, but they are not recognized through sequence similarity searching techniques.

• Marti-Renom et al. (2000)

• Multiple sequence alignments greatly improve the efficacy and accuracy of almost all phases of comparative modeling.

• Venclovas (2001)

Computational protein design

[Diagram labels: Native structure, Iterative refinement, New sequence]

Large scale sequence generation

“Reverse BLAST” study:

Total sequences generated: 200,000
Processors available: 4,000
Total time of data collection: 80 days
Total backbone variants: 26,400
Total structures: 264
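The totals listed for the “Reverse BLAST” study imply a few simple averages. The sketch below is back-of-the-envelope arithmetic only: it assumes, as a simplification not stated on the slide, that backbone variants and sequences were spread evenly across structures, and it treats processors × days as an upper bound on compute rather than a measured figure.

```python
# Totals as listed on the slide.
total_sequences = 200_000
total_backbone_variants = 26_400
total_structures = 264
processors_available = 4_000
days_of_data_collection = 80

# Assumption (not stated in the source): work was spread evenly.
variants_per_structure = total_backbone_variants / total_structures
sequences_per_variant = total_sequences / total_backbone_variants
sequences_per_structure = total_sequences / total_structures

# Upper bound only: assumes every processor was busy for the whole period.
processor_days_upper_bound = processors_available * days_of_data_collection

print(f"backbone variants per structure: {variants_per_structure:.0f}")
print(f"sequences per backbone variant:  {sequences_per_variant:.2f}")
print(f"sequences per structure:         {sequences_per_structure:.1f}")
print(f"processor-days (upper bound):    {processor_days_upper_bound}")
```

Under the even-spread assumption, each structure contributes 100 backbone variants, which is consistent with the two totals on the slide (26,400 / 264).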

“Reverse BLAST”: finding templates for comparative modeling

Larson SM, Garg A, Desjarlais JR, Pande VS. (2003) Proteins: Structure, Function, and Genetics

Experiment: Sequence quality

[Figure: workflow with placeholder sequences and steps labeled “Design” and “BLAST E<0.01”; structure images from PDBsum entries 1be7, 1ctx, 1ame, 1ae3, 1bpi, 1ag6]
Results: Sequence quality

[Figure: E-value of best PDB hit (log scale, 1E-17 to 10) and average identity to native sequence (%, 0 to 30), plotted against designed sequence profile ranked by E-value (0 to 225)]

Method: “Reverse BLAST”

[Figure: “Reverse BLAST” workflow. Column labels: Designed Sequences, Hypothetical Proteins, Structural Templates. Search step: BLAST E<0.01. Structure images from PDBsum entries 1ame, 1ae3, 1bpi]
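The hit-filtering step of this workflow can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the study: it assumes BLAST results are available in the standard 12-column tabular format (E-value in column 11), that designed sequences are the queries, and that each designed sequence is named `<pdb_id>_<n>` after the structure it was designed on. The naming convention, the function name, and the toy hit lines are all hypothetical.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.01  # threshold shown on the slide (BLAST E<0.01)


def templates_from_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Map each hypothetical protein to candidate structural templates.

    Each line is expected in BLAST tabular format: 12 tab-separated
    columns with query ID, subject ID, ..., E-value (column 11), bit score.
    Query IDs are assumed to look like '<pdb_id>_<n>' (hypothetical
    convention for designed sequences).
    """
    best = defaultdict(dict)  # protein ID -> {pdb_id: best E-value}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        evalue = float(fields[10])
        if evalue >= cutoff:
            continue  # not significant at the chosen threshold
        pdb_id = query_id.split("_")[0]
        previous = best[subject_id].get(pdb_id)
        if previous is None or evalue < previous:
            best[subject_id][pdb_id] = evalue
    return {protein: dict(hits) for protein, hits in best.items()}


# Toy data for illustration only; these are not results from the study.
toy_hits = [
    "1ame_0001\thypo_A\t32.0\t60\t40\t1\t1\t60\t5\t64\t2e-05\t45.1",
    "1ame_0002\thypo_A\t30.0\t58\t40\t1\t1\t58\t5\t62\t0.004\t38.0",
    "1bpi_0007\thypo_B\t28.0\t50\t36\t0\t1\t50\t3\t52\t0.5\t25.0",
]
print(templates_from_hits(toy_hits))
```

Keeping only the best E-value per (protein, structure) pair collapses the many designed sequences for one structure into a single template assignment, which is the quantity the later slides count.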





Do the designed sequences help?

[Figure: Correctly identified structural templates. x-axis: E-value threshold (-log(E)), 2 to 10. Left y-axis: hits with sequence alignment : hits without, 0 to 5. Right y-axis: total unique hits, 0 to 160. Series: fold-increase in # of templates, fold-increase in # of genes, total hits]
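The ratio on the left axis (hits with the designed-sequence alignment : hits without) can be computed at each threshold as in the sketch below. It assumes that -log(E) means base 10 and that each search is summarized as a mapping from hit ID to best E-value; the function name and the toy values are hypothetical and are not data from the study.

```python
import math


def fold_increase_by_threshold(with_alignment, without_alignment, neg_log_thresholds):
    """Count hits passing each E-value threshold and return their ratio.

    with_alignment / without_alignment: dicts mapping hit ID -> best E-value.
    neg_log_thresholds: thresholds expressed as -log10(E), e.g. 2 means E <= 0.01.
    Returns rows of (threshold, hits_with, hits_without, ratio).
    """
    rows = []
    for t in neg_log_thresholds:
        cutoff = 10.0 ** (-t)
        n_with = sum(1 for e in with_alignment.values() if e <= cutoff)
        n_without = sum(1 for e in without_alignment.values() if e <= cutoff)
        # The ratio is undefined when the baseline search finds nothing.
        ratio = n_with / n_without if n_without else math.nan
        rows.append((t, n_with, n_without, ratio))
    return rows


# Toy E-values for illustration only.
toy_with = {"t1": 1e-12, "t2": 1e-6, "t3": 1e-4, "t4": 1e-3}
toy_without = {"t1": 1e-8, "t2": 1e-3}

for row in fold_increase_by_threshold(toy_with, toy_without, [2, 5, 10]):
    print(row)
```

Reporting the raw counts alongside the ratio matters at strict thresholds, where the baseline count can drop to zero and the fold-increase stops being meaningful.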

[Figure: Number of structural templates identified (y-axis, 0 to 35) for each genome searched (x-axis). Genomes: Pyrococcus horikoshii, Sulfolobus solfataricus, Thermoplasma acidophilum, Thermoplasma volcanium, Treponema pallidum, Helicobacter pylori 26695, Helicobacter pylori J99, Campylobacter jejuni, Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis H37Rv, Rickettsia prowazekii, Chlamydophila pneumoniae AR39, Chlamydophila pneumoniae CWL029, Chlamydophila pneumoniae J138, Mycobacterium leprae, Chlamydia muridarum, Chlamydia trachomatis, Aquifex aeolicus, Mycoplasma genitalium, Mycoplasma pneumoniae, Mycoplasma pulmonis, Streptococcus pyogenes, Mesorhizobium loti, Methanococcus jannaschii, Borrelia burgdorferi, Deinococcus radiodurans, Ureaplasma urealyticum, Halobacterium sp, Caulobacter crescentus, Lactococcus lactis, Archaeoglobus fulgidus, Pyrococcus abyssi, Methanobacterium thermoautotrophicum, Neisseria meningitidis MC58, Neisseria meningitidis Z2491, Haemophilus influenzae, Xylella fastidiosa, Buchnera sp, Staphylococcus aureus Mu50, Staphylococcus aureus N315, Pasteurella multocida, Thermotoga maritima, Vibrio cholerae, Bacillus subtilis, Pseudomonas aeruginosa, Synechocystis PCC6803, Escherichia coli O157H7 EDL933, Escherichia coli O157H7, Escherichia coli K12]

Remote homology detection

Optimizing structural diversity

[Figure: x-axis: RMSD of structural ensemble (Angstroms), 0 to 2. Left y-axis: (%), 0 to 80. Right y-axis: sequence entropy, 0 to 6. Series: sequence entropy, prediction accuracy, prediction coverage, mean pairwise %ID, mean native %ID]
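The slide does not state how sequence entropy was defined. One common choice, shown here purely as an assumption, is the Shannon entropy of the residue distribution in each alignment column, averaged over columns. The function names and the toy alignment are hypothetical, and gap characters, if present, would simply be counted as another symbol.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mean_sequence_entropy(aligned_sequences):
    """Average per-column entropy over an alignment of equal-length sequences."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    columns = zip(*aligned_sequences)  # transpose rows into columns
    return sum(column_entropy(col) for col in columns) / length


# Toy alignment for illustration only: one conserved column,
# one two-way split, one fully variable column.
toy_alignment = ["ACD", "ACE", "AGF", "AGG"]
print(mean_sequence_entropy(toy_alignment))
```

With this definition, a fully conserved column contributes zero and a column with all 20 amino acids equally represented contributes log2(20), about 4.32 bits.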

Future work

• Compare “reverse BLAST” to other remote homology detection approaches (3D-PSSM, HMMER, etc.).

• Retrodict CASP targets, especially those which were not successfully predicted by comparative modeling.

• Increase the coverage and accuracy of the designed sequence sets.

Collaborators

Stanford University
• Amit Garg
• Dr. Vijay Pande

Harvard University
• Jeremy England

Xencor, Inc.
• Dr. John Desjarlais

http://www.xencor.com/
