77
Learning how antibodies are drafted and revised Frederick “Erick” Matsen Fred Hutchinson Cancer Research Center @ematsen http://matsen.fredhutch.org/ with Trevor Bedford (FH), Connor McCoy, Vladimir Minin (UW), and Duncan Ralph (FH)

Learning how antibodies are drafted and revised

Embed Size (px)

Citation preview

Learning how antibodies are drafted and revised

Frederick “Erick” Matsen

Fred Hutchinson Cancer Research Center

@ematsenhttp://matsen.fredhutch.org/

with Trevor Bedford (FH), Connor McCoy, Vladimir Minin (UW), and Duncan Ralph (FH)

Jenner’s 1796 vaccine

 

Where are we 200 years later?

RV144 HIV trial: 2003-200926,676 volunteers enrolled16,395 volunteers randomized125 infections$105,000,000 and 6 years

 

Prospective studies are expensive, slow, and entail complex moral issues.This does not lend itself to rapid vaccine development.

 

How might we guide vaccine development without disease exposure?

Vaccines manipulate the adaptive immune system

 

 

What can we learn from antibody-making B cells without battle-testing them through disease exposure?

Antibodies bind antigensAntigen

Light chain

Heavy chain

Too many antigens to code for directly

≈∞⋯ ∞∞

B cell diversification processV genes D genes J genes

Affinitymaturation

Somatic hypermutation

VDJ rearrangement

includingerosion and

nontemplatedinsertion

AntigenNaive B cell

Experienced B cell

What germline really looks like (Eichler and Breden groups)

Big aim: reconstruct from memory reads

ACATGGCTC...

ATACGTTCC...

TTACGGTTC...

ATCCGGTAC...

ATACAGTCT...

...

reality

...inference

Why reconstruct B cell lineages?

...

1. Vaccine design

This one is really good.How can we elicit it?

Why reconstruct B cell lineages?

...

1. Vaccine design

immunogen 1

immunogen 2

Why reconstruct B cell lineages?

...

1. Vaccine design

?

2. Vaccine assay

Why reconstruct B cell lineages?

...

1. Vaccine design

3. Evolutionary analysis to learn about underlying mechanisms

2. Vaccine assay

Goal 1: find rearrangement groups

ACATGGCTC...

ATACGTTCC...

TTACGGTTC...

ATCCGGTAC...

ATACAGTCT...

...

reality

...rearrangement groups

VDJ annotation problem: from where did each nucleotide come?

Somatic hypermutation

Sequencing primerSequencing error

3’V deletion

VD insertion

5’D deletion

3’D deletion5’J deletion

DJ insertion

Biological process

Sequencing

Inference

G

 

 

This is a key first step in BCR sequence analysis.

Data: Illumina reads from CDR3 locus

Somatic  hypermut ation

Sequencing primerSequencing error

3’V deletion

VD insertion

5’D deletion

3’D deletion5’J deletion

DJ insertion

Biological process

SequencingG

Total of about 15 million unique 130nt sequences from memory B cellpopulations of three healthy individuals A, B, and C.

“Thread” reads onto structureV genes D genes J genes

...

...

...

HMM intro: dishonest casino

6 6

HMM intro: dishonest casino

6 6

1-p

1-p

p

HMM intro: dishonest casino

6 6

1-p

1-p

p

6 6

HMM intro: dishonest casino

6 6

1-p

1-p

p

6 6

p1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p

p

1-p 1-p 1-p 1-p1-p

1-p 1-p 1-p 1-p1-p

p p p pp

1-p

1-p

p

...

...

...

...

1-p

1-p

p

1-p 1-p 1-p 1-p1-p

1-p 1-p 1-p 1-p1-p

p p p pp

1-p

1-p

p

...

...

...

...

1-p

1-p

V genes D genes J genes

...

...

...

V genes D genes J genes

...

...

...

V genes D genes J genes

...

...

...

Detour: write HMM inference package 

We wanted to use HMMoC by G Lunter (Bioinf 2007)… then tried extending StochHMM by Lott & Korf (Bioinf 2014)…

but it ended up being a complete rewrite by Duncan to make ham.

 

Takes HMM description in concise & intuitive YAML format (for CpG example, 440 chars for ham vs 5,961 for HMMoC XML)slightly faster and more memory efficient than HMMoCcontinuous integration via Docker

 

Then write BCR annotation package:

https://github.com/psathyrella/ham

https://github.com/psathyrella/partis

What are probabilities?V genes D genes J genes

...

...

...

Distributions are reproducibly weird!

bases0 5 10

frequency

0.0

0.1

0.2

0.3

0.4

IGHV270*12  V 3' deletion

ABC

IGHV270*12  V 3' deletion

bases0 5 10

frequency

0.0

0.1

0.2

0.3

0.4

IGHD114*01  D 5' deletion

ABC

IGHD114*01  D 5' deletion

bases0 5 10

frequency

0.0

0.2

0.4

0.6

IGHD727*01  D 3' deletion

ABC

IGHD727*01  D 3' deletion

bases0 5 10

frequency

0.00

0.05

0.10

0.15

0.20

IGHJ4*02  J 5' deletion

ABC

IGHJ4*02  J 5' deletion

Distributions are reproducibly weird!

position200 250

mutation freq

0.0

0.1

0.2

0.3

0.4

IGHV323D*01

ABC

IGHV323D*01

position200 250

mutation freq

0.0

0.2

0.4

0.6

IGHV333*06

ABC

IGHV333*06

Only insertions look simple

bases0 5 10 15

frequency

0.00

0.05

0.10

0.15

VD insertion

ABC

VD insertion

bases0 5 10

frequency

0.0

0.1

0.2

DJ insertion

ABC

DJ insertion

Simulate sequences to benchmark 

 

Somatic hypermutation

Sequencing primerSequencing error

3’V deletion

VD insertion

5’D deletion

3’D deletion5’J deletion

DJ insertion

Biological process

Sequencing

Inference

G

 

Simulation code independent from inference code.

Incorporating this complexity is good

hamming distance0 5 10 15

frequency

0.0

0.1

0.2

0.3

HTTNpartis (k=5)partis (k=1)ighutiliHMMunealignigblastimgt

HTTN

but there are still a number of errors.

Remember goal: find rearrangement groups

ACATGGCTC...

ATACGTTCC...

TTACGGTTC...

ATCCGGTAC...

ATACAGTCT...

...

reality

...rearrangement groups

Say we are given two sequences

1-p

p

1-p

2 ×

2 ×

Double rollof a single die

per turn

1-p

p

1-p

1-p

p

1-p

+

Two independentdie rolling games

vs.

Double roll Pair HMM↔

p

1-p 1-p 1-p 1-p1-p

1-p 1-p 1-p 1-p1-p

p p p pp

1-p

1-p

p

...

...

...

...

1-p

1-p

Do two sequences come from a singlerearrangement event?

 

The forward algorithm for HMMs gives probability of generatingobserved sequence from a given HMM:x

 

P(x) = P(x; σ),∑paths σ

 

probability of generating two sequences and from the same paththrough the HMM (summed across paths).

P(x, y) = P(x, y; σ),∑paths σ

x y

V genes D genes J genes

...

...

...

Do sets of sequences come from a single rearrangement event?

 

=P(A ∪ B)P(A)P(B)

P(A ∪ B | single rearrangement)P(A, B | independent rearrangements)

 

 

Use this for agglomerative clustering; stop when the ratio < 1.

Preliminary simulation

 

Integrate out annotation uncertainty and win.

Goal 2: how are antibodies revised?

First, investigate BCR mutation patterns

affinitymaturation

antigennaive B cell

experienced B cell

clonalexpansion

somatic hypermutation

Use two-taxon “trees” for model fittingnote: we know ancestral state within V, D, J.

VV DD JJ

IGN

OR

E

IGN

OR

E

IGN

OR

E

IGN

OR

E

 

Our “trees” have an observed read on the bottom and the corresponding“ancestral” germline sequence on top, connected by a branch,

representing some amount of divergence.

model fitGeneral Time ReversibleIndividual A Individual B Individual C

0.14

0.79

0.10

0.22

0.72

0.41

0.69

0.40

0.06

0.28

0.73

0.17

0.08

0.48

0.27

0.17

0.35

0.50

0.66

0.23

0.32

0.42

0.37

0.36

0.35

0.11

0.46

1.02

0.12

1.12

0.85

0.31

0.18

1.10

0.91

0.06

0.12

0.79

0.10

0.19

0.60

0.43

0.76

0.36

0.07

0.24

0.67

0.18

0.07

0.64

0.23

0.14

0.36

0.44

0.74

0.21

0.36

0.33

0.33

0.45

0.28

0.14

0.44

0.76

0.13

1.15

0.94

0.34

0.24

0.89

0.86

0.07

0.14

0.72

0.11

0.21

0.54

0.43

0.71

0.37

0.08

0.24

0.65

0.18

0.08

0.50

0.27

0.16

0.27

0.49

0.65

0.16

0.45

0.39

0.34

0.52

0.27

0.14

0.50

0.73

0.14

1.05

0.79

0.28

0.23

0.90

0.70

0.08

T

C

G

A

T

C

G

A

T

C

G

A

IGHV

IGHD

IGHJ

A G C T A G C T A G C Tread

germ

line

Best model according to AIC/BIC… has different matrices and fixed rate multipliers

for the different segments.

V D J

Seq. 1

Seq. 2

Seq. 3

t2

t3

t1 rDt1rJt1

rDt2rJt2

rDt3rJt3

Mutation Model

Branch length distribution under this bestmodel

IGHD rate: 3.36IGHJ rate: 0.62

IGHD rate: 4.44IGHJ rate: 0.62

IGHD rate: 3.88IGHJ rate: 0.63

Individual A Individual B Individual C

0e+00

2e+05

4e+05

6e+05

8e+05

0.0e+00

5.0e+05

1.0e+06

1.5e+06

2.0e+06

0e+00

5e+05

1e+06

0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4ML branch length

count

 

D segments evolve substantially faster than VJ segments evolve more slowly than VIndividual A has a higher mutational load.

Next consider selection (Goal 2 con’t)

affinitymaturation

antigennaive B cell

experienced B cell

clonalexpansion

somatic hypermutation

 

AAC AAG

GTGGTC

more likely

less likely

In antibodies

 

CCA CCTPro Pro

Thr Ile

ATCACC

synonymous

nonsynonymous

For selection

ὡ� ὡ� AAC AAG

GTGGTC

more likely

less likely

In antibodies

Would like per-site selection inference 

ω ≡ ≡dN

dS

rate of non-synonymous substitutionrate of synonymous substitution

 

position200 250

mutation freq

0.0

0.1

0.2

0.3

0.4

IGHV323D*01

ABC

IGHV323D*01

Productive vs. out-of-frame receptors 

Each cell may carry two IGH alleles, but only one is expressed.

 

V D J

V D J

insertion thatdisrupts frame

ω ≡ ≡dN

dS

rate of non-synonymous substitutionrate of synonymous substitution

λSo

ut−

of−

fram

e

70 80 90 100

site (IMGT numbering)

0.1

1.0

individual A B C

 

 

Out-of-frame reads can be used to infer neutral mutation rate!

is a ratio of rates in terms of observedneutral process

ωl

: nonsynonymous in-frame rate for site

: nonsynonymous out-of-frame rate for site

: synonymous in-frame rate for site

: synonymous out-of-frame rate for site

λ(N−I)l l

λ(N−O)l l

λ(S−I)l l

λ(S−O)l l

 

 

=ωl

/λ(N−I)l λ

(N−O)l

/λ(S−I)l λ

(S−O)l

Renaissance count (Lemey,Minin… 2012)

TGGCCGCGAseq−5 CCTCAAATCACTCTATGGCCGCGA

seq−2 CCACAAATCACGTTA TGGCCGCGA

ArgPro Gln

Thr

Ile Thr L eu Trp Gln

Pro

seq−1 CCACAAACCACGTTA TGGCAG

seq−3

CGA

CCTCAAACCACTCTATGGCAGCGAseq−4 CCTCAAATCACTCTA

ACCATCATC

ATCACC

ACCATC

ATC

ATC

ACC

ATC

ATCACC

ACC

ACC

ATCATC

mutation historysample

Use sampledmutation histories to estimate rates...

but suchestimatescan be unstable.

Empirical Bayes regularization to stabilize estimates

Say we are doing a per-county smoking survey.

zero smokers? Really?

Use all of the data to fit prior distribution of smoking prevalence, thenwith given observations obtain per-county posterior.

Estimating selection coefficient ωl

: nonsynonymous in-frame rate for site

: nonsynonymous out-of-frame rate for site

: synonymous in-frame rate for site

: synonymous out-of-frame rate for site

λ(N−I)l l

λ(N−O)l l

λ(S−I)l l

λ(S−O)l l

 

 

=ωl

/λ(N−I)l λ

(N−O)l

/λ(S−I)l λ

(S−O)l

Overall IGHV selection map

0.1

1.0

10.0

75 80 85 90 95 100 105

me

dia

Individual A

050

100150200

75 80 85 90 95 100 105S ite (IMGT numbering)

cou

nt

purifying neutral diversifying

Distribution of classifications across IGHV genes

Distribution ofmedian estimates of ω

Similar across individualsIndividual A

050

100150200

75 80 85 90 95 100 105

cou

nt

Individual B

050

100150200

75 80 85 90 95 100 105

cou

nt

Individual C

Site (IMGT numbering)

050

100150

75 80 85 90 95 100 105

cou

nt

purifying neutral diversifying

antigen

light chain

purifying

neutral

diversifying

Conclusion 

B cell receptors are “drafted” and “revised” randomly, but

… with remarkably consistent deletion and insertion patterns… with remarkably consistent substitution and selection

 

We can learn about these processes using model-based inference.

 

Paper on annotation with partis will be up soon is up on arXivSelection analysis paper

Thank youTrevor Bedford, Connor McCoy, Vladimir Minin & Duncan RalphPhil Bradley for doing structural workMolecular work done by Paul Lindau in Phil Greenberg’s lab withsupport from Harlan Robins and Adaptive BiotechnologiesAdaptive Biotechnologies computational biology team

 

National Science Foundation and National Institute of HealthUniversity of Washington Center for AIDS Research (CFAR)University of Washington eScience InstituteW. M. Keck Foundation

 

Addenda

Measuring clustering agreementgood agreement:

bad agreement:

Cx

Cy

Cx

Cy

 

 

Intuition: “how much variability is there in the color for amongst theitems of a given color under ?

Cx

Cy

Mutual information IThink of cluster identity under for a uniformly selected point as a

random variable (similarly for and ):Cx

X Cy Y

I(X; Y ) = H(X) − H(X|Y )where is the entropy of (ignoring ), and is the

entropy of given the value for .H(X) X Y H(X|Y )

X Y

 

I(X; Y ) = p(x, y) log ( )∑y∈Y

∑x∈X

p(x, y)p(x) p(y)

 

AMI(U, V ) =MI(U, V ) − E{MI(U, V )}

max {H(U), H(V )} − E{MI(U, V )}

Estimates of the mutational process are quiteconsistent between individuals

(each point is a single entry for one of the matrices for a pair ofindividuals.)

Branch length differences between productive,unproductive

Unproductive rearrangements are more likely to be either: unchangedfrom germline, or more divergent.

Sites are generally under purifying selectionIndividual A

Individual B

Individual C

0

200

400

600

800

0

200

400

600

800

0

200

400

600

800

−1 0 1median log10(ω)

cou

nt

purifying diversifying neutral

cou

nt

cou

nt

Similar across individuals (ii)

Distribution of amino acidsbeginningof CDR3

selection for aromaticamino acids?Frequency: left of line = out-of-frame, right of line = in-frame

Stabilize with empirical Bayes regularizationAssume that , the substitution rate at site , comes from a Gamma

distribution with shape and rate :λl l

α β

∼ Gamma(α, β).λl

 

Model total substitution counts (sampled via stochastic mapping) for asite as Poisson with rate :λl

∼ Poisson( ),Cl λl

 

Fit and to all data, then draw rates from the posterior:α̂ β̂ λl

∣ ∼ Gamma( + , 1 + ).λl Cl Cl α̂ β̂

 

We extended this regularization to case of non-constant coverage.

Sequence countsstatus A B Cfunctional 4,139,983 4,861,800 3,748,306out-of-frame 533,919 794,845 558,246stop 104,525 169,423 112,901

Correlation between sequence and GTR matrix

 

Each dot is a pair of genes.

Simulation results for selection inference

● ● ● ● ● ● ● ● ●●

● ● ● ● ● ●● ● ● ●

●●

● ● ● ●● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

●● ● ● ● ● ● ●

●● ●

● ● ● ● ● ● ● ● ● ● ●●

● ● ● ● ● ● ● ● ●●

●●

● ● ●

● ●●

●● ●

● ● ●

● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

●● ● ● ● ● ● ●

●● ●

● ● ● ● ● ● ● ● ● ● ●●

● ● ● ● ● ● ● ● ●●

● ●●

● ●

● ●

● ● ●●

● ● ● ● ● ● ●●

●●

● ● ●

●● ●

●● ●●

●● ●

0.1

1.0

10.0

0 25 50 75 100site

ω

s ynonymouschang epos s ible?

● yesno

type●●

●●

●●

purifyingneutraldiversifying

0.00

0.25

0.50

0.75

1.00

0 25 50 75 100site

Pro

po

rtio

n

typeNS

0

250

500

750

1000

0 25 50 75 100site

cove

rag

e

Omega distribution

Random factsMean length of D segment in individual A’s naive repertoire is 16.61.Subject A’s naive sequences were 37% CDR3Divergence between the various germ-line V genes:> summary(dist.dna(allele_01, pairwise.deletion=TRUE, model='raw'))Min. 1st Qu. Median Mean 3rd Qu. Max.0.003846 0.201300 0.344600 0.304700 0.384900 0.539500