35
Likelihood-based inference for partially observed processes, with applications to genetic sequence data, panel data and spatiotemporal data Edward Ionides University of Michigan, Department of Statistics Biostatistics Seminar Ohio State University Friday 12th April, 2019 Slides are online at http://dept.stat.lsa.umich.edu/ ~ ionides/talks/osu19.pdf

Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Likelihood-based inference for partially observedprocesses, with applications to genetic sequence

data, panel data and spatiotemporal data

Edward IonidesUniversity of Michigan, Department of Statistics

Biostatistics SeminarOhio State University

Friday 12th April, 2019

Slides are online athttp://dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf

Page 2: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Four motivating data analysis challenges

1 Time series analysis: cholera in Bangladesh.

Fitting nonlinear dynamic models to a single long time series.

2 Panel data analysis: dynamic variation in sexual contact rates.

Observations on a collection of units lead to a panel of time series.Analyzed together, the panel strengthens inferences available from anyone time series.

3 Genetic sequence data: HIV transmission within and betweendemographic groups.

Genetic sequences of pathogens can inform transmission relationshipsbetween infected hosts.

4 Spatiotemporal analysis: dengue in Rio de Janeiro.

Coupling between spatial locations leads to high-dimensional dynamics.

Page 3: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Commonalities between these four examples

All examples are partially observed Markov process (POMP) models.

A POMP model involves a latent dynamic process with the Markovproperty: the future given the current state does not depend on thepast.

Only noisy and incomplete measurements are available on the latentprocess.

Sequential Monte Carlo (SMC) algorithms provide a widely applicableapproach for low-dimensional systems.

Extensions to SMC are required for higher dimensional systems.

Page 4: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Monthly cholera deaths in Dhaka, Bangladesh, 1891-1940

0

2000

4000

1890 1900 1910 1920 1930 1940Year

Cho

lera

dea

ths

Page 5: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

LETTERS

Inapparent infections and cholera dynamicsAaron A. King1,2, Edward L. Ionides3, Mercedes Pascual1,4 & Menno J. Bouma5

In many infectious diseases, an unknown fraction of infectionsproduce symptoms mild enough to go unrecorded, a fact thatcan seriously compromise the interpretation of epidemiologicalrecords. This is true for cholera, a pandemic bacterial disease,where estimates of the ratio of asymptomatic to symptomaticinfections have ranged from 3 to 100 (refs 1–5). In the absenceof direct evidence, understanding of fundamental aspects of chol-era transmission, immunology and control has been based onassumptions about this ratio and about the immunological con-sequences of inapparent infections. Here we show that a modelincorporating high asymptomatic ratio and rapidly waningimmunity, with infection both from human and environmentalsources, explains 50 yr of mortality data from 26 districts ofBengal, the pathogen’s endemic home. We find that the asympto-matic ratio in cholera is far higher than had been previously sup-posed and that the immunity derived from mild infections wanesmuch more rapidly than earlier analyses have indicated. We find,too, that the environmental reservoir5,6 (free-living pathogen) isdirectly responsible for relatively few infections but that it may becritical to the disease’s endemicity. Our results demonstrate thatinapparent infections can hold the key to interpreting the patternsof disease outbreaks. New statistical methods7, which allow rig-orous maximum likelihood inference based on dynamical modelsincorporating multiple sources and outcomes of infection, season-ality, process noise, hidden variables and measurement error,make it possible to test more precise hypotheses and obtain unex-pected results. Our experience suggests that the confrontation oftime-series data with mechanistic models is likely to revise ourunderstanding of the ecology of many infectious diseases.

Cholera is a diarrhoeal disease caused by enteric infection with thebacterium Vibrio cholerae. Six of the seven cholera pandemics thathave swept the globe since 1817 originated in the low-lying, denselypopulated regions north of the Bay of Bengal, where the disease isendemic. Although much attention has been focused on cholera1,8,unsolved puzzles remain about its mode of transmission and the roleof host immunity in its dynamics. This is largely because, in regionswhere cholera is endemic, most cholera cases are mild or asympto-matic but the true extent of asymptomatic infection has been difficultto assess. Estimates of the ratio of asymptomatic to symptomaticcases vary greatly, and the importance of inapparent infections inthe dynamics of cholera outbreaks is unknown. To determine whatrole is played by inapparent infections, we used an approach thatallows indirect inference about unobserved variables.

A remarkably rich data set on the pattern of cholera epidemicsexists in the form of mortality records kept by the sanitary commis-sioners of the former British East Indian province of Bengal9. Thedata consist of monthly cholera death counts in each of 26 districtsover the period 1891–1940 (Supplementary Fig. 1). To analyse thesedata, we formulated a series of models incorporating knownor hypothesized mechanisms of transmission and immunity. A

parsimonious model for cholera dynamics is of susceptible–infectious–recovered–susceptible (SIRS) form (Fig. 1a). A novel fea-ture of this model is that it incorporates both transmission tied tohuman prevalence (using a traditional mass-action term) and trans-mission from an environmental reservoir (where the pathogen iscommonly living in aquatic environments)5,6,10–12. This model is a

1Department of Ecology and Evolutionary Biology, 2Department of Mathematics, 3Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, USA. 4Santa FeInstitute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA. 5Department of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, University ofLondon, London WC1E 7HT, UK.

a

l(t)

m M

M

g

g

ke . . . ke

ke

ke ke

ke

b

cl(t)

m

. . .

. . .

(1 − c)l(t)Y

r

c

l(t)

F+

H S

H S

H S

I R1 Rk

I R1 Rk

M

g ke ke

ke

m

I R1 Rk

Figure 1 | The mechanistic models used. a, SIRS model; b, two-path model;c, environmental-phage model. Births, related to the total population size H,are assumed to feed the pool of susceptibles, S. Individuals are susceptible toinfection when born. Exposure to the pathogen occurs at time-dependentrate l(t). c is the probability that an exposure leads to a contagious infection(class I). Note that when c 5 1 and r 5 ‘, the two-path model (b) reduces tothe SIRS model (a); when c , 1, some exposures result in short-termimmunity (class Y). Infected individuals die at an excess rate m and recoverat a rate c; the time an individual spends within the I class is exponentiallydistributed. We assume that an individual remains immune to reinfectionfor a duration gamma-distributed with mean 1/e and variance 1/ke2. Onceimmunity has waned, an individual re-enters the susceptible pool (S). Themeasured variable is monthly deaths, M. The mean duration of short-termimmunity is 1/r. Individuals in each class are subject to constantbackground mortality at rate 0.02 yr21. The force of infection, l(t), includesterms for environmental and human sources of infection and is assumed tovary seasonally. Because the seasonality of cholera dynamics in Bengal iscomplex, we used a semi-mechanistic approach: transmission was modelledby a flexible periodic function of time. In the environmental-phage model(c), as infected hosts shed pathogen, phage W builds up in the environmentand reduces transmissibility. The equations specifying these models aregiven in the Supplementary Equations.

Vol 454 | 14 August 2008 | doi:10.1038/nature07084

877

©2008 Macmillan Publishers Limited. All rights reserved

Competing POMPmodels(King et al., 2008)

S SusceptibleI InfectedRj RecoveredM MortalityH Population sizeY Asymptomatics in bΦ Phage in cλ force of infectionγ recovery rateε loss of immunitym cholera mortality

Page 6: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

2. Panel data on sexual contacts

Mathematical models of HIV transmission struggle to explainobserved incidence due to the low measured probability oftransmission per sexual contact.

The anomaly can be resolved by models that include individual-levelvariability in sexual behavior over time.

Romero-Severson et al. (2015) constructed behavioral models withvarious heterogeneities, both between individuals and withinindividuals over time. These models were fitted to behavioral paneldata.

Collections of POMP models with some shared parameters, but nodynamic interactions, are called PanelPOMP models.

Page 7: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Total sexual contacts in 6 month intervals

iterated filtering (22) implemented in pomp, version 0.43-4(29), running in R2.15.3 (30). Iterated filtering is a MonteCarlo algorithm which computes the maximum likelihoodestimate for partially observed Markov process models. Fil-tering is the numerical computation of estimating unobservedstates and evaluating the likelihood function for a partial-ly observed Markov process. Iterated filtering carries outmultiple filtering operations using a sequential Monte Carlofilter, with perturbations in the unknown model parametersdesigned so that successive filtering operations converge

toward the maximum likelihood estimate. The sequentialMonte Carlo method is a flexible nonlinear non-Gaussian fil-tering method, also known as the particle filter (31), in whichthe unknown distribution of the latent dynamic variables isrepresented by a Monte Carlo sample from this distribution(known as a swarm of particles). Successive iterations ofthe filtering process make successively smaller perturbationsto the parameters, with the heuristic that the optimization pro-cess is cooling toward a freezing point which is theoreticallyguaranteed to be a local maximum of the likelihood function

0.5 1.0 1.5 2.00.0

0.5

1.0

1.5

2.0

Follow-up Time, years

Ave

rage

Con

tact

Rat

e

0 5 10 15 200

100

200

300

400

500

A) B)

C) D)

Average Contact Rate

Cou

nt

0.5 1.0 1.5 2.00

20

40

60

80

Follow-up Time, years

No.

of C

onta

cts

−1.0 −0.5 0.0 0.5 1.00

20

40

60

80

100

120

Corrected Autocorrelation

Cou

nt

Figure 1. Features of a set of longitudinal data on rates of sexual contact among human immunodeficiency virus (HIV)-negative gay men in theUnited States, Centers for Disease Control and Prevention Collaborative HIV Seroincidence Study (1992–1995). A) Secular trend from the time ofenrollment in the cohort; B) average rate of sexual contact per month; C) rates of sexual contact over time; D) bias-corrected autocorrelation (WebAppendix 1). Black bars show autocorrelation >0, while gray bars show autocorrelation ≤0. The mean (0.076) and standard error (0.0094) of theautocorrelation imply a small but positive autocorrelation (95% confidence interval: 0.057, 0.094).

Dynamic Variation in Sexual Contact Rates 3

at University of M

ichigan on May 26, 2015

http://aje.oxfordjournals.org/D

ownloaded from

Time series for 15 units from a panel of 882 gay men who completeda 2 year longitudinal study.Sexual contacts were reported in various categories: oral, anal,protected, unprotected, etc. Here, we show total reported contacts.

Page 8: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

3. Infectious disease dynamics inferred from genetic dataSmith et al. · doi:10.NNNN/molbev/mst1 MBE

I0 I1 I2

J0 J1 J2

ρ0

γI0

ρ1

γI1

ρ2

γJ0γJ1

ψ

h

FIG. 2. A flow diagram showing the possible classes for infected individuals. The columns represent stage of disease: with

subscripts 0, 1 and 2 representing early, chronic, and AIDS stages respectively. The rows represent diagnosis status, with

the top row representing undiagnosed individuals, Ik, and the bottom row representing diagnosed individuals, Jk, where

k∈{0,1,2}. ρk are per capita rates of diagnosis and γc are rates of disease progression. Arrows out of classes that do not

flow into other classes represent the combined flow out of the infected population due to death and emigration.

to the intensive nature of the computations, further developments will be required to handle considerably

larger datasets. Some empirical results concerning how our GenSMC implementation scales with number

of sequences are given in the supplement (Section S2.3). We discuss applicability to the range of current

phylodynamic challenges in the discussion section.

A study on simulated data

Using the individual-based, stochastic model of HIV described above (Fig. 2), we simulated epidemics

conditional on observing 30 sequences. We set the length of the simulated sequences to be 100 bases. We

set parameters governing the rate of evolution at relatively high values to generate a high proportion of

variable sites. As computation scales with the number of variable sites, the computational effort in this

simulation study could be comparable to fitting real sequences of greater length. Parameters values and

their interpretations are specified in Tables 1 and 2. Algorithmic parameters are specified in Section S4.2.

Each simulated epidemic consisted of a transmission forest and a set of pathogen genetic sequences. We

randomly selected 5 epidemics to fit. Each dataset consists of two types of data: times of diagnoses and

pathogen genetic sequences. A representative simulated transmission forest and its associated pathogen

genetic sequences are shown in Fig. 3.

For each of the selected epidemics we ask two questions. First, when all other parameters are known,

is it possible to infer εI0and εI1

using only diagnosis times? Second, how does inference change when

we supplement the diagnosis data with pathogen genetic sequences? To perform this comparison we

estimated two likelihood surfaces for each epidemic: one using only the diagnosis likelihood, and one

using both the diagnosis likelihood and the genetic likelihood. We estimated each surface by using the

particle filter to compute a grid of likelihood estimates with respect to the two parameters of interest:

10

A flow diagram for HIV.

Ik classes represent undiagnosed infections.

Jk classes represent diagnosed infections.

k = 0, 1, 2 denotes early, chronic and AIDS stages.

Infection can come from within, or outside, the study population.

Genetic data give evidence on infectors as well as infectees.

Page 9: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

0

2

4

6

8

10

Tim

e

● ●

●●

14 5 2 9 1 8 6 24 21 3 15 11 4 19 18 23 20 7 13 16 27 28 22 26 29 30 17 25 12 10

●●●

●●

●●

●●

●●

●●

●I0 I1 I2 J0 J1 J2 Sequence

ACGT

A simulated HIV epidemic(Smith et al., 2017)

Top: phylogeny of observed se-quences.

Middle: simulated sequencedata from a fitted model.

Bottom: Transmission forestfor the simulated epidemic.

red: undiagnosed early infection

blue: undiagnosed chronic infection

green: diagnosed

Page 10: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Measles in 20 UK cities, 1944–1965

Leeds Manchester Liverpool Birmingham London

Bradford Hull Nottingham Bristol Sheffield

Northwich Bedwellty Consett Hastings Cardiff

Halesworth Lees Mold Dalton.in.Furness Oswestry

10

1000

10

1000

10

1000

10

1000

A few cities are above a critical community size for sustaining measles.

What is the effective transmission network between cities?

Page 11: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

A SpatPOMP for measles

S1 -µSE,1(t)ξ1(t)

E1-

µEII1 -

µIRR1

S2 -µSE,2(t)ξ2(t)E2

-µEII2 -µIR

R2

SU -µSE,U (t)ξU (t)

EU -µEI

IU -µIR

RU

A SpatPOMP is a POMP model with a collection of coupled units.

Here, units are cities, u = 1, . . . , U

Coupling arises because µSE,u(t) depends on I1, . . . , IU

Data are weekly reported case counts in each city.

Modeled using coupled over-dispersed Markov chains.

Page 12: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Innovations for general POMP models

New Monte Carlo optimization algorithms facilitate likelihoodmaximization for large partially observed Markov process (POMP)models: iterated filtering.

Iterated filtering algorithms optimize the likelihood using asequence of random parameter perturbations, with decreasingmagnitude. Sequential Monte Carlo (SMC) provides a tool fornumerical solution to this nonlinear filtering problem.Existing variations on expectation-maximization (EM) andMarkov chain Monte Carlo (MCMC) do not scale well for theseproblems.We are doing parametric inference. The main problem usinglikelihood or Bayesian methods is computational. If existingmethods worked computationally, there would be no problem!

A new perspective on likelihood-based inference via Monte Carloprofile likelihood.

Page 13: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Monte Carlo profile for genetic data on HIV dynamics

●●

●●●●

●●

●●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●●●

●●●

●●

●●●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

● ●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●

●●●

●●

●●

●●●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

0 2 4 6 8

−60

0−

500

−40

0

prof

ile lo

g lik

elih

ood

φ

Figure 2: Profile likelihood for an infectious disease transmission parameter inferred fromgenetic data on pathogens. The smoothed profile likelihood and corresponding MCAP 95%confidence interval are shown as solid red lines. The quadratic approximation in a neighbor-hood of the maximum is shown as a blue dotted line.

the capabilities of our methodology, we present three high-dimensional POMP inference chal-lenges that become computationally tractable using MCAP.

4.1 Inferring population dynamics from genetic sequence data

Genetic sequence data on a sample of individuals in an ecological system has potential toreveal population dynamics. Extraction of this information has been termed phylodynamics(Grenfell et al., 2004). Likelihood-based inference for joint models of the molecular evolu-tion process, population dynamics, and measurement process is a challenging computationalproblem. The bulk of extant phylodynamic methodology has therefore focused on inferencefor population dynamics conditional on an estimated phylogeny and replacing the popula-tion dynamic model with an approximation, called a coalescent model that is convenient forcalculations backwards in time (Karcher et al., 2016). Working with the full joint likelihoodis not entirely beyond modern computational capabilities; in particular it can be done usingthe genPomp algorithm of Smith et al. (2016). The genPomp algorithm is an applicationof iterated filtering methodology (Ionides et al., 2015) to phylodynamic models and data.To the best of our knowledge, genPomp is the first algorithm capable of carrying out fulljoint likelihood-based inference for population-level phylodynamic inference. However, thegenPomp algorithm leads to estimators with high Monte Carlo variance, indeed, too high forreasonable amounts of computation resources to reduce Monte Carlo variability to negligi-bility. This, therefore, provides a useful scenario to demonstrate our methodology.

Figure 2 presents a Monte Carlo profile computed by Smith et al. (2016), with confidence

10

φ models HIV transmitted by recently infected, diagnosed individuals.The profile confidence interval is constructed by a cutoff that isadjusted for the Monte Carlo variability (Ionides et al., 2017).

A proper 95% cutoff is 2.35. Without Monte Carlo error, it is 1.92.Each point took approximately 10 core days to compute.Alternative approaches struggle with Monte Carlo likelihood error oforder 100 log units.

Page 14: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Previous uses of SMC for phylodynamic inference

SMC techniques have previously been used for inferring phylogenies(Bouchard-Cote et al., 2012), and for phylodynamic inferenceconditional on a phylogeny (Rasmussen et al., 2011).

These approaches avoid the high-dimensional, computationallychallenging problem of joint inference.

Several innovations were necessary to realize computationally feasibleSMC on models and datasets of scientific interest.

Dimension reduction: constructing the POMP model with geneticsequences only in the measurement model to reduce the dimension ofthe latent variables.Algorithm parallelization.Hierarchical sampling.Just-in-time construction of state variables.Restriction to a class of physical molecular clocks.Maximization of the likelihood using iterated filtering.

Page 15: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

The latent process for a GenPOMP

The latent Markov process, {X(t), t ∈ T}, with T = [t0, tend],models the population dynamics and also includes any other processesneeded to describe the evolution of the pathogen.

Suppose we can write X(t) =(T (t), P(t), U(t)

), where

T (t) is the transmission forest,P(t) is the pathogen phylogeny equipped with a relaxed molecularclock,U(t) represents the state of the pathogen and host populations.

For example, U(t) may categorize each individual in the hostpopulation into classes representing different stages of infection.

We suppose that {U(t), t ∈ T} is itself a Markov process.

The plug-and-play property (Breto et al., 2009; He et al., 2010)makes our methods applicable to any latent process for which asimulator exists.

Page 16: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Simulating a GenPOMP from t1 to t2

troot

t0 t1

T (t1)

Latent state at time t1

P(t1)

troot

t0 t1 t2

T (t2)

Latent state at time t2

4

3

2

1

P(t2)

Black: transmission forest, T (t). Blue: pathogen phylogeny, P(t).

Page 17: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Annotations for the GenPOMP schematic diagram

The branching pattern of the pathogen phylogeny mirrors that ofT (t) over the interval [t0, t1], so pathogen lineages are assumed tobranch exactly at transmission events. This simplifying assumptioncan be changed.

Randomness in the rate of evolution—a relaxed molecularclock—results in random edge lengths in P(t).

At 1© , an active node splits in two when a transmission event occurs.

At 2© , an active node becomes a dead node (×) when an infectedhost emigrates, recovers, or dies.

At 3© , an immigration event gives rise to a new active node with itsown root.

At 4© , a sequence node (•) is spawned when a sample is taken.

Page 18: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

GenSMC: Sequential Monte Carlo for a GenPOMPParticle 1 Particle 2 Particle 3

troot t0 t1 troot t0 t1 troot t0 t1

t1 t2 t1 t2 t1 t2

1. Proposal. Simulateparticles forward fromtime t1 to time t2. Thenselect an individual to besequenced.

w1 w2 w3

P2. Weighting. Based on thestructure of the proposed trans-mission forest, construct thesubtree of the phylogeny thatconnects the observed sequences.Use this subtree to computeweight of the particle: the con-ditional probability of the newsequence.

3. Resampling. Resampleparticles with probabilityproportional to their weights.

Figure: A schematic of the particle filter. Here, we show steps to run the filterfrom the first sequence to the second. Transmission forests are shown in blackand phylogenies that connect observed sequences, P(t), are shown in blue.Observed sequences are depicted as blue dots. This schematic shows how thealgorithm uses just-in-time construction of state variables to ease computationalcosts. Although the model describes how P(t) relates to T (t) across all branchesof the transmission tree, the algorithm only constructs the subtree of thephylogeny needed to connect the observations (and therefore evaluate conditionalprobabilities of sequences). Note that in our implementation of the particle filterwe introduce additional procedures in the proposal and weighting steps. Theseprocedures, which are detailed below, allow for more accurate assessment of aparticle’s weight (through hierarchical sampling) and estimation of the conditionalprobability of a sequence under a relaxed clock. In our current implementation(Algorithm ??), assimilation of each data point is followed by systematicresampling (Arulampalam et al., 2002; Douc et al., 2005); future developmentsmay aim to increase efficiency further using alternative resampling schemes.

Page 19: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Dimension reduction: A measurement model integratingthe sequence evolution model

We put the evolutionary process for the genetic sequences into themeasurement model.

Formally, let a measurement consist of an assignment of a newsequence to an individual in the transmission tree.

The measurement density involves finding the likelihood of the newsequence given the old sequences and the tree. This likelihood can becomputed efficiently by the peeling algorithm.

Particles representing the latent process do not have to include thehigh-dimensional pathogen genome.

Page 20: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Restriction to a class of physical molecular clocks

A strict molecular clock models the rate of evolution as constantthrough time and across lineages, assuming (i) sequence evolution isMarkovian; (ii) no simultaneous mutations.These assumptions imply a Poisson-like mean-variance relationship(Breto and Ionides, 2011).Overdispersion (known as a relaxed clock) has been shown to improvethe fit of phylogenetic models to observed genetic sequences in manycases (Drummond et al., 2006).In our approach, this corresponds to constructing each edge length ofP(t) as a stochastic process on the corresponding edge of T (t).Various forms of such processes have been assumed in the literature,but not all are self-consistent under Markovian assumptions.For example, log normal clock perturbations lack an additivityproperty: adding a node to split a branch must change theevolutionary process along that branch.We suppose the relaxed clock is a non-decreasing continuous-valuedLevy process. In practice, we use a Gamma process clock.

Page 21: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

A genPomp simulation study

Top: diagnosis data only. Bottom: including sequence datagenpomp · doi:10.NNNN/molbev/mst1 MBE

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

A

0.0

0.5

1.0

1.5

2.0

0 1 2εI0

ε I 1

Log Likelihood

88 90 92

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

D

0.0

0.5

1.0

1.5

2.0

0 1 2εI0

ε I 1

Log Likelihood

−880 −870 −860 −850 −840

●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●

●●●●

●●●

●●●●

●●●

●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

B

70

75

80

85

90

95

0 1 2εI0

Log

Like

lihoo

d

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

E

−860

−855

−850

−845

−840

−835

0 1 2εI0

Log

Like

lihoo

d

● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ●● ●

● ● ● ● ●

● ●● ● ●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

C

70

75

80

85

90

95

0.0 0.5 1.0 1.5 2.0εI1

● ●

●●

● ● ●

●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

F

−860

−855

−850

−845

−840

−835

0.0 0.5 1.0 1.5 2.0εI1

FIG. 4. Grid-based estimates of likelihood surfaces and likelihood profiles from fitting to simulated data. The top row

shows the surface (A) and profiles (B and C) estimated using only the diagnosis likelihood. The bottom row shows the

surface (D) and profiles (E and F) estimated using both the diagnosis and the genetic likelihood. Red dots and red lines

indicate true values of εI0 and εI1 used in simulation. Point estimates and 95% confidence intervals are shown in green

just above the horizontal axis of the likelihood profile plots. Confidence intervals for E and F account for both statistical

uncertainty and Monte Carlo noise (Ionides et al., 2016) using a square root transformation appropriate for non-negative

parameters. Augmenting the diagnosis data with genetic data yields smaller confidence intervals for εI0 and εI1 , and resolves

the nonidentifiability of these parameters when estimated using only the diagnoses. Note that scales of the likelihood surfaces

shown in A and D are not the same; E and F have the same scale as B and C but with a vertical shift.

15

Page 22: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Detroit data: a young black MSM epidemicgenpomp · doi:10.NNNN/molbev/mst1 MBE

20

30

40

50

6019

8319

8419

8519

8619

8719

8819

8919

9019

9119

9219

9319

9419

9519

9619

9719

9819

9920

0020

0120

0220

0320

0420

0520

0620

0720

0820

0920

1020

1120

12

Year

Age

at D

iagn

osis

Count5

10

15

20

25

FIG. 5. The distribution of age at diagnosis through time for black MSM in Detroit, MI. The cohort that we selected for

analysis is outlined in red. We excluded the data from 2012 to limit effects from delays in updating the MDCH database.

29 individuals that were diagnosed at ages greater than or equal to 60 years are not shown on this plot.

19

Page 23: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Detroit data: a young black MSM epidemicSmith et al. · doi:10.NNNN/molbev/mst1 MBE

A

2460

2470

2480

2490

2500

0.0 0.5 1.0 1.5εI 0

Log

Like

lihoo

d

D

-800

-700

-600

-500

-400

-300

0.0 0.5 1.0 1.5εI 0

Log

Like

lihoo

d

G

2460

2470

2480

2490

2500

0.0 0.5 1.0 1.5εI 0

Log

Like

lihoo

d

B

2460

2470

2480

2490

2500

0.0 2.5 5.0 7.5 10.0εJ0

E

-800

-700

-600

-500

-400

-300

0.0 2.5 5.0 7.5 10.0εJ0

H

2460

2470

2480

2490

2500

4 6 8 10εJ0

C

2460

2470

2480

2490

2500

0 50 100 150ψ

F

-800

-700

-600

-500

-400

-300

0 50 100 150ψ

FIG. 6. Estimated likelihood profiles from fits to data from the black, MSM cohort. A-C show likelihood profiles computed

using only the diagnosis likelihood. D-F show likelihood profiles computed using both the diagnosis likelihood and the genetic

likelihood. G and H show likelihood profiles computed using only the diagnosis likelihood when ψ is fixed at zero. Black

dots represent particle filter likelihood evaluations of parameter sets obtained using iterated filtering. Red dots represent

mean log likelihoods of the multiple likelihood evaluations (black dots) at each point in the profile. Red lines are loess fits

to the red dots. Green bars along the lower margin of each panel encompass 95% confidence intervals for each parameter.

Confidence intervals account for both statistical uncertainty and Monte Carlo noise (Ionides et al., 2016). The smoothed

profile was calculated on the square root scale, appropriate for non-negative parameters, with a green dot indicating the

maximum.

22

A-C. Diagnosisonly.

D-F. Includingsequence data.

G-H. Diagno-sis only, fixingψ = 0.

Page 24: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Moving forward from Smith et al (2017, MBE)

genPomp was demonstrated on simulation-based phylodynamiclikelihood inference for general dynamic models with order 100sequences and order 1000 infected individuals.

Further work is needed to scale to larger systems.

Having access to the full phylodynamic likelihood facilitatesinvestigations of what (if anything) is lost by 2-step methods andsummary statistic methods such as ABC.

Preliminary results: The Volz/Rasmussen likelihood approximationworks well if the true phylogeny is known. Phylogenetic uncertainty,especially when the phylogeny is constructed under assumptionsdifferent from the latent dynamic system, can lead to substantial biasin estimates and confidence regions.

Page 25: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Strengths and limitations of the GenPOMP framework

Strengths:

A large and general model class for population dynamics.

Statistically efficient inference.

Can be used to assess loss of information and biases in methods thatscale better.

Limitations:

Computational requirement.

Some detailed individual-based models may not fit easily into theGenPOMP framework.

Page 26: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

A brief introduction to iterated filtering

Successful SMC allows likelihood evaluation.

This likelihood evaluation is both costly and noisy for non-smallproblems, so requires specialized algorithms to enable effectiveinference.

The IF1 iterated filtering algorithm of (Ionides et al., 2006) averagedfiltered parameters in a perturbed model, repeating with successivelysmaller perturbations.

The IF2 algorithm of (Ionides et al., 2015) simply feeds perturbedparticles at the end of one filtering iteration back as starting valuesfor the next iteration, with decreasing perturbations.

IF1 made possible some previously inaccessible inferences, but IF2 ismuch better!

Page 27: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Comparison of IF1 and IF2on the cholera model.

Algorithmic tuning parame-ters for both IF1 and IF2were set at the values cho-sen by King et al (2008) forIF1.

Log likelihoods of the parameter vector output by IF1 and IF2, bothstarted at a uniform draw from a large 23-dimensionalhyper-rectangle.

Dotted lines show the maximum log likelihood.

Page 28: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Monte Carlo adjusted profile (MCAP) confidence intervals

The usual cutoff δ = 1.92 for a 95% profile confidence interval isbased on an asymptotic quadratic log likelihood (Wilks’ χ2 theorem).

Profile intervals are robust to reparameterization.

A Wilks limit also applies to give a cutoff for a smoothed Monte Carloprofile based on a quadratic approximation (Ionides et al., 2017),

δMCAP = z2α

(a× SE2

mc +1

2

),

where zα is the 1− α/2 normal quantile, a is the quadratic coefficientof a quadratic regression near the profile maximum, SEmc is theMonte Carlo error of the maximum of this quadratic.

if SEmc = 0, the cutoff for α = 0.05 reduces toδMCAP = 1.962/2 = 1.92.

We apply this cutoff after estimating the profile via a locally weightedquadratic smoother.

We call this procedure a Monte Carlo adjusted profile (MCAP).

Page 29: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

A toy: MCAP for a log normal model

●●

●●

●●

●●

●● ●

●●

● ● ●

−2 −1 0 1 2

−14

0−

120

prof

ile lo

g lik

elih

ood

φ

Points show Monte Carlo profile evaluations. Black dashed lines: exactprofile and 95% confidence interval. Solid red lines: MCAP confidenceinterval. Dotted blue line: quadratic approximation.

Exact profile MCAP profile Bootstrap Quadratic

Coverage % 94.3 93.4 93.3 93.3Mean width 0.78 0.88 0.94 0.92

Page 30: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Software

pomp. An R package developed and maintained for 12yr (King et al.,2016). Various tutorials, courses and open-source examples exist.

https://kingaa.github.io/sbied/

https://ionides.github.io/531w18/

panelPomp. An R package extending pomp for PanelPOMP models(Breto et al., 2019)

genPomp. A C++ program written for (Smith et al., 2017)

spatPomp. An R package extending pomp for SpatPOMP models. Apreliminary version will be released soon.

Page 31: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

Collaborators

Contributors on the methodological developments:

Aaron King

Alex Smith

Carles Breto

Joonha Park

Dao Nguyen

Kidus Asfaw

Collaborators on the scientific work:

Ethan Romero-Severson

Mercedes Pascual

Jim Koopman

Erik Volz

Page 32: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

References I

Arulampalam, M. S., Maskell, S., Gordon, N., and Clapp, T. (2002). Atutorial on particle filters for online nonlinear, non-Gaussian Bayesiantracking. IEEE Transactions on Signal Processing, 50:174 – 188.

Bouchard-Cote, A., Sankararaman, S., and Jordan, M. I. (2012).Phylogenetic inference via sequential Monte Carlo. Systematic Biology,61(4):579–593.

Breto, C., He, D., Ionides, E. L., and King, A. A. (2009). Time seriesanalysis via mechanistic models. Annals of Applied Statistics, 3:319–348.

Breto, C. and Ionides, E. L. (2011). Compound Markov counting processesand their applications to modeling infinitesimally over-dispersed systems.Stochastic Processes and their Applications, 121:2571–2591.

Breto, C., Ionides, E. L., and King, A. A. (2019). Panel data analysis viamechanistic models. To appear in JASA. Available at Arxiv:1801.05695.

Page 33: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

References II

Douc, R., Cappe, O., and Moulines, E. (2005). Comparison of resamplingschemes for particle filtering. In Proceedings of the 4th InternationalSymposium on Image and Signal Processing and Analysis, 2005, pages64–69. IEEE.

Drummond, A. J., Ho, S. Y. W., Phillips, M. J., and Rambaut, A. (2006).Relaxed phylogenetics and dating with confidence. PLoS Biology,4(5):e88.

He, D., Ionides, E. L., and King, A. A. (2010). Plug-and-play inference fordisease dynamics: Measles in large and small towns as a case study.Journal of the Royal Society Interface, 7:271–283.

Ionides, E. L., Breto, C., and King, A. A. (2006). Inference for nonlineardynamical systems. Proceedings of the National Academy of Sciences ofthe USA, 103:18438–18443.

Ionides, E. L., Breto, C., Park, J., Smith, R. A., and King, A. A. (2017).Monte Carlo profile confidence intervals for dynamic systems. Journal ofthe Royal Society Interface.

Page 34: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

References III

Ionides, E. L., Nguyen, D., Atchade, Y., Stoev, S., and King, A. A. (2015).Inference for dynamic and latent variable models via iterated, perturbedBayes maps. Proceedings of the National Academy of Sciences of theUSA, 112:719–724.

King, A. A., Ionides, E. L., Pascual, M., and Bouma, M. J. (2008).Inapparent infections and cholera dynamics. Nature, 454:877–880.

King, A. A., Nguyen, D., and Ionides, E. L. (2016). Statistical inferencefor partially observed Markov processes via the R package pomp.Journal of Statistical Software, 69:1–43.

Rasmussen, D. A., Ratmann, O., and Koelle, K. (2011). Inference fornonlinear epidemiological models using genealogies and time series.PLoS Computational Biology, 7(8):e1002136.

Romero-Severson, E., Volz, E., Koopman, J., Leitner, T., and Ionides, E.(2015). Dynamic variation in sexual contact rates in a cohort ofHIV-negative gay men. American Journal of Epidemiology, 182:255–262.

Page 35: Likelihood-based inference for partially observed ...dept.stat.lsa.umich.edu/~ionides/talks/osu19.pdf · 1Department of Ecology and Evolutionary Biology, 2Department of Mathematics,

References IV

Smith, R. A., Ionides, E. L., and King, A. A. (2017). Infectious diseasedynamics inferred from genetic data via sequential Monte Carlo.Molecular Biology and Evolution, 34:2065–2084.