Advanced Statistics for Interventional Cardiologists


What you will learn

• Introduction

• Basics of multivariable statistical modeling

• Advanced linear regression methods

• Hands-on session: linear regression

• Bayesian methods

• Logistic regression and generalized linear model

• Resampling methods

• Meta-analysis

• Hands-on session: logistic regression and meta-analysis

• Multifactor analysis of variance

• Cox proportional hazards analysis

• Hands-on session: Cox proportional hazards analysis

• Propensity analysis

• Most popular statistical packages

• Conclusions and take home messages

1st day

2nd day

Resampling

• Basic concepts and applications

• Bootstrap

• Jackknife

• Other approaches, including Monte Carlo, Markov chain, and Gibbs sampling methods



Major breakthroughs since the 1970s

in cardiology

in statistics*

*Source: American Statistical Association

Samples and populations

This is a sample

And this is its universal population

This is another sample

And this might be its universal population

But what if THIS is its universal population?

Resampling is based on the use of repeat samples from the original one to make inferences on the target population with few assumptions



[Figure: an original sample of five items (1 to 5) and successive resamples drawn from it]

Random resampling with replacement

Random resampling without replacement

Resampling

• Resampling refers to the use of the observed

data or of a data generating mechanism (such

as a die or computer-based simulation) to

produce new hypothetical samples, the

results of which can then be analyzed.

• The term computer-intensive methods is also frequently used to refer to such techniques.
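The difference between resampling with and without replacement can be sketched in a few lines of Python. This is only an illustration: the five-item sample, the seed value, and the variable names are assumptions chosen for the example, not part of the original slides.

```python
import random

rng = random.Random(42)          # arbitrary fixed seed so the run is repeatable
original = [1, 2, 3, 4, 5]       # hypothetical original sample

# With replacement: every draw is made from the full sample,
# so the same item may appear more than once.
with_replacement = rng.choices(original, k=len(original))

# Without replacement: an item can be drawn at most once,
# so the resample must be no larger than the original sample.
without_replacement = rng.sample(original, k=4)

print("with replacement:   ", sorted(with_replacement))
print("without replacement:", sorted(without_replacement))
```

Resampling with replacement is the scheme used by the bootstrap; resampling without replacement underlies subsampling approaches such as cross-validation and the jackknife.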

Resampling – the hot issue

• Validity of resampling processes depends

on sample size and selection:

each individual (or item) in the population

must have an equal probability of being

selected

no individual (item) or class of individuals

may be discriminated against

(Pseudo-) random numbers

• Items selected during resampling procedures are often identified by relying on random or pseudo-random numbers.

• Pseudo-random numbers are apparently random numbers generated by specific algorithms; the same sequence can be regenerated by using the same procedure and settings. Pseudo-random numbers are used very often in statistics.

(Pseudo-) random numbers

Desirable properties for random number generators are:

• Randomness: the generator provides a sequence of uniformly distributed random numbers.

• Long period: the generator can produce, without repeating the initial sequence, all of the random variables needed for the largest sample that current computer speeds allow for.

• Computational efficiency: the execution should be rapid.

• Repeatability: initial conditions (seed values) completely determine the resulting sequence of random variables.

• Portability: identical sequences of random variables may be produced on a wide variety of computers (for given seed values).

• Homogeneity: all subsets of bits of the numbers are random, from the most- to the least-significant bits.
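Repeatability can be checked directly. The sketch below uses Python's built-in generator; the helper name `draw` and the seed values are assumptions made for the illustration.

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random integers between 1 and 100 for a given seed."""
    rng = random.Random(seed)    # the seed fully determines the sequence
    return [rng.randint(1, 100) for _ in range(n)]

first = draw(seed=2024)
second = draw(seed=2024)         # same seed, so the same sequence is expected
other = draw(seed=2025)          # a different seed starts a different sequence

print(first == second)
print(first, other)
```

Recording the seed alongside the results is what makes a resampling analysis reproducible by another analyst.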

Cross-validation

• The first and simplest form of resampling is cross-validation, e.g. splitting the original sample into two halves that are analyzed separately.

• More formally, it consists of partitioning a sample of data into subsets such that the analysis is initially performed on a single subset, while the other subset(s) are retained for subsequent use in confirming and validating the initial analysis.

• The initial subset of data is called the training or derivation set.

• The other subset(s) are called validation or testing sets.
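A split into a derivation set and a validation set might look as follows. This is a minimal sketch: the function name `split_sample`, the 50/50 split, and the patient identifiers are hypothetical.

```python
import random

def split_sample(data, train_fraction=0.5, seed=0):
    """Randomly partition data into a training set and a validation set."""
    rng = random.Random(seed)
    shuffled = list(data)        # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

patients = list(range(1, 21))    # hypothetical patient identifiers 1..20
training, validation = split_sample(patients)

print("training:  ", sorted(training))
print("validation:", sorted(validation))
```

The model would be fitted on `training` only and its performance then assessed on `validation`, which played no part in the fitting.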


Bootstrap

• The bootstrap is a modern, computer-intensive, general purpose approach to statistical inference, falling within a broader class of resampling methods.

• Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution of the observed data.

Bootstrap

• In the case where a set of observations can be assumed to be from an independent and identically distributed population, the bootstrap can be implemented by constructing a number of resamples of the observed dataset (each of equal size to the observed dataset), each of which is obtained by random sampling with replacement from the original dataset.

• It may also be used for constructing hypothesis tests. It is often used as an alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors.
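As a rough illustration of the procedure just described, the sketch below estimates the standard error of a median by non-parametric bootstrap. The data values, the function name `bootstrap_estimates`, and the choice of 200 resamples are assumptions for the example.

```python
import random
import statistics

def bootstrap_estimates(data, statistic, n_resamples=200, seed=1):
    """Recompute a statistic on resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)                # each resample has the size of the original
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Hypothetical measurements, for illustration only
values = [48, 52, 55, 57, 60, 61, 63, 66, 70, 74]

estimates = bootstrap_estimates(values, statistics.median)
standard_error = statistics.stdev(estimates)   # spread of the bootstrap medians

print("observed median:", statistics.median(values))
print("bootstrap standard error:", round(standard_error, 2))
```

The same function can be reused with any other statistic (a mean, a correlation coefficient, an odds ratio) simply by passing a different `statistic` argument.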

Bootstrap

• The advantage of bootstrapping over analytical methods is its great simplicity: it is straightforward to apply the bootstrap to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratios, and correlation coefficients.

• The disadvantages of bootstrapping are that, while (under some conditions) it is asymptotically consistent, it does not provide general finite-sample guarantees and has a tendency to be overly optimistic; moreover, its apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g. independence of samples), whereas these would be more formally stated in other approaches.

Efron’s explanations




Bootstrap: how many samples

Increasing the number of bootstrap samples is like increasing the

resolution of an image








Bootstrap: how many samples

• For standard deviation/error, most practitioners suggest that the number of bootstrap samples (B) should be around 200.

• For 95% confidence intervals, B should be between 1000 and 2000.

• Further, estimating a confidence interval usually requires estimating the 100α-th percentile of the bootstrap distribution. To do this, the bootstrap sample is first sorted into ascending order. Then, if α(B+1) is an integer, the percentile is estimated by the α(B+1)th member of the ordered bootstrap sample. Otherwise, interpolation must be used between the [α(B+1)]th and ([α(B+1)]+1)th members of the ordered sample, where [ ] denotes the integer part. Consequently, choosing B=999 or B=1999 leads to simple calculations for the common choices of α.
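The α(B+1) rule can be written out as a short function. This is a sketch under stated assumptions: the name `bootstrap_percentile` is hypothetical, linear interpolation is assumed for the non-integer case, and the product is rounded only to protect against floating-point noise.

```python
import math

def bootstrap_percentile(estimates, alpha):
    """Estimate the 100*alpha-th percentile with the alpha*(B+1) rule."""
    ordered = sorted(estimates)
    b = len(ordered)
    # 1-based position in the ordered bootstrap sample
    position = round(alpha * (b + 1), 9)
    lower = math.floor(position)           # the integer part [alpha*(B+1)]
    if lower < 1 or lower > b:
        raise ValueError("alpha is too extreme for this number of resamples")
    if position == lower or lower == b:
        return ordered[lower - 1]          # integer position: no interpolation
    fraction = position - lower
    # Linear interpolation between neighbouring ordered members
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

# With B = 999, the 2.5th and 97.5th percentiles fall on members 25 and 975.
demo = list(range(1, 1000))                # stand-in for 999 bootstrap estimates
lower_limit = bootstrap_percentile(demo, 0.025)
upper_limit = bootstrap_percentile(demo, 0.975)
print(lower_limit, upper_limit)
```

With B=999 no interpolation is needed for a 95% interval, which is why values such as 999 or 1999 are convenient.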

Bootstrap: which type of resampling?

• Bootstrap with non-parametric resampling: makes no assumptions concerning the distribution of, or model for, the data.

• Bootstrap with parametric resampling: we assume that a parametric model for the data is known up to the unknown parameter vector.

• Bootstrap with semi-parametric resampling: a variant of parametric resampling, appropriate for some forms of regression.

Bootstrap: which type of resampling?

Then, which one to choose?

• Parametric and non-parametric simulation make very different assumptions. The general principle is that the simulation process should mirror as closely as possible the process that gave rise to the observed data. Thus, if we believe a particular model (i.e. we believe that the fitted model differs from the true model only because true values of the parameters have been replaced by estimates obtained from the data), then the parametric (or, in regression, preferably semi-parametric) bootstrap is appropriate.

• However, examination of the residuals may cast doubt on the modelling assumptions. In this case, non-parametric simulation is often appropriate. It is interesting to note that, in practice, non-parametric simulation gives results that generally mimic the results obtained under the best-fitting, not the simplest, parametric model.
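The two resampling schemes differ only in how each new sample is generated, as the sketch below tries to show. The normal model used for the parametric version, the data values, and all function names are assumptions made for the illustration.

```python
import random
import statistics

def nonparametric_resample(data, rng):
    """Draw from the observed values themselves, with replacement."""
    return rng.choices(data, k=len(data))

def parametric_resample(data, rng):
    """Draw from a normal model fitted to the data (an assumed model)."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [rng.gauss(mu, sigma) for _ in data]

def bootstrap_se_of_mean(data, resampler, n_resamples=200, seed=3):
    """Bootstrap standard error of the mean under a given resampling scheme."""
    rng = random.Random(seed)
    means = [statistics.mean(resampler(data, rng)) for _ in range(n_resamples)]
    return statistics.stdev(means)

# Hypothetical measurements, for illustration only
values = [48, 52, 55, 57, 60, 61, 63, 66, 70, 74]

se_nonparametric = bootstrap_se_of_mean(values, nonparametric_resample)
se_parametric = bootstrap_se_of_mean(values, parametric_resample)
print(round(se_nonparametric, 2), round(se_parametric, 2))
```

When the assumed model is reasonable the two standard errors should be of similar size; a marked discrepancy would be one more reason to question the parametric assumptions.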

Bootstrap and beyond

Example: bootstrapping a logistic regression analysis

[Table: predictors selected in each bootstrap replication]

Predictors selected in 25 bootstrap replications for the cohort study (n=155; 33 with events), based on forward stepwise logistic regression. The predictors selected by the actual data were variables 13, 15, 7, 20.

A pivotal work on bootstrap

Bootstrap for internal validation

Internal validation of predictors generated by multivariable logistic regression analysis was performed by means of bootstrapping techniques, with 1000 cycles and generation of OR and bias-corrected 95% CI [14].

14. Steyerberg EW, Harrell FE Jr, Borsboom GJJM, Eijkemans MJCR, Vergouwe Y, Habbema JDF. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001;54:774–781.


Jackknife

• Jackknifing is a resampling method based on the creation of several subsamples by excluding a single case at a time.

• Thus, there are only N jackknife samples for any given original sample with N cases.

• After the systematic recomputation of the statistic estimate of choice is completed, a point estimate and an estimate for the variance of the statistic can be calculated.
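The leave-one-out recomputation can be sketched as follows, using the standard jackknife formulas for the bias-corrected estimate and the variance. The function name `jackknife` and the data values are assumptions made for the example.

```python
import math
import statistics

def jackknife(data, statistic):
    """Return the jackknife bias-corrected estimate and standard error."""
    n = len(data)
    # N subsamples, each obtained by excluding a single case
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = sum(leave_one_out) / n
    # Jackknife variance: (n - 1) / n times the sum of squared deviations
    variance = (n - 1) / n * sum((t - jack_mean) ** 2 for t in leave_one_out)
    # Bias-corrected point estimate
    estimate = n * statistic(data) - (n - 1) * jack_mean
    return estimate, math.sqrt(variance)

# Hypothetical measurements, for illustration only
values = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
estimate, standard_error = jackknife(values, statistics.mean)
print(estimate, standard_error)
```

For the sample mean, a linear statistic, the jackknife standard error coincides with the familiar s/√N, which makes the mean a convenient check of the implementation.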

Bootstrap vs jackknife: the winner is…

The jackknife is an acceptable approximation of the bootstrap only for linear statistics



Do you like Monte Carlo?

Monte Carlo resampling

• Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to compute their results. They provide approximate solutions to mathematical problems by sampling experiments.

• Because of their reliance on repeated computation and random or pseudo-random numbers, Monte Carlo methods are most suited to calculation by a computer, being chosen when it is infeasible or impossible to compute an exact result with a deterministic algorithm.

• These simulation methods are especially useful in studying systems with a large number of coupled degrees of freedom, e.g. fluids, social behaviors, hierarchical models, ….
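A classic toy example of the idea is the Monte Carlo approximation of π: random points are thrown into a unit square, and the share that falls inside the quarter circle is counted. The function name, number of points, and seed below are arbitrary choices for the sketch.

```python
import random

def estimate_pi(n_points=100_000, seed=7):
    """Approximate pi from the share of random points inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()   # a random point in the unit square
        if x * x + y * y <= 1.0:            # does it fall in the quarter circle?
            inside += 1
    # The quarter circle covers pi/4 of the square
    return 4.0 * inside / n_points

pi_estimate = estimate_pi()
print(pi_estimate)
```

The estimate is approximate by construction; its precision improves roughly with the square root of the number of points.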

Monte Carlo resampling

• More broadly, Monte Carlo methods are useful for modeling phenomena with significant uncertainty in inputs (e.g. Bayesian statistical analyses).

• The term Monte Carlo was coined in the 1940s by physicists working on nuclear weapon projects at Los Alamos (Ulam, Fermi, von Neumann, and Metropolis). The name is a reference to the Monte Carlo Casino in Monaco, where Ulam's uncle would borrow money to gamble.

• The use of randomness and the repetitive nature of the process are analogous to the activities conducted at a casino.

Impact of Monte Carlo methods

• Monte Carlo simulation methods do not always require truly random numbers to be useful, although for some applications unpredictability is vital.

• Many of the most useful techniques use deterministic, pseudo-random number sequences, making it easy to test and re-run simulations.

• Monte Carlo resampling simulation takes the mumbo-jumbo out of statistics and enables even beginning students to understand completely everything that is done.

• The application of Monte Carlo methods in teaching statistics is also not new.

Impact of Monte Carlo methods

• What is new and radical is using Monte Carlo

methods routinely as problem-solving tools for

everyday problems in probability and statistics.

• Computationally intensive, but conceptually simple, methods belong at the forefront, whereas traditional analytical simplifications lose importance.

• Monte Carlo simulations are not only relevant for simulating models of interest, but they also constitute a valuable tool for approaching statistics.

Markov chain

• In mathematics, a Markov chain, named after Andrey Markov, is a discrete-time stochastic process with the Markov property.

• Having the Markov property means that, given the present state, knowledge of earlier states is irrelevant for predicting the probability of subsequent states.

• In this sense a Markov chain is "memoryless": the next state depends only on the current state, not on the path taken to reach it.

• At each point in time the system may have changed states from the state the system was in the moment before, or the system may have stayed in the same state. The changes of state are called transitions. If a sequence of states has the Markov property, then every future state is conditionally independent of every prior state, given the present state.
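A two-state chain is enough to see the Markov property at work: the next state is drawn using only the current state. The state labels, transition probabilities, and function name in this sketch are hypothetical.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[current][next]
TRANSITIONS = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.5, "B": 0.5},
}

def simulate_chain(start, n_steps, seed=11):
    """Simulate a Markov chain; each step looks only at the current state."""
    rng = random.Random(seed)
    state = start
    path = [state]
    for _ in range(n_steps):
        options = TRANSITIONS[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(state)
    return path

path = simulate_chain("A", 10_000)
share_a = path.count("A") / len(path)
print(round(share_a, 3))
```

For these transition probabilities the long-run share of time spent in state A should approach 5/6, the chain's stationary probability for A, whatever the starting state.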

Markov chain

• The PageRank of a webpage as used by Google is defined by a Markov chain.

• Markov chain methods have also become very important for generating sequences of random numbers to accurately reflect very complicated desired probability distributions, via a process called Markov chain Monte Carlo (MCMC).

• In recent years this has revolutionised the practicability of Bayesian inference methods, allowing a wide range of posterior distributions to be simulated and their parameters found numerically.

Markov chain Monte Carlo (MCMC)

• Markov chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its stationary distribution.

• The state of the chain after a large number of steps is then used as a sample from the desired distribution. The quality of the sample improves as a function of the number of steps.

• Usually it is not hard to construct a Markov chain with the desired properties. The more difficult problem is to determine how many steps are needed to converge to the stationary distribution within an acceptable error.

Metropolis-Hastings algorithm

• The Metropolis-Hastings algorithm is a method for creating a Markov chain that can be used to generate a sequence of samples from a probability distribution that is difficult to sample from directly.

• This sequence can be used in MCMC simulation to approximate the distribution (i.e., to generate a histogram), or to compute an integral (such as an expected value).

• The Gibbs sampling algorithm is a special case of the Metropolis-Hastings algorithm which is usually faster and easier to use but is less generally applicable in physics (it is, however, the most common choice in Bayesian analysis).
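A minimal random-walk version of the algorithm is sketched below. The target (a standard normal, known only up to a constant), the uniform proposal, the step size, and the burn-in length are all assumptions chosen to keep the example short; it is not a tuned or general-purpose sampler.

```python
import math
import random
import statistics

def log_target(x):
    """Log of an unnormalised standard normal density (illustrative target)."""
    return -0.5 * x * x

def metropolis_hastings(n_samples, step=2.5, burn_in=1000, seed=5):
    """Random-walk Metropolis-Hastings sampler with a symmetric proposal."""
    rng = random.Random(seed)
    current = 0.0
    samples = []
    for i in range(n_samples + burn_in):
        proposal = current + rng.uniform(-step, step)
        log_ratio = log_target(proposal) - log_target(current)
        # Accept with probability min(1, target(proposal) / target(current))
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        if i >= burn_in:                    # discard the early, unconverged steps
            samples.append(current)
    return samples

samples = metropolis_hastings(20_000)
sample_mean = statistics.mean(samples)
sample_sd = statistics.stdev(samples)
print(round(sample_mean, 2), round(sample_sd, 2))
```

Because only the ratio of target densities enters the acceptance step, the normalising constant of the target is never needed, which is what makes the approach attractive for Bayesian posterior distributions.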

Gibbs sampler

• Gibbs sampling is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables.

• The purpose of such a sequence is to approximate the joint distribution, or to compute an integral (such as an expected value).

• Gibbs sampling is a special case of the Metropolis-Hastings algorithm, and thus an example of a Markov chain Monte Carlo (MCMC) algorithm. More specifically, it is a form of MCMC in which values Xn at successive sites n are updated using the full conditional distribution of Xn given the values Xm at all other sites m ≠ n. Successive sites n may be chosen systematically or randomly.

• It is also known as the heat bath sampler.
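For two variables the scheme reduces to alternately drawing each one from its conditional distribution given the other. The sketch below assumes a standard bivariate normal target with correlation 0.6, for which both full conditionals are normal; the function names and settings are illustrative only.

```python
import math
import random
import statistics

def gibbs_bivariate_normal(n_samples, rho=0.6, burn_in=500, seed=9):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    conditional_sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    draws = []
    for i in range(n_samples + burn_in):
        x = rng.gauss(rho * y, conditional_sd)   # draw X given the current Y
        y = rng.gauss(rho * x, conditional_sd)   # draw Y given the new X
        if i >= burn_in:
            draws.append((x, y))
    return draws

def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

draws = gibbs_bivariate_normal(20_000)
estimated_rho = sample_correlation(draws)
print(round(estimated_rho, 2))
```

Every draw is accepted, which is why Gibbs sampling is convenient whenever the full conditional distributions are easy to sample from.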

One of my favourite articles


A pivotal work based on Monte Carlo Methods

Concato et al, JCE 1995

Software

• BoxSampler (add-in for Excel)

• C

• DDXL

• R

• Fortran

• Lisp

• Pascal

• Resampling Stats (add-in for Excel)

• RiskAMP

• S

• S-Plus

• SAS

• SimulAr

• StatXact

Questions?

Take home messages

• Resampling methods are being increasingly used given the major breakthroughs in computational power.

• Among cross-validation, jackknife, and bootstrap, the latter is the most powerful and robust tool for statistical inference and validation.

• Monte Carlo simulation methods have become more and more common for estimating the bias of different statistical procedures and for complex Bayesian models.

• In all but very exceptional cases, resampling methods are best left in the hands of a statistical professional, working together with the clinician in order to achieve the clinician’s goal while safeguarding validity.

And now a brief break…

For further slides on these topics please feel free to visit the

metcardio.org website:

http://www.metcardio.org/slides.html
