
STATISTICS IN MEDICINE
Statist. Med. 2001; 20:2537–2548 (DOI: 10.1002/sim.727)

Theory and practice in medical statistics

Peter Armitage∗,†

2 Reading Road, Wallingford, Oxon OX10 9DP, U.K.

SUMMARY

The discipline of statistics generally, and that of medical statistics in particular, has evolved through two traditions, the theoretical and the practical. Their history, traced briefly here, shows recurrent points of contact and a process of gradual merging which is now almost complete. However, the current tendency to build complex mathematical and computing models may lead to over-confidence in their conclusions, when some crucial aspects of the data cannot easily be modelled satisfactorily. Illustrations are drawn from the use of formal stopping rules in the monitoring of clinical trials, and from the treatment of missing data and incomplete compliance with the trial protocol. Copyright © 2001 John Wiley & Sons, Ltd.

1. INTRODUCTION

My title must surely seem presumptuous. Since retirement I have contributed nothing to statistical theory and have indulged in very little statistical practice. Anything that I have to say will therefore be heavily influenced by the past, and will be in no way innovative. I want, though, to reflect a little on the interacting roles of theory and practice in the history of medical statistics, and to enquire as to their effect on what we do today. Later in the talk I shall comment on the implications of these general issues for two topics of current interest in the conduct of clinical trials: data monitoring procedures, and adjustments for missing readings and non-compliance.

Much of what I have to say will relate to statistics as a whole. Michael Healy gave a lecture [1] entitled ‘Does medical statistics exist?’ – a bold choice for an inaugural lecture by a Professor of Medical Statistics! Fortunately, he decided somewhat grudgingly that his chosen discipline did indeed exist. Medical statistics certainly has some odd features, and 50 years ago it must have seemed to occupy a rather isolated position. However, now it is firmly embedded within its parent discipline of statistics, and it is increasingly difficult to draw boundaries to distinguish the part from the whole.

∗ Correspondence to: Peter Armitage, 2 Reading Road, Wallingford, Oxon OX10 9DP, U.K.
† E-mail: [email protected]


Some time ago, a colleague of mine was spending a sabbatical year overseas, and I had occasion to ask him about a statistician working in his host department. My friend responded by saying that he was ‘an excellent man: one of the few people there who did not let Hilbert space come between himself and the comparison of two means’. Apart from the witticism, this remark illustrates vividly the clash between two traditions, the theoretical and the practical. Many statisticians have tended to align themselves strongly in one or the other direction, but perhaps even more have tried to keep a foot in each camp. We should remember also that any boundaries that might be drawn between the two traditions may be extremely evanescent. Ninety years ago, the use of the t distribution to compare two means would have been regarded as a theoretical and perhaps avant-garde device, in spite of the fact that its pseudonymous author, ‘Student’, was in fact W.S. Gosset, an industrial chemist employed by Guinness breweries.

As I shall indicate later, we can trace two different paths through the history of statistics – the theoretical and the practical – but it would be a mistake to think that these are clearly separated. In Marcel Proust’s long novel À la Recherche du Temps Perdu, the narrator recalls the walks he took as a child, along two different ways. One way led past the grand house of a local aristocrat, the other past the house of a family friend associated with commerce, art and various less praiseworthy pursuits. This contrast turned out to be an allegory of the two life-styles between which the narrator oscillated in later life. He came to realize, though, that the two paths were connected in ways he had not realized as a child, just as the two worlds of the aristocracy and the lower orders mingled in real life. It is the same with the theoretical and practical routes through statistics. They interconnect and influence each other in countless ways.

All disciplines have a theory of some sort. The natural sciences progress on the basis of observation and experiment, and theory is needed as a body of ideas knitting together the observational facts, providing coherence and the ability to predict future findings. Pure mathematics might be thought to be all theory, but again it is underpinned by a body of general concepts unifying and implying specific results that would otherwise be isolated. Applied mathematics has its own formidable body of theory (statics, dynamics, electromagnetism etc.), but this is accompanied by observational and experimental findings binding the theory to the real world. The fine arts possess theory of a sort, but this tends to be essentially a description of what has been done in the past; it is a mark of a great musician, writer or painter to be able to break the rules and produce work that would have been utterly unpredictable: no theory at the time could have predicted a Schönberg or a Kandinsky. Revolutions occur also in the natural sciences, but these are concerned with the approach to truth rather than (as in the arts) to beauty. Medicine, engineering and architecture come somewhere between the sciences and arts; their theory is partly scientific and partly based empirically on precedent and convention.

Statistics is based on mathematics, and its theory shares some of the general features of mathematical theory – theorems, proofs and the like. It has also a broader basis of general principles of a non-mathematical kind, to do with reliability of data, integrity of data presentation, subservience to the scientific purposes of an investigation, and so on. Indeed, the whole question of the place of mathematics in statistics is part of this broader ‘meta-theory’ of our subject. There is an interesting series of papers and discussion on ‘Statistics and mathematics’ in a recent issue of The Statistician [2].


2. THEORY AND PRACTICE IN THE HISTORY OF STATISTICS

I want to trace, in a very sketchy way, some features of the historical development of these two traditions of our subject, indicating some of their interconnections. The theoretical tradition is best seen in relation to statistics as a whole, since the common elements have been much more influential than those peculiar to particular fields of application. The practical tradition, on the other hand, is highly subject specific, and I shall concentrate on the medical arena, rather than those of, say, social or agricultural applications. I should add that my historical observations derive almost entirely from secondary sources, in particular from the historical accounts by Stigler [3] and Hald [4, 5].

2.1. The theoretical tradition

The important early developments in probability theory came in spurts in the 16th, 17th and early 18th centuries. The motivation came from games of chance with dice or cards, and from lotteries. Cardano’s early work, in the mid-16th century, was followed a century later by the correspondence of Pascal and Fermat, and Huygens’s contemporary work. Then, after another half-century, we have the consolidating work of Montmort, James and Nicholas Bernoulli and De Moivre.

The 17th century work could be said to be related to practice, since it was inspired, and no doubt subsidized, by the aristocratic gamblers of the day. It was, however, quite independent of any data. The probabilities involved were determined logically by the symmetry of the dice and the equiprobability of the cards after shuffling. The 18th century workers, by contrast, reveal an awareness of quantitative scientific observations, especially in astronomy and demography. The calculus of probability was developed, still for discrete distributions but now allowing the players of games of chance to have different degrees of skill. Thus, we have the general binomial distribution, with the law of large numbers, from James Bernoulli, and its normal approximation from De Moivre. We also see the beginning of an interest in the subjective interpretation of probability, to be formalized later in the century by Bayes and Laplace.

Most of these early authors were devoutly religious, and theological implications are frequently cited. Pascal’s famous wager about the existence of God is essentially an exercise in decision theory; a pious life is to be preferred to a worldly one because it has infinite utility if God exists. Much debate followed the earliest known significance test, by Arbuthnot in 1712, who performed a sign test on the data showing that male births exceeded female in 82 successive years. He claimed that the departure from ‘chance’ – the null hypothesis of equal probabilities for male and female – proved the existence of divine providence, whereas others pointed out that chance could still exist even though the probabilities were unequal.
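In modern notation, Arbuthnot's argument is a one-sided sign test, and the arithmetic is quickly set out; the following fragment is my own reconstruction (the null value p = 1/2 is as described above, and the decimal evaluation is mine):

```latex
% Arbuthnot's sign test in modern notation. Under H0, each year
% independently has probability 1/2 of showing an excess of male births.
\[
  P(\text{male excess in all } 82 \text{ years} \mid H_0)
    = \left(\tfrac{1}{2}\right)^{82} \approx 2.1 \times 10^{-25},
\]
% a tail probability so small that Arbuthnot rejected `chance'.
```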

Fienberg [6] sees two strands emerging in the period from 1750 to 1820. The first is the development of statistical inference via inverse probability, by Bayes and, independently, Laplace. Laplace’s principle of indifference led to flat priors, and thus a posterior distribution proportional to the likelihood. The central limit theorem led to approximations by the normal distribution, which had been derived by Gauss in its own right rather than merely as an approximation to the binomial. The use of the normal distribution with calculated standard errors thus led to methods of inference in which it was often difficult to distinguish between inverse and direct probability (what we should now call Bayesian and frequentist methods).
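The step from the principle of indifference to ‘posterior proportional to likelihood’ is a one-line application of Bayes’ theorem; the notation below is mine, not Laplace’s:

```latex
% Bayes' theorem with a flat (uniform) prior p(theta) = const:
\[
  p(\theta \mid x)
    = \frac{p(x \mid \theta)\,p(\theta)}{\int p(x \mid \theta')\,p(\theta')\,d\theta'}
    \;\propto\; p(x \mid \theta) \quad \text{when } p(\theta) \propto 1,
\]
% so inverse probability (the posterior) and direct probability
% (the likelihood) coincide up to a normalizing constant.
```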


The second strand was the combination of observations to estimate scientific constants, especially in astronomy. Legendre invented least squares, Laplace exploited the method, and Gauss showed that if the errors were normally distributed, with a flat prior, least squares maximized the posterior probability (and hence gave the maximum likelihood solution). Thus, what Stigler called the Gauss–Laplace synthesis brought together the foundations of statistical inference and the techniques for handling the normal linear model, and provided the basis for the theoretical developments of the 19th century. These included much detailed work on distributional theory, sometimes in relation to specific areas of application, but often by mathematicians otherwise not closely involved with practical data analysis.
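Gauss’s observation is worth spelling out, since it is the hinge on which the synthesis turns. The sketch below is a standard modern reconstruction, not Gauss’s own formulation:

```latex
% With observations y_i = f(x_i, theta) + e_i, errors
% e_i ~ N(0, sigma^2) independent, and a flat prior on theta,
% the posterior is proportional to the likelihood:
\[
  p(\theta \mid y) \;\propto\;
  \prod_{i=1}^{n} \exp\!\left\{ -\frac{(y_i - f(x_i,\theta))^2}{2\sigma^2} \right\}
  = \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i,\theta))^2 \right\}.
\]
% Maximizing the posterior (equivalently, the likelihood) is therefore
% the same as minimizing the sum of squared residuals: least squares.
```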

The most important conceptual development taking place in the last quarter of the century was the work on correlation and regression by Galton, Pearson, Edgeworth and Yule. This so closely epitomizes the linkage between theory and practice that I could equally well have included it in the ‘practical’ side of my account. However, it should be included here in the ‘theoretical’ account, because of its influence on statistical theory throughout the present century. Galton provided the biological motivation, through his interest in anthropology and heredity, Edgeworth and Yule were interested especially in social and economic applications, and Pearson was involved in everything. One interesting point is the lack of awareness of work on least squares done throughout the century. In the early work on the combination of observations the random variation arose from errors of observation and was relatively small. Galton and his colleagues had essentially the same model, but the random variation was caused principally by the inherent biological variability and was proportionately larger. They either did not spot the connection or perhaps, as Hald implies, were unaware of the research published by continental workers. At any rate, the biometric school stimulated the outpouring of statistical theory that was to follow during the first half of the 20th century.

The central figure during this period was, of course, R.A. Fisher, and I do not need to summarize his achievements here. Two points should perhaps be noted. Fisher was the first person to abandon the reliance on inverse probability, and to build a theory of statistical inference explicitly on frequentist methods – even though his fiducial theory purported to allow inductive statements about the probabilities of hypotheses without the need to introduce priors. Secondly, his astonishing corpus of theoretical results, especially in the theory of estimation, was constantly stimulated by his concern with biological research, particularly in agriculture and genetics. He is not only the predominant statistician of the 20th century, but is also perhaps the only statistician to stand as a leader in his field of application.

I shall return to the question of the interplay between theory and practice at the present time, but I now want to retrace my steps by following the other tradition, based primarily on practice rather than theory, concentrating on medical applications.

2.2. The practical tradition in medical statistics

Any account of practical statistics, whether restricted to medical applications or not, must give pride of place to John Graunt’s Observations upon the Bills of Mortality, published in 1662, a decade after the Pascal–Fermat correspondence. Graunt knew nothing of this, and would in any case have shown little interest in it. His approach was wholly non-mathematical, yet his analyses show great ingenuity and sensitivity to the quality of the data.

The Bills of Mortality were weekly and yearly records of deaths, by cause, in London, during a period that included several outbreaks of plague. Additionally, Graunt had records of christenings in London, and of deaths, weddings and christenings in some other places. The main deficiencies of the data were the absence of age at death and the lack of population size. Graunt makes various shrewd inferences from fluctuations in ratios, for instance the ratio of burials to christenings, which peaks in epidemic years. He overcomes the lack of data on population size by making his own estimate by three different methods: from christenings, using an estimate of the mean number of children per mother; from the deaths, doing a separate small survey to estimate the number of deaths per family; and from a map which enabled him to estimate the number of families in the area. This was the first attempt to estimate empirically the population of London – 400 000–500 000 at that time.
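The logic of Graunt’s three estimates can be made concrete with a toy calculation. The sketch below is mine, and every number in it is purely illustrative (chosen only to land near the range just quoted); it is in no sense a reconstruction of Graunt’s actual figures.

```python
# Toy illustration of the logic of Graunt's three population estimates.
# Each method estimates the number of families, then scales by family size.
# All inputs are hypothetical, not Graunt's data.

persons_per_family = 8             # hypothetical mean household size

# Method 1: from christenings (births per family per year)
christenings_per_year = 14_000     # hypothetical
births_per_family_per_year = 0.25  # hypothetical: one birth per family every 4 years
families_1 = christenings_per_year / births_per_family_per_year

# Method 2: from deaths, using a small survey of deaths per family
deaths_per_year = 16_000           # hypothetical
deaths_per_family_per_year = 0.3   # hypothetical survey result
families_2 = deaths_per_year / deaths_per_family_per_year

# Method 3: from a map, counting families directly
families_3 = 55_000                # hypothetical count

for label, fams in [("christenings", families_1),
                    ("deaths", families_2),
                    ("map", families_3)]:
    print(f"{label:>12}: ~{fams * persons_per_family:,.0f} people")
```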

Graunt also laid the foundations for the calculation of a life table. This is the more remarkable when it is recalled that he lacked information on age at death and population size. However, he makes cunning use of the stated causes of death. He estimates the mortality rate at 0–6 years by using the causes of death peculiar to young children and relating those to the numbers of christenings. He then uses the cause ‘Aged’ to estimate the proportions of deaths over 65 and 75 years. This gives him life table survival rates at three ages, 6, 65 and 75 years, and he then interpolates between them to complete his life table. All this, of course, is highly speculative, but it provides the basis on which Halley and others built in later years. Greenwood [7] remarks that Graunt’s ‘brutal common sense’ was ‘one of the qualities which made [him] a pioneer’. Greenwood continues: ‘Making the best the enemy of the good is a sure way to hinder any statistical progress’.

Graunt’s work is important not only in its own right, but also in stimulating the collection and analysis of social and demographic data throughout the succeeding centuries. Graunt’s friend William Petty shared his interest in demographic data, but was concerned more widely with economic and social statistics and with the organization of appropriate institutions. He advocated a national scientific society, later to come to fruition as the Royal Society, and a national statistical office. He is quoted as saying ‘God sent me the use of things, and notions, whose foundations are secure and the superstructures mathematical reasoning; for want of which props so many Governments doe reel and stagger’.

Edmond Halley, the astronomer, a generation after Graunt was able to construct the first authentic life table by using data for the city of Breslau, where there were registers of births and deaths classified by age and sex, although no population size estimates. Halley’s life table, based on numbers of births and deaths, automatically provided a population size estimate. Halley pointed out the importance of his results for life insurance, where there had been considerable uncertainty about the price to pay for an annuity, and this led to the development of the insurance mathematics of De Moivre and Simpson in the 18th century.

Apart from these developments in actuarial theory and practice, the 18th and early 19th centuries are disappointing. Greenwood describes a number of statistical commentaries by various British and continental physicians and demographers, but none of this work is as memorable or as innovative as that of their 17th century predecessors. However, one demographic feature fascinated many of the theorists throughout this period: the ratio of male to female births.

Graunt had demonstrated a slight excess of male over female births, at least in London and Romsey. I have already mentioned Arbuthnot’s significance test on the series of 82 successive years with an excess of males over females. There followed a series of contributions from Nicholas and Daniel Bernoulli, De Moivre, Struyk, Süssmilch, Buffon, Laplace, and, in the early 19th century, Poisson and Cournot. These authors were concerned with the variability of the probability of a male birth, over time and between geographical regions, and made use of the binomial distribution. Later workers became interested in possible variation between families, but here the data are difficult to interpret because the decision whether or not to continue with family building may depend on the balance of sexes achieved so far. A large data set published by Geissler [8] in 1889 continues to stimulate research [9–11].

The major British figure in the mid-19th century was William Farr, who held the modest position of compiler of abstracts in the General Register Office, but in fact wrote many of the Registrar General’s reports and played a leading role in national and international statistics. Farr had the advantage of being able to work with census data and with death registrations incorporating cause of death and occupation, much as at present except that the data were less reliable than nowadays. As Hacking [12] wrote, Farr ‘was no great mathematician, but he did know how to prepare a life table’. His reputation rests not only on his compilations of national vital statistics, but on his penetrating epidemiological inferences from them and on his polished literary style. Describing a life table starting with 100 000 births, he writes:

‘The mental faculties, ripened and developed by experience, will not protect the frame from the accelerated and insidious progress of decay; the toil of the labourer, the wear and tear of the artisan, the exhausting passions, the struggles and strains of intellect, and, more than all these, the natural falling off of vitality, will reduce the numbers to 9398 by the age of 80. ... After the age of 80 the observations grow uncertain; but, if we admit their accuracy 1140 will attain the age of 90; 16 will be centenarians; and, of the 100 000, one man and one woman – like the lingering barques of an innumerable convoy – will reach their distant haven in 105 years, and die in 1945’.

Farr’s importance lies also in the fact that his reports stimulated and supported the social reformers of the 19th century, such as Florence Nightingale.

There are two important continental workers who had quite different connections with Farr. Adolphe Quetelet, a Belgian mathematician and astronomer, is said to have introduced Gauss to celestial mechanics before turning his own interests towards statistics. He was a man of enormous energy, playing a leading role in the founding of many national and international statistical societies, including the Statistical Society of London (later to become the Royal Statistical Society (RSS)). Hence the connection with Farr, another RSS pioneer. Quetelet developed a passion for social statistics, and related medical issues. He was not a great theorist, but his work had considerable impact through his advocacy of the normal distribution to describe real data rather than merely as the outcome of the central limit theorem. His notion of the ‘average man’ became famous, and these two concepts formed the basis of present-day ideas of reference distributions.

Less well known is Pierre-Charles-Alexandre Louis, a French physician who sought to put medicine on a sounder footing by reliance on empirical and controlled observations rather than authority. He described this approach as ‘the numerical method’. It gained many adherents, including Farr, and many opponents. His methods were arithmetic rather than mathematical, and Greenwood [13] asks ‘what might have been the future of clinical statistics if Louis had secured the collaboration of ... Poisson. ... [We] might have had a French school of clinical statistics which would have led the world’. A good example, this, of the gap between theoretical and practical statistics.

As I suggested earlier, the gap really closed with the advent of Francis Galton, Karl Pearson and W.F.R. Weldon in their studies of heredity and biological variation, which led to the establishment of Pearson as the leading British statistician and the foundation of the biometric school and of the journal Biometrika. It is true that Pearson had an incomplete view of heredity, and that the reconciliation of the biometric and Mendelian theories had to await Fisher’s 1918 paper [14], but the fusion of theoretical and practical work, as illustrated by the early volumes of Biometrika, is truly impressive, however quaint much of the work now seems. The applications covered social and medical, as well as biological, problems, and they formed a nourishing environment for the young Major Greenwood, who had recently qualified in medicine but found himself much more attracted to laboratory research than to the bedside.

The closure of the gap between theory and practice was of course confirmed by R.A. Fisher, who could not reasonably be faulted in either respect. How could anyone separate them when the foremost statistical theorist was also a great biologist? Yet the very success of Fisher’s penetrating theoretical work stimulated some to develop the theory on more formal lines, so that it became what has been called ‘statistical mathematics’ rather than ‘mathematical statistics’. In medicine the pragmatic tradition continued in this country under Greenwood and Bradford Hill. They both retained strong interests in vital statistics and epidemiology, Greenwood with the touch of an amateur mathematician, Hill with a genius for research strategy and an arithmetician’s intuitive grasp of the characteristics of numerical data. Greenwood’s near-contemporary in the United States, the biologist Raymond Pearl, had also visited Pearson, and became widely known for his work on population studies. He published on virtually every conceivable field of medical application, and was greatly influential in the development of biometric teaching in the United States.

2.3. The present position

I have drawn a picture of two developing traditions, interacting from time to time and gradually moving together at the turn of this century. It is, of course, oversimplified, but perhaps not unduly so. During the first half of this century it would have been quite common to find statisticians employed in government, concerned primarily with descriptive and administrative uses of statistics, whose training involved little or no theory (there were, after all, very few university courses) and whose interest in theoretical developments was negligible. Now, even in posts in official statistics, one would expect to find a knowledge and appreciation of theoretical work, a recognition of its importance, and an ability to apply new developments. Numbers of posts in medical and pharmaceutical research have grown enormously, and their incumbents probably have as wide an interest in theoretical developments as many academic statisticians, if perhaps not as deep an involvement. Moreover, those whose interests are primarily theoretical find themselves welcome partners in applied research projects, together with more applied statisticians and medical researchers, and increasingly derive inspiration for research from the applied problems they encounter.

I believe, therefore, that the present position is a healthy one. Nevertheless, many commentators, while recognizing the liaison I have referred to, are concerned about what they see as the over-mathematization of our subject. I mentioned earlier the discussion published in The Statistician on ‘Statistics and mathematics’. In the discussion following the four papers, Priestley [15] summarized the consensus as follows: ‘... mathematical language is indispensable, mathematical techniques are extremely valuable, mathematical frills are of dubious value, mathematical indulgence is of little value and mathematical window-dressing is of no value’. As Priestley points out, ‘the crucial point is where one draws the lines between the various categories, and on this there will be a wide divergence of opinion’. I am not myself too concerned about over-mathematization. Frills and window-dressing can be ignored most of the time, and natural selection will determine which results survive and find their way into statistical practice.

There are, though, two related sources of concern. The first is the apparently irreversible trend towards the mathematization of journals that set out initially to be intelligible to the less academic practitioners. We all recognize examples of this trend. The second concern is that the growth of the theoretical infrastructure of our subject – and this includes the increasingly important contributions from computing – leads to a tendency to fit more and more theory into a one-year graduate course, inevitably at the expense of practical issues. Perhaps the solution is to aim at courses longer than one year, or, if this is not possible, to supplement the initial groundwork by more in-house training during the early years of employment. I am aware that many such courses exist and that they are well supported by many employers, for example in the pharmaceutical industry.

We find ourselves in a period of rapid development, in which mathematical models and computer simulations deal with more and more subtle aspects of data, and thus provide closer approximations to reality. There is, though, a danger of assuming that a group of techniques, heavily supported by impressive theory, automatically takes care of all possible complexities. I want to discuss, without too much technical detail, some issues that arise in the analysis of clinical trials, with a view to raising matters for consideration rather than providing solutions.

3. SOME ISSUES ARISING IN THE ANALYSIS OF CLINICAL TRIALS

3.1. Interim analysis

From the 1950s onwards, clinical trial investigators have recognized the case for the monitoring of accumulating data. Statisticians [16–20] developed systems of sequential analysis adapted to the needs of triallists, with stopping boundaries preserving a specified type I error probability. These were used mainly in small trials under the control of a single investigator, with a clearly defined primary endpoint.

With the advent of large multi-centre trials, often with prolonged treatment periods, it became clear that interim analyses at discrete time points were much more convenient than a system of continuous monitoring, and methods of group sequential analysis were developed [21, 22]. Group sequential methods accord naturally with another development, the establishment of data and safety monitoring committees (DSMCs) which meet at fairly regular intervals throughout the periods of accrual and follow-up in most large-scale trials. Ideally, the DSMC would have a well-defined set of boundaries, the crossing of which would lead to a recommendation that the investigators should stop the trial. It is fairly widely recognized that such schemes should not be fully prescriptive, in that the decision to stop must depend on various factors not entirely predictable at the outset, and that some degree of flexibility must be allowed [23]. I shall mention some of these factors in a moment. It is, nevertheless, usually regarded as good practice to define a stopping rule in the protocol of the trial, even though it may carry less than full authority.
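The inflation of the type I error that such boundaries are designed to control is easy to demonstrate by simulation. The sketch below is mine, not drawn from the paper: it compares five unadjusted looks at a nominal 5 per cent level with a Pocock-type constant boundary (critical value of about 2.413 for five looks at an overall two-sided 5 per cent, a standard tabulated figure).

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_trials = 100_000   # simulated trials under the null hypothesis
k_looks = 5          # number of interim analyses

# Standardized test statistic at look k: Z_k = S_k / sqrt(k),
# where S_k is a sum of k independent N(0,1) increments.
increments = rng.standard_normal((n_trials, k_looks))
z = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, k_looks + 1))

# Naive monitoring: stop if |Z_k| > 1.96 at any of the 5 looks.
naive_reject = (np.abs(z) > 1.96).any(axis=1).mean()

# Pocock-type rule: a constant, more stringent critical value.
pocock_reject = (np.abs(z) > 2.413).any(axis=1).mean()

print(f"naive 5 looks at 1.96 : type I error ~ {naive_reject:.3f}")   # roughly 0.14
print(f"Pocock boundary 2.413 : type I error ~ {pocock_reject:.3f}")  # roughly 0.05
```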

I am going to suggest that in some relatively complex trials it may be better not to specify any particular set of stopping rules in the protocol, on the grounds that any chosen system greatly oversimplifies what will in fact determine the DSMC’s recommendation, and that it is better to play the matter by ear. I will illustrate this point of view by referring to the Concorde trial [24, 25] to compare immediate and deferred zidovudine treatment for persons with asymptomatic HIV infection. In early discussions, the DSMC decided not to adopt any formal stopping rule, for the following reasons:

(i) In Concorde, there were three primary endpoints: progression to clinical disease, serious toxicity, and death. Differences between treatment groups in any of these, or in specific components of disease or toxicity, might be a cause for concern and provide a reason for early termination.

(ii) Information becoming available from other studies might lead to termination of Concorde or at least to a change of protocol. In fact, serious consideration was given to the emerging results from an American study, ACTG 019, although the decision was taken to continue with Concorde.

(iii) The protocol drawn up by the steering committee had suggested that termination should be considered only if the results from Concorde or elsewhere might ‘significantly alter clinical practice’ – a stronger requirement than merely evidence against a null hypothesis of zero difference – but the critical level was not defined. Indeed it would be very difficult to define in advance, and might well change with time.

(iv) The outcome of treatment was to be judged over a period of several years. It was, though, conceivable that effects appearing after a short follow-up period might not persist and might indeed be reversed. In this particular trial, such a reversal might have been caused by the emergence of drug resistance in patients given immediate treatment. In the event, a reversal did occur: the early results favoured immediate treatment, in agreement with ACTG 019, but the trend was reversed in the final results.

In a situation like this, it would be naive to think that any formal stopping rule adequately controlled the relevant type I error probabilities, but this is not to say that the theoretical work was irrelevant. It provided a backcloth against which subjective judgements could be made. For instance, the general effect of repeated significance testing and multiple endpoints in enhancing type I error probabilities was well recognized in the Concorde DSMC, and, if emerging efficacy differences had required that termination should be considered, the qualitative effects of repeated testing and multiplicity would have been taken into account. The report of any trial should indicate as clearly as possible the considerations that affected the decisions about continuation or termination, but it may not always be wise to represent these constraints by algorithms. My argument here is that in some complex situations it may be better to recognize that a theoretical model can go only part of the way in simulating reality, than to allow its use to give a false impression of adequacy.

3.2. Protocol deviations and missing data

There can be very few, if any, trials in which all the requirements of the protocol are impeccably fulfilled for every participant. Some of these deviations are often swept under the carpet by the ‘intention-to-treat’ approach, but that may often seem to defeat the purpose of the trial, and in any case it does not easily deal with missing information such as the hypothetical responses of patients who have opted out of the trial. The traditional advice [26] has been that investigators should take every precaution to minimize such lacunae – admirable advice, but not always fully effective. Recent theoretical work has been directed towards adjustments for some of these deficiencies, and I want to comment on some of these approaches.

3.2.1. Missing data. Formulae for the insertion of missing data are amongst the oldest tools of the Fisherian school. Their primary purpose was to enable the least squares estimation of parameters to be carried out by the traditional techniques of analysis of variance, but they also automatically provided estimates of the missing readings themselves. It was always realized, though, that the replacement of missing data was a precarious occupation, since the missingness may have indicated that the reading would have been unusual. In terms that have become familiar in recent years, the methods would be valid if the data were missing ‘completely at random’ (the missingness being unrelated to any other data), or ‘at random’ (missingness dependent on some covariates, but not otherwise on the intended response). Under these same conditions, more elaborate methods have now been devised, such as multiple imputation, where some of the uncertainties are covered by resampling methods. Here, care needs to be taken to ensure that the resampling properly reflects the relationship between the response and the predictive covariates.
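These verbal definitions have a standard formalization, due to Rubin, which may help to fix ideas; the notation below is the now-conventional one, not the paper’s:

```latex
% Let Y = (Y_obs, Y_mis) be the intended data and R the indicator
% of which components are actually observed.
\begin{align*}
  \text{MCAR:}\quad & p(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = p(R)
      && \text{missingness unrelated to any data;}\\
  \text{MAR:}\quad  & p(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = p(R \mid Y_{\mathrm{obs}})
      && \text{depends only on observed values;}\\
  \text{MNAR:}\quad & p(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}})
      \text{ depends on } Y_{\mathrm{mis}}
      && \text{the `non-ignorable' case discussed below.}
\end{align*}
```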

The methods of Lavori et al. [27] for multiple imputation in trials with repeated measures, based on the concept of ‘propensity scores’ for missingness, seem to undervalue the predictive value of readings on the same individual but on different occasions, as compared with readings on other individuals with similar propensity. The SOLAS 2.0 package (Statistical Solutions Ltd, Cork, Ireland) provides for model-based prediction using selected covariates, including an individual’s readings on other occasions, as well as for the propensity method of Lavori et al.

However, such discussions seem relatively pedantic in view of the fact that missing data are more likely to be ‘non-ignorable’; that is, the missingness is likely to be related to the hypothetical missing reading itself. An obvious example is the failure of a patient to appear at a clinic visit because of an exacerbation of illness which would have been likely to affect the clinical observations and test results. At first sight this seems an insoluble problem. How can we estimate the extent to which missingness depends on the intended observation, when the latter is unknown?

Diggle, Kenward and others [28, 29] have shown that some progress can be made by modelling the missingness mechanism and the distributional form of the response. However, as Kenward [29] emphasizes, the result may depend crucially on features of the model that may be impossible to verify from the data under analysis. The problem seems to call for sensitivity analysis, to assess the robustness of conclusions to a range of plausible assumptions.
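One simple form such a sensitivity analysis can take is a ‘delta adjustment’: impute under an assumption close to missing-at-random, then shift the imputed values for dropouts in one arm by a range of offsets representing how much worse they might have fared, and see whether the conclusion survives. A minimal sketch of the idea follows; the simulated data, the imputation by group means and the grid of offsets are all illustrative assumptions of mine, not a method from the papers cited.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Illustrative trial data: outcome (higher = better), with 25% dropout.
n = 200
group = rng.integers(0, 2, size=n)             # 0 = control, 1 = treated
y = 10 + 2.0 * group + rng.standard_normal(n)  # true treatment effect = 2
missing = rng.random(n) < 0.25
y_obs = np.where(missing, np.nan, y)

# MAR-like starting point: each group's observed mean.
group_means = [np.nanmean(y_obs[group == g]) for g in (0, 1)]

def treatment_effect(delta):
    """Effect estimate if TREATED dropouts' outcomes were (group mean + delta);
    control dropouts stay at their group mean (the MAR-like reference)."""
    y_imp = y_obs.copy()
    y_imp[missing & (group == 0)] = group_means[0]
    y_imp[missing & (group == 1)] = group_means[1] + delta
    return y_imp[group == 1].mean() - y_imp[group == 0].mean()

# Sweep from 'treated dropouts did much worse' back to the MAR reference.
for delta in (-3.0, -2.0, -1.0, 0.0):
    print(f"delta = {delta:+.1f}: estimated effect = {treatment_effect(delta):.2f}")
```

If the estimated effect remains convincing across the plausible range of offsets, the conclusion is robust to this form of non-ignorable missingness; if it tips, the offset at which it does so is itself informative.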

3.2.2. Compliance. The term ‘compliance’ may be used in more than one sense. It may refer to the extent to which the investigators and participants adhere to the rules laid down in the protocol. Non-compliance may then take various forms, including, for instance, incomplete administration of medicines by the patients. In some, perhaps most, trials of drug treatments, the protocol may permit the cessation or modification of drug consumption if there are good reasons, such as the appearance of adverse effects, and modifications of this sort also may be termed ‘non-compliance’.

It is common to adopt an ‘intention-to-treat’ approach, whereby individuals are retained in the groups to which they were randomized, whatever treatment modifications or protocol lapses occur. The comparison is then between two or more therapeutic strategies, based on ideal schedules of treatment, rather than of these ideal schedules themselves.

This approach retains the benefits of randomization, and in my view is usually the best option, but it is sometimes thought to give the right answer to the wrong question. Some trial analysts have therefore hoped to retrieve, from the somewhat untidy ‘intention-to-treat’ data, an estimate of the comparison that would have been reached if there had been 100 per cent compliance. A simple comparison of full compliers in each of two groups would be misleading, because the selective effects might be quite different in different treatment groups, and the residual groups of compliers might differ in important prognostic variables. However, some progress may be made by classifying patients as compliers, or as non-compliers of various sorts, and modelling their responses. Even so, there are still hidden assumptions, such as that the response of a patient on a treatment A is the same whether (s)he was assigned to A, or defaulted to A from another treatment assignment. This may be quite plausible in trials with ‘hard’ endpoints such as survival, but less so for ‘soft’ endpoints such as pain relief. For further discussion see Armitage [30].
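One widely used formalization of this kind of complier modelling is the ‘complier average causal effect’ of instrumental-variable theory (Angrist, Imbens and Rubin), which, under randomization, one-sided non-compliance and precisely the kind of exclusion assumption just mentioned, yields a simple moment estimator. It is offered here as a representative example of the genre, not necessarily the specific models the text has in mind:

```latex
% Z = randomized assignment, Y = response, C = complier status.
% Assumptions: randomization of Z; non-compliance only in the Z = 1
% arm (one-sided); and exclusion: assignment affects Y only through
% the treatment actually received.
\[
  \text{CACE}
  = \mathbb{E}[Y(1) - Y(0) \mid C = 1]
  = \frac{\mathbb{E}[Y \mid Z=1] - \mathbb{E}[Y \mid Z=0]}{P(C = 1)},
\]
% i.e. the intention-to-treat difference divided by the proportion
% of compliers, each quantity estimable from the trial data.
```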

Compliance with drug treatment may be measured by the proportion of assigned dose actually taken by a patient. Several authors [31, 32] have modelled the way in which response depends on this measure of compliance, and used the results to estimate the contrast between treatments with 100 per cent compliance. Here again, there are hidden assumptions, for instance that a high complier with an active drug would also be a high complier with a placebo.

Frangakis and Rubin [33] tackle both these topics – non-ignorable missing data and non-compliance – in the simple setting of a trial of new versus standard treatment with binary responses. Again, we find that progress requires an unverifiable assumption, that missingness is random within a compliance subgroup.

In my comments on both these topics – missing data and compliance – my intention is not to discourage innovative research; rather the opposite. Sensitivity analyses will be especially valuable. I merely want to suggest that the more features of a complex situation that are included in a model, the more one becomes aware of residual deficiencies. Which is, I suppose, another way of saying that research must go on, and that it is increasingly important to marry the two traditions I have been discussing here: theory and practice. Either, without the other, will leave a serious gap in our understanding of medical data; together, they will help to retain the important role of statistics in the interpretation of medical research.

REFERENCES

1. Healy MJR. Does medical statistics exist? Bulletin in Applied Statistics 1979; 6:137–182.
2. Royal Statistical Society. Statistics and mathematics, papers by P. Sprent, D.J. Hand, S. Senn, and R.A. Bailey, with discussion. Statistician 1998; 47:239–290.
3. Stigler SM. The History of Statistics: The Measurement of Uncertainty before 1900. The Belknap Press of Harvard University Press: Cambridge, Massachusetts, 1986.
4. Hald A. A History of Probability and Statistics and their Applications before 1750. Wiley: New York, 1990.
5. Hald A. A History of Mathematical Statistics from 1750 to 1930. Wiley: New York, 1998.
6. Fienberg SE. A brief history of statistics in three and one-half chapters: a review essay. Statistical Science 1992; 7:208–225.
7. Greenwood M. Medical Statistics from Graunt to Farr. Cambridge University Press: Cambridge, 1948.
8. Geissler A. Beiträge zur Frage des Geschlechtsverhältnisses der Geborenen. Zeitschrift des Königlichen Sächsischen Statistischen Bureaus 1889; 35:1–24.
9. Lancaster HO. The sex ratios in sibships with special reference to Geissler’s data. Annals of Eugenics, London, 1950; 15:153–158.
10. Edwards AWF. An analysis of Geissler’s data on the human sex ratio. Annals of Human Genetics 1958; 23:6–15.
11. Lindsey JK, Altham PME. Analysis of the human sex ratio by using overdispersion models. Applied Statistics 1998; 47:149–157.
12. Hacking I. Prussian numbers 1860–1882. In The Probabilistic Revolution, Vol. I, Ideas in History, Krüger L, Daston LJ, Heidelberger M (eds). Massachusetts Institute of Technology Press: Cambridge, Massachusetts, 1987; 377–394.
13. Greenwood M. The Medical Dictator and Other Biographical Studies. Williams and Norgate: London, 1936.
14. Fisher RA. The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh 1918; 52:399–433.
15. Priestley MB. Contribution to discussion on papers on ‘Statistics and Mathematics’. Statistician 1998; 47:281.
16. Bross I. Sequential medical plans. Biometrics 1952; 8:188–205.
17. Armitage P. Sequential tests in prophylactic and therapeutic trials. Quarterly Journal of Medicine 1954; 23:255–274.
18. Armitage P. Sequential Medical Trials. Blackwell Scientific Publications: Oxford, 1960 (2nd edn 1971).
19. Jones DR, Whitehead J. Sequential forms of the log rank and modified Wilcoxon tests for censored data. Biometrika 1979; 66:105–113; correction Biometrika 1981; 68:576.
20. Whitehead J. The Design and Analysis of Sequential Clinical Trials. Ellis Horwood: Chichester, 1983 (2nd edn 1992).
21. Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika 1977; 64:191–199.
22. O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35:549–556.
23. Piantadosi S. Clinical Trials: A Methodologic Perspective. Wiley: New York, 1997.
24. Concorde Coordinating Committee. Concorde: MRC/ANRS randomised double-blind controlled trial of immediate and deferred zidovudine in symptom-free HIV infection. Lancet 1994; 343:871–881.
25. Armitage P on behalf of the Concorde and Alpha Data and Safety Monitoring Committee. Data and safety monitoring in the Concorde and Alpha trials. Controlled Clinical Trials 1999; 20:207–228.
26. Lewis JA, Machin D. Intention to treat – who should use ITT? British Journal of Cancer 1993; 68:647–650.
27. Lavori PW, Dawson R, Shera D. A multiple imputation strategy for clinical trials with truncation of patient data. Statistics in Medicine 1995; 14:1913–1925.
28. Diggle PJ, Kenward MG. Informative dropout in longitudinal data analysis (with discussion). Applied Statistics 1994; 43:49–94.
29. Kenward MG. Selection models for repeated measurements with non-random dropout: an illustration of sensitivity. Statistics in Medicine 1998; 17:2723–2732.
30. Armitage P. Attitudes in clinical trials. Statistics in Medicine 1998; 17:2675–2683.
31. Efron B, Feldman D. Compliance as an explanatory variable in clinical trials (with discussion). Journal of the American Statistical Association 1991; 86:9–26.
32. Goetghebeur E, Lapp K. The effect of treatment compliance in a placebo-controlled trial: regression with unpaired data. Applied Statistics 1997; 46:351–364.
33. Frangakis CE, Rubin DB. Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika 1999; 86:365–379.
