
Why Generalisability is not Generalisable

LYNN FENDLER

In the United States there is an increasing tendency to view the only educational research worthy of federal funding as that which is designed as an experiment using randomised controls. One of the foundational assumptions underlying this research design is that the results of such research are meant to be generalisable beyond any particular research study. The purpose of this paper is to historicise the assumption of generalisability by explaining the ways in which it is a particularly modern research project. By historicising generalisability, I show the ways in which the current research standards are products of culturally specific historical circumstances. In other words, generalisability is a local phenomenon and not generalisable to other times and places.

In the United States, the current federal standards for educational research are found in publications of the What Works Clearinghouse (US Department of Education, http://www.whatworks.ed.gov/; hereafter, WWC). These research standards are based almost exclusively on the book Experimental and Quasi-experimental Designs for Research by Campbell and Stanley (1963). According to WWC, the only educational research worthy of federal funding is that which can establish causality, in particular, the degree to which an educational intervention (curricular or pedagogical) causes improvement in ‘student achievement’ as measured by standardised test scores.

In order to be able to establish a basis for causality, the research designs that are acceptable to the WWC are ‘a randomized controlled experiment (RCT), a quasi-experiment with matching (QED), or a regression discontinuity design (RD)’ (WWC, Criteria for Evaluation). In each case, the basis for making causal claims must be established by statistical tests designed to eliminate confounding variables from the analysis. When research designs meet these standards, they are called ‘scientific’ by the WWC. This is a particular, and historically specific, definition of science. The type of educational research supported by the WWC is not ‘scientific’ in the sense of seeking a deeper and more complex intellectual understanding of educational phenomena. Rather, WWC research is ‘scientific’ in the sense of establishing policy and efficient management of people and resources.

Journal of Philosophy of Education, Vol. 40, No. 4, 2006

© 2006 The Author. Journal compilation © 2006 Journal of the Philosophy of Education Society of Great Britain. Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.


In the WWC definition of acceptable research, there must be randomised control of participants. The purpose of this design is that its findings are supposed to be generalisable to a larger number of people who have demographic characteristics similar to those ‘represented’ in the study. In that sense, the findings derived from one research setting are supposed to have some relevance in other settings. These standards for research design are gaining favour among funding agencies internationally. Since the publication of these US federal mandates, some university researchers have criticised the standards for being antiquated and narrow in scope, as well as for being both methodologically and ethically inappropriate for research in education.1

This paper extends previous critiques of the WWC research standards by pointing out the problems in their assumptions about what is ‘scientific’. In this examination of generalisability, I explain the ways that the WWC standards for research:

• confound induction and prediction;
• confuse probability with certainty;
• conflate science with social management.

These problems with WWC standards emerge from the assumption that educational research findings are generalisable. On that basis, I argue in this paper that generalisability is:

• an example of inductive reasoning and therefore subject to longstanding analytic critiques of induction;
• a form of stochastic, not deterministic, logic;
• a way of thinking that is historically specific to modernity and linked to modern projects of social governance.

In order to examine generalisability from different perspectives, I have compiled critical arguments from three disparate traditions: classical analytic philosophy, statistical modelling and histories of social science. I explicate each of these perspectives in turn and then conclude by suggesting that generalisability is a limited and local phenomenon, not generalisable to other times and places.

ANALYTIC CRITIQUES OF INDUCTION

In educational research, generalisation is an example of inductive thinking because it is a process that seeks to find an overall pattern across an array of specific examples. Analytic philosophers have examined induction in ways that provide insight about the processes of generalisation. In this section, I summarise classical critiques of induction by David Hume, John Stuart Mill and Bertrand Russell and draw connections between their analyses and current assumptions about generalisability.

David Hume’s sceptical doubts about human understanding provide us with a curious insight into the meaning of induction. Hume characterises induction in a particular way by using the verb in a passive sense. Describing our understanding of patterns in nature, Hume writes, ‘we are induced to expect effects similar to those, which we have found to follow from such objects’ (Hume, 1777/1977, p. 23; § IV, Part 2). Hume’s argument has two major points: 1) we can experience objects, but the relations between objects are inferred, not experienced; 2) reason always renders the same conclusions from the same premises, and reason cannot account for changes in historical circumstances. From those points, Hume characterises induction as a product of custom or habit and distinguishes that from reason or logic. In making this distinction, Hume does not diminish the value of induction-by-custom, but he does say that such induction is not a kind of reason: ‘Custom, then, is the great guide of human life. It is that principle alone, which renders our experience useful to us, and makes us expect, for the future, a similar train of events with those which have appeared in the past’ (p. 29, § V, Part 1). Hume’s contribution to a critique of generalisability is his explanation that generalisations are derived from habit or custom, rather than from reason. According to Hume’s famous example, we become accustomed to the idea that the sun rises every morning; however, our belief that the sun will rise tomorrow is based on our customary experience of the world, not on any abstract principle like the commutative property of addition. The customary nature of generalisation becomes salient when we consider the degree to which generalisation may or may not be able to produce any new ideas or concepts.

John Stuart Mill (1843/1964) argues that the grounds for induction are themselves inductive: ‘To test a generalisation, by showing that it either follows from, or conflicts with, some stronger induction, some generalisation resting on a broader foundation of experience is the beginning and end of the logic of Induction’ (p. 284). Mill’s contribution to a critique of generalisability is his observation that individual facts and general truths are not different in kind, and we derive both in the same way: ‘the process of indirectly ascertaining individual facts is as truly inductive as that by which we establish general truths. But it is not a different kind of induction; it is a form of the very same process’ (p. 275). Mill characterises both data and generalisations as products of induction, and his analysis provides us with a basis for recognising the recursive or reiterative relationship between facts and generalisations.

Bertrand Russell (1914/1964) calls induction a principle and in his explication reiterates Hume’s example of the sun rising every day. But Russell’s critique of induction includes appeals to historical occurrences (synthetic premises). For example, when describing the problem with regarding the rising of the sun as a law of nature, Russell raises the possibility that the earth may one day collide with a large body. In that case, as Russell demonstrates, the prediction of the rising sun is not based on reason, but rather on experience, which proves nothing. In another earthy example, Russell writes: ‘The man who has fed the chicken every day throughout its life at last wrings its neck instead, showing that more refined views as to the uniformity of nature would have been useful to the chicken’ (Russell, 1914/1964, p. 307). Russell’s analysis provides us with a compelling distinction between the probability that induction offers and the certainty of a law of nature: ‘the fact that two things have been found often together and never apart does not, by itself, suffice to prove demonstratively that they will be found together in the next case we examine . . . if they have been found together often enough, the probability will amount to almost certainty’ (p. 308, italics in original).

The WWC research standards rest on the assumption that prediction is based on induction. Granted, inductive findings have some probability for predictive value. However, educational policy is generally not articulated in terms of conditional probability. Policy language attempts to be clearly directive; it is the function of policy to tell people what to do and not to outline conditional alternatives in a philosophical sense. In discourse, there is high predictive value associated with research findings that purport to establish causality. When experimental and quasi-experimental research designs in education are used to inform policy—that is, when research findings are held to be generalisable from one setting to another—that practice confuses induction with prediction. There is also the related problem of certainty—the focus of the next section.

STATISTICAL MODELING AND CLINICAL APPLICATIONS

In the conduct of educational research, the logic of generalisation is made explicit in the mechanisms of statistical analysis. For some insight into these mechanisms, I turn to discussions of statistical modelling, which include both theory (including mathematics and logic) and application (including computer programming and clinical research). Literature on statistical modelling makes careful distinctions among different kinds of generalisability. These distinctions are formulated around problems of contextual differences (sometimes called facets, factors or variables) and around conditions of purpose or application. The following examples will accentuate the importance of understanding generalisability as a stochastic (that is, probabilistic or conjectural) system.

Some theories of statistical modelling include critiques quite similar to Russell’s example of the chicken:

For example, if you want to forecast the price of a stock, a historical record of the stock’s prices is rarely sufficient input; you need detailed information on the financial state of the company as well as general economic conditions, and to avoid nasty surprises, you should also include inputs that can accurately predict wars in the Middle East and earthquakes in Japan (http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-1.html).

In many cases, the methodological discussions about statistical modelling are those that try to explain—and hopefully account for—finer and finer degrees of error. Across a wide range of discussions about statistical modelling, the probabilistic, and therefore undependable, nature of statistical generalisation is emphasised. The stochastic nature of statistical testing is highlighted on several dimensions. For example, study design has been called an art because of the complexities involved in balancing the trade-off between good fit and generalisability (see, for example, Myung, Balasubramanian and Pitt, 2000; Koopman, 1999). Thompson (1989) calls attention to the fact that statistical significance, result importance and result generalisability are three different issues, and he cautions us against confounding these three things. Shavelson (1976; 2003) is famous for promoting statistics-based research in education precisely by emphasising the ways in which activities like teaching are notoriously slippery and difficult to quantify. Citing high-powered research studies, Shavelson and Dempsey-Atwood write: ‘Consistent conclusions from research on teaching are that teacher effects on pupil outcomes are unstable . . . , that teaching acts themselves may be unstable . . . , and that most teaching acts are unrelated to student outcomes’ (Shavelson and Dempsey-Atwood, 1976, p. 554).
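Thompson’s distinction between statistical significance and result importance can be made concrete with a short simulation. The sketch below is purely illustrative and comes from neither the paper nor Thompson; the sample sizes and effect size are invented. It shows that, with a large enough sample, a trivially small difference between two groups comes out ‘statistically significant’ even though the standardised effect size, one common gauge of practical importance, remains negligible.

```python
import math
import random

# Illustrative sketch (invented numbers, not from the paper): two groups
# whose true means differ by a trivially small amount, measured on a very
# large sample.
random.seed(0)
n = 100_000                                             # per-group sample size
group_a = [random.gauss(0.00, 1.0) for _ in range(n)]
group_b = [random.gauss(0.03, 1.0) for _ in range(n)]   # tiny true effect

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

diff = mean(group_b) - mean(group_a)
se = math.sqrt(var(group_a) / n + var(group_b) / n)     # standard error of the difference
z = diff / se
# two-sided p-value from the normal approximation
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

pooled_sd = math.sqrt((var(group_a) + var(group_b)) / 2)
cohens_d = diff / pooled_sd                             # standardised effect size

print(f"p = {p:.6f}")        # small: 'significant' by conventional thresholds
print(f"d = {cohens_d:.3f}") # yet the effect itself is negligible
```

The ‘significant’ p-value says nothing about whether the difference matters in a classroom, which is exactly the confounding Thompson warns against.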

Generalisability theory in education was formalised by Lee Cronbach et al. (1963; 1972) as a way to address the conflict in classical statistical theory between reliability and validity. In classical statistical theory, there is a tension between a good-fitting explanation and a generalisable explanation. Myung, Balasubramanian and Pitt (2000) put it this way: ‘The trademark of a good model selection procedure is its ability to satisfy these two opposing goals. We desire a model that is complex enough to describe the data sample accurately but without overfitting and thus losing generalizability’ (online version).
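The trade-off that Myung, Balasubramanian and Pitt describe between fitting a data sample and generalising beyond it can be sketched in code. The example below is a hypothetical illustration with invented data (nothing here comes from the paper): a two-parameter line fitted by least squares absorbs measurement error and stays close to the underlying process, while an interpolating polynomial with as many parameters as observations fits the training sample perfectly yet strays far from the process at new points.

```python
# Toy, fully deterministic illustration (invented data and models, not from
# the paper). The underlying process is the line y = 2x, observed with small
# alternating measurement errors.
train_x = [i / 8 for i in range(1, 9)]
train = [(x, 2 * x + (0.2 if i % 2 else -0.2)) for i, x in enumerate(train_x)]

# New, error-free observations of the same process, offset from the
# training points (the last one extrapolates slightly).
test = [(x + 0.0625, 2 * (x + 0.0625)) for x in train_x]

def fit_line(data):
    # ordinary least squares for a two-parameter line
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_interp(data):
    # Lagrange interpolating polynomial: one coefficient per data point,
    # so it passes through every training point (zero training error)
    def model(x):
        total = 0.0
        for i, (xi, yi) in enumerate(data):
            term = yi
            for j, (xj, _) in enumerate(data):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return model

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

line, interp = fit_line(train), fit_interp(train)
print("line  : train", mse(line, train), " test", mse(line, test))
print("interp: train", mse(interp, train), " test", mse(interp, test))
```

The interpolant’s training error is exactly zero, but its error on the new points is far larger than the line’s; in Myung, Balasubramanian and Pitt’s terms, the perfect fit has been bought at the price of generalizability.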

Similarly, Koopman (1999), working in the field of epidemiology, emphasises that there are two different types of generalisability:

Scientific generalizability refers to the validity of applying parameter estimates made in a particular statistical target population to other populations. This must be distinguished from statistical generalizability which refers to the validity with which estimates made in a sample population can be applied to the statistical target population. Statistical generalizability so defined relates to what epidemiologists most often call validity. When epidemiologists talk with statisticians, miscommunication is likely because when epidemiologists speak of generalizability they are usually thinking about scientific generalizability while when statisticians speak of generalizability, they are much more likely to be referring to statistical generalizability (online version, §24).

Koopman’s distinction between scientific generalisability and statistical generalisability is indicative of a more general confusion between statisticians’ discourse and practitioners’ discourse. Statistical probability can be strictly valid and yet not address the concerns of practitioners. The parameters of the study may not fit with practitioners’ clinical/classroom parameters, and historical facets may not pertain. Another way to say this is that generalisability is not generalisable from the field of statistics to the field of clinical practice.


However, in the current educational research climate, research is meant to provide direction for policy. For the WWC, ‘scientific’ means useful for informing educational policy. ‘Scientific’ is conflated with administrative or managerial. In a policy discourse, then, the climate is not favourable for intellectual nuance, conceptual subtlety or accommodation of contextual variations. Educational policy discourse converts probability to certainty in the process of decision-making. This managerial feature of educational research is characteristic of modern social science projects (see, for example, Heilbron, 1995; Porter, 1986; Wagner, 1994).

In sum, within statistical theory, generalisation is unquestionably a stochastic process. Therefore, within statistical modelling, there is no basis for trust or certainty in the generalisability of findings; probability is precisely not certainty. However, in the mandates of the What Works Clearinghouse, an assumption of generalisability underlies the requirements for experimental studies that lead to causal explanations and generalisable findings. That assumption is an unfortunate and misguided attempt to generalise from statistical theory to practical application, an attempt that is not warranted by the standards of statistical theory itself. An experimental form of study design has become a required basis for funding research in education. Although statistical modelling is fundamentally probabilistic, the uptake in educational policy endows the findings with a sense of certainty. When educational policy makers base their decisions on the evidence provided by WWC-accepted studies, findings are converted from probabilities to policies; the conversion entails a transformation from statistical generalisability to scientific generalisability. In those conversions, pedagogical practices like Reading First are rendered the ‘gold standard’ of educational practice.

The mobilisation of probabilistic findings as if they were certainties is, ironically, an irrational and unscientific thing to do. As the previous sections show, analytic philosophers have repeatedly emphasised that generalisation is shaped by habit and convention; and statistical theory draws a clear distinction between certainty and probability. Statistical modelling is done in order to obtain progressively closer approximations of causal determination, but statistical tests are stochastic, not deterministic. For both analytic philosophers and statisticians, then, generalisation is an uncertain foundation upon which to base decisions about educational practice (practice of any kind). It is clear that treating probabilities as certainties is an unscientific thing to do. Therefore, the interesting historical question is to figure out how probability-in-theory got converted into certainty-in-practice. In other words, what historical conditions have supported trust in numbers2 and the use of statistical evidence for research in education to the extent that no other kind of educational research is fundable by the US government?

HISTORY OF THE SOCIAL SCIENCES3

The histories of the social sciences provide analyses of the conditions under which generalisability became fashionable (from about the middle of the 19th century).4 Generalisability was supported and nourished by several historical factors that characterise modernity. Some of these conditions include the popularity of measurement as a technology of governing society, the invention of statistics, the proliferation of law-like thinking in social sciences, the conflation of objectivity and generalisability, and examples of cases in which statistics (= state arithmetic) were deployed as justifications for ways of managing people.

The history of statistics, compiled by the statistician Pearson (1978), is illuminating in this respect. Statistik was nothing more than knowledge of statecraft, completely devoid of any trace of mathematics, in the time of Gauss, the author of the normal distribution:

Gottfried A. Achenwall in 1752, for the first time . . . introduced the word ‘Statistik’ as the name of a distinct branch of knowledge. That branch was not concerned with numbers, nor with mathematical theory; it was concerned with statecraft and to a less extent with political economy. . . . From Achenwall sprang the German school of statisticians at Göttingen who in our modern sense were rather political economists (Pearson, 1978).

Meanwhile a concomitant development in the United Kingdom was the founding of the English school of ‘Political Arithmetic’ by John Graunt (1620–1674) and Sir William Petty. They never used the word ‘statistik’ nor called their data ‘statistics’. ‘A hundred years or more later . . . a Scotsman, Sir John Sinclair ‘stole’ the words ‘Statistics’ and ‘Statistik’ and applied them to the data and methods of Political Arithmetic’ (Pearson, 1978). Pearson calls this a ‘bold, barefooted act of robbery’. The normal curve, part and parcel of this bold and barefooted act of robbery, was available then to Adolphe Quetelet to ground his idea of the ‘average man’.

It is Adolphe Quetelet, the Astronomer-Royal of Belgium in the 1830s–40s, who accomplished the inscription of a bell curve in the budding discourse of social theory that objectified man as its object. We can see in Quetelet’s work an example of the kind of thinking that was characteristic of the mid-19th century and that coalesced to comprise the emerging discipline of social science (or the disciplines that fell under that rubric) (Heilbron, 1995; Hilts, 1981). Hacking tells the story this way:

Quetelet changed the game. He applied the same curve to biological and social phenomena where the mean is not a real quantity at all, or rather: he transformed the mean into a real quantity. It began innocently enough. In a series of works of the early 1830s he gave us ‘the average man’. This did not of itself turn averages—mere arithmetical constructions—into real quantities like the position of a star. But it was a first step (Hacking, 1990, pp. 106–107).

After Quetelet, the history of statistics is a story of how a strictly mathematical representation of numerical probability got appropriated by social scientists who then began generating numerical descriptions of populations. Data proliferated as more things about people got counted, numerical descriptions of the ‘average man’ were formulated and revised, immigration and industrialisation increased in complexity and statistics increasingly served as a technology by which government offices could rationalise systems of population management, diagnosis and intervention. The ‘long baking process of history’ has gradually obliterated the debates by early social scientists about the questionable value of applying mathematical formulae to an understanding of human society. When statistical representations made the concept of an ‘average man’ thinkable, that was a necessary step in making the notion of generalisability possible.

At this stage, the relationship of statistics to expectations of human behaviour was still innovative and debatable as a goal for social science. In fact, in a notable irony of history, one of the strongest arguments against the use of statistics to talk about human society was launched by Auguste Comte, French sociologist and political philosopher and the founder of positivism. Comte’s opposition to statistics was voiced at the same time by the French mathematician Louis Poinsot and also on the other side of the English Channel by John Stuart Mill. Poinsot wrote: ‘the application of this calculus to matters of morality is repugnant to the soul. It amounts, for example, to representing the truth of a verdict by a number, to thus treat men as if they were dice, each with many faces, some for error, some for truth’ (quoted in Stigler, 1986, p. 194, italics in original). Mill’s description was even more damning when he referred to statistical work as ‘the real opprobrium of mathematics’ (quoted in Stigler, 1986, p. 195). Comte expressed his own disgust with the whole idea of a mathematical basis for social science by calling the enterprise ‘irrational’:

It is impossible to imagine a more radically irrational conception than that which takes for its philosophical base, or for its principal method of extension to the whole of the social sciences, a supposed mathematical theory, where, symbols being taken for ideas (as is usual in purely metaphysical speculation), we strain to subject the necessarily complicated idea of numerical probability to calculation, in a way that amounts to offering our own ignorance as a natural measure of the degree of likelihood of our various opinions? (Comte, quoted in Stigler, 1986, pp. 194–195).

The application of the normal curve to human sciences was not free of trouble. Careful examinations of Quetelet’s ‘average man’ yielded the observation that the whole statistical exercise was circular. In order to count and compare human characteristics, it is first necessary to specify a characteristic in discrete terms, and second, it is necessary to regard the particular group from whom data were collected as somehow homogeneous—there cannot be too many variables. Quetelet’s first studies entailed fairly unambiguously discrete items including births, deaths, heights and weights. However, it is easy to see that the characteristics themselves are based on preconceived perceptions of differences. In other words, Quetelet’s statistical analyses had no means of testing or questioning the existing characteristics. The analysis is bound by its original categories:

the tool he [Quetelet] had created was too successful to be of use for its original purpose. The fitted distributions gave such deceptively powerful evidence of a stable homogeneity that he could not look beyond them to discover that further subclassification could produce other distributions of the same kind, that some normal distributions are susceptible to dissection into normal components. The method was lacking in discriminatory power; too many data sets yielded evidence of normality. Few new classifications were uncovered; the primary use of the method was as a device for validating already determined classification schemes (Stigler, 1986, p. 215, italics added).

Here, J. S. Mill’s analytic arguments about the recursive nature of induction can be seen in their historical context. Mill was a fan of Comte, and together they deplored the use of statistics to represent human characteristics.

Given the strength of the opposition, it was certainly not inevitable that statistical constructions would eventually be transformed into real and essential populational attributes. But the historical conditions in the latter part of the 19th century were strong enough to overcome the earlier tendencies against describing human characteristics in terms of symbolic entities. The confluence of developments in government, industry, statistics and social sciences fostered yet another transformation:

It was Quetelet’s less-noticed next step, of 1844, that counted far more than the average man. He transformed the theory of measuring unknown physical quantities, with a definite probable error, into the theory of measuring ideal or abstract properties of a population. Because these could be subjected to the same formal techniques they became real quantities. This is a crucial step in the taming of chance. It began to turn statistical laws that were merely descriptive of large-scale regularities into laws of nature and society that dealt in underlying truths and causes (Hacking, 1990, p. 108).

It is the invention of psychology that served as the vehicle by which statistical analyses became acceptable as tools to study human beings. Before the 1880s, researchers regarded various towns and villages as being so idiosyncratic that generalisations across them would be meaningless; there were simply too many extenuating circumstances—from weather, to religion, to custom, language and local history—for it to be possible to apply principles across various cases. But conceptual innovations in statistical theory contributed to overcoming these methodological roadblocks. From Quetelet onward, most statisticians accepted this distinction, even though ‘the distinction was blurred and somewhat circular’ (Stigler, 1986, p. 256). But the distinction became further blurred as mathematicians developed methodological surrogates for experimental control. Francis Galton’s innovation in 1869, which he called ‘statistics by intercomparison’, claimed to be able to measure talent as easily as height, and the argument rested entirely on the logic of analogy:

Galton turned Quetelet’s phenomenon to novel use. If data from the same species arrayed themselves according to this curve and if the unity of the species could be demonstrated by showing that measurable quantities such as stature or examination scores followed the curve, then, once such a species was identified on the basis of measurable quantities, the process could be inverted with respect to qualities that eluded direct measurement! . . . the use to which the scale was put was clear and somewhat ironic. Where Quetelet had used the occurrence of a normal curve to demonstrate homogeneity, Galton based a dissection of the population upon it. Using this inferred scale, he could distinguish between men’s abilities on a numerical scale rather than claim that they were indistinguishable (Stigler, 1986, p. 271).

Galton’s argument by analogy helped to make it possible to consider previously immeasurable qualities as discrete entities that could be counted and graphed. As qualities such as cleverness, morality, wit and civility were appropriated into statistical arrays, these qualities gradually became standardised and reified. As they became more essentialised in discourse, it made more sense to regard those qualities as natural. In that way, the qualities of morality and humour were no longer problematic, but rather they became the categorical bases by which the preconceived classifications of race and gender could be validated.

So it is the 19th-century doubling of man into a subject and object of scientific reflection that saw a transformation in the understanding of normal distribution from a statement about the regularities of arithmetical probability to an insight into the workings of society. This transformation is part of a large-scale modernisation that includes the invention of the social sciences and the advent of the common school.

One consequence of the ‘doubling of man into subject and object’ has been what Hacking (1995) calls the ‘looping effect’ of human kinds. Hacking introduces his concept of looping in the context of a book about causal reasoning. By looping, Hacking means that as categories get constructed from generalisations, causal connections are made to ‘explain’ the categorical phenomenon. In discourse, these categories and causes about human kinds become part of knowledge, of which we humans are both subject and object. Hacking provides examples from psychiatry and paediatrics to illustrate that the relationship between classification and the attribution of cause is a chicken-and-egg loop:

Which comes first, the classification or the causal connections between kinds? . . . to acquire and use a name for any kind is, among other things, to be willing to make generalizations and form expectations about things of that kind. . . . The kind and the knowledge grow together . . . [in the case of paediatric X-rays and child abuse] cause, classification, and intervention were of a piece (Hacking, 1995, p. 361).

Hacking’s historical treatment of looping introduces another level—a discursive level—of reiteration to processes of generalisation. That is, not only is generalisation reiterative in an analytic sense, it is also reiterative in a discursive/historical sense. Induction is shaped by habit and custom, and now generalisability has itself become a habitual expectation that continues to validate belief in itself.

CONCLUSION

Educational research standards such as those stipulated by the What Works Clearinghouse are based on theoretical premises that confound induction and prediction, confuse probability with certainty and conflate science with social management. These specific tendencies characterise generalisability in educational research as a modernist project. Another indication of the historical specificity of generalisability is that it is currently being replaced by other approaches. More recent literature on research theory indicates the emergence of scientific approaches to studies that are not focused on generalisability. These newer approaches have been called Bayesian5 (by Hacking, 2001) and design studies (by Shavelson et al., 2003). Both of these approaches reject the static and algorithmic science of generalisability and replace it with a dynamic and reiterative scientific approach. For example, design studies were showcased in a recent issue of the Educational Researcher: ‘Design studies have been characterized, with varying emphasis depending on the study, as iterative, process focused, interventionist, collaborative, multi-leveled, utility oriented, and theory driven’ (Shavelson et al., 2003, p. 26). Similarly, Hacking describes Bayesian probability, saying: ‘The Bayesian did not claim that premises give sound reasons for an inductive conclusion. Instead, it was claimed that there are sound reasons for modifying one’s opinions in the light of new evidence’ (Hacking, 2001, p. 262). These newer approaches to research abandon the presumption of generalisability as a desirable criterion for scientific study. The newer approaches are historically commensurate with post-analytical, post-empiricist and more pragmatic intellectual endeavours.
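The Bayesian stance Hacking describes can be illustrated with a minimal numerical sketch. The values below are invented for illustration and come from neither Hacking nor the design-studies literature; the point is only that an opinion is held with some probability and revised by Bayes’ rule each time new evidence arrives, rather than being established once and for all by its premises.

```python
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Revise the probability of a hypothesis H after observing evidence E.

    Bayes' rule: P(H|E) = P(E|H) P(H) / [P(E|H) P(H) + P(E|~H) P(~H)]
    """
    numerator = p_e_given_h * prior
    denominator = numerator + p_e_given_not_h * (1.0 - prior)
    return numerator / denominator

# Start from an even prior and revise it as evidence accumulates;
# each observation is assumed more likely if H is true (0.8) than if not (0.3).
belief = 0.5
belief = bayes_update(belief, p_e_given_h=0.8, p_e_given_not_h=0.3)  # ~0.727
belief = bayes_update(belief, p_e_given_h=0.8, p_e_given_not_h=0.3)  # ~0.877
```

The contrast with generalisability is visible in the structure of the computation: nothing here is a once-for-all inductive conclusion; every posterior is provisional and becomes the prior for the next revision.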

By analysing and historicising generalisability, this paper has sought to problematise the assumptions upon which the vast majority of fundable research standards are based. The array of analytic, statistical and historical critiques combines to suggest that generalisability is an historically specific phenomenon that was invented under particular circumstances and became an indispensable feature of educational research in spite of rational and logical arguments against such over-reliance. One of the effects of an overweening belief in the generalisability of research findings has been to narrow the scope of intellectual and scientific inquiry. Another effect has been to reinforce the ‘looping’ effects of categorising human behaviour. Finally, since generalisation is part of a reiterative process, generalisability in educational research seems more likely to provide us with validation of our preconceived notions and is less likely to contribute anything new for us to learn.

Correspondence: Lynn Fendler, Department of Teacher Education, Michigan State University, 116I Erickson Hall, East Lansing, MI 48824-1034, USA. Email: [email protected]

NOTES

1. See, for example, St. Pierre (2004); Shavelson, Phillips, Towne and Feuer (2003); Shavelson and Dempsey-Atwood (1976).
2. See, for example, Porter, 1995.
3. My thanks to Irfan Muzaffar for his research contributions in this section.
4. For this discussion I draw from several histories of statistics and histories of social sciences, for example, Daston, 1988; Hacking, 1990; Heilbron, 1995; Hilts, 1981; Popkewitz, 2001; Porter, 1986 and 1995; Stigler, 1986.
5. So named after Rev. Thomas Bayes (1702–1761), who published essays on probability and induction.

REFERENCES

Campbell, D. T. and Stanley, J. C. (1963) Experimental and Quasi-Experimental Designs for Research (Boston, Houghton Mifflin).

Cronbach, L. J., Nageswari, R. and Gleser, G. C. (1963) Theory of Generalizability: A Liberation of Reliability Theory, The British Journal of Statistical Psychology, 16, pp. 137–163.

Cronbach, L. J., Gleser, G. C., Nanda, H. and Rajaratnam, N. (1972) The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles (New York, John Wiley).

Daston, L. (1988) Classical Probability in the Enlightenment (Princeton, NJ, Princeton University Press).

Hacking, I. (1990) The Taming of Chance (Cambridge, Cambridge University Press).

Hacking, I. (1995) The Looping Effects of Human Kinds, in: D. Sperber, D. Premack and A. J. Premack (eds) Causal Cognition: A Multidisciplinary Debate (Oxford, Clarendon Press), pp. 351–394.

Hacking, I. (2001) An Introduction to Probability and Inductive Logic (Cambridge, Cambridge University Press).

Heilbron, J. (1995) The Rise of Social Theory, S. Gogol, trans. (Minneapolis, MN, University of Minnesota Press).

Hilts, V. L. (1981) Statist and Statistician (New York, Arno Press).

Hume, D. (1777/1964) An Enquiry Concerning Human Understanding, E. Steinberg, ed. (Indianapolis, IN, Hackett Publishing).

Koopman, J. (1999) Balancing Validity, Precision, Generalizability and Importance in the Design of Epidemiological Investigations, Epid655 Lecture of 1–22–99, http://www.sph.umich.edu/epid/epid655/TradeOffNotes.htm (accessed August 21, 2005).

Mill, J. S. (1843/1964) Inductive Grounds for Induction (from A System of Logic), in: I. M. Copi and J. A. Gould (eds) Readings on Logic (New York, Macmillan), pp. 275–286.

Myung, J., Balasubramanian, V. and Pitt, M. A. (2000) Counting Probability Distributions: Differential Geometry and Model Selection, Proceedings of the National Academy of Sciences, 97.21, pp. 11170–11175 (accessed August 20, 2005).

Pearson, K. (1978) The History of Statistics in the 17th and 18th Centuries Against the Changing Background of Intellectual, Scientific and Religious Thought: Lectures by Karl Pearson Given at University College London during the Academic Sessions 1921–1933 (New York, Macmillan Publishing).

Popkewitz, T. S. (2001) Educational Statistics as a System of Reason: On Governing Education and Social Inclusion and Exclusion. Conference Proceedings: Philosophy and History of the Discipline of Education: Evaluation and Evolution of the Criteria for Educational Research, Onderzoeksgemeenschap Research Community, P. Smeyers and M. DePaepe, directors (Leuven, Belgium, October 24–26, 2001).

Porter, T. M. (1986) The Rise of Statistical Thinking 1820–1900 (Princeton, NJ, Princeton University Press).

Porter, T. M. (1995) Trust in Numbers: The Pursuit of Objectivity in Science and Public Life (Princeton, NJ, Princeton University Press).

Russell, B. (1914/1964) The Principle of Induction (from The Problems of Philosophy), in: I. M. Copi and J. A. Gould (eds) Readings on Logic (New York, Macmillan), pp. 306–310.

Shavelson, R. and Dempsey-Atwood, N. (1976) Generalizability of Measures of Teaching Behavior, Review of Educational Research, 46.4, pp. 553–611.

Shavelson, R., Phillips, D. C., Towne, L. and Feuer, M. J. (2003) On the Science of Education Design Studies, Educational Researcher, 32.1, pp. 25–28.

Stigler, S. M. (1986) The History of Statistics: The Measurement of Uncertainty before 1900 (Cambridge, MA, The Belknap Press of Harvard University Press).

Thompson, B. (1989) Statistical Significance, Result Importance, and Result Generalizability: Three Noteworthy but Somewhat Different Issues, Measurement and Evaluation in Counseling and Development, 22 April, pp. 2–6.

Wagner, P. (1994) A Sociology of Modernity: Liberty and Discipline (New York, Routledge).

What Works, Criteria for Evaluation. http://www.whatworks.ed.gov/reviewprocess/standards.html
