DOI: 10.1037/13937-007
APA Handbook of Behavior Analysis: Vol. 1. Methods and Principles, G. J. Madden (Editor-in-Chief)
Copyright © 2013 by the American Psychological Association. All rights reserved.

CHAPTER 7

GENERALITY AND GENERALIZATION OF RESEARCH FINDINGS

Marc N. Branch and Henry S. Pennypacker

For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication. (Cohen, 1994, p. 997)

Confirmation comes from repetition. . . . Repetition is the basis for judging . . . significance and confidence. (Tukey, 1969, pp. 84–85)

As the general psychology research community becomes increasingly aware (e.g., Cohen, 1994; Loftus, 1991, 1996; Wilkinson & Task Force on Statistical Inference, 1999) of the limitations of traditional group designs and statistical inference methods with regard to assessing reliability and generality of research findings, we present an alternative approach that has been substantially developed in the branch of psychology now known as behavior analysis. In this chapter, we outline how individual subject methods, that is, so-called single-case designs, provide straightforward and, in principle, simple methods to assess the reliability and generality of research findings.

OVERVIEW

The chapter consists of three major sections. In the first, we summarize the limitations of traditional methods, especially as they relate to assessing reliability and generality of research findings concerning behavior. We make the case that traditional methods have obscured an important distinction that has led to psychology’s consisting of two related, but separable, subject matters, behavioral science and actuarial science. We also focus on the issue of generality across individuals and how traditional methods can give the illusion of such generality. In the second major section, we discuss dimensions of generality in addition to generality across individuals. Here we define scientific generality and several other forms of generality as well. In so doing, we introduce the roles of replication, both direct and systematic, in assessing generality of research results. We argue that replication, instead of statistical inference, is an alternative primary method not only for determining the reliability of results but also for assessing and characterizing the generality of scientific findings. In the third major section, we discuss generalization of treatment effects, the fundamentals of technology transfer, and the practices that characterize translational research. There, we write of programming for and assessment of generalizability of scientific findings to applied settings. We expand our view then to the engineering issues of technology development (or technology transfer and translational research) as a capstone demonstration of generalization based on an understanding of generality of research findings.

LIMITATIONS OF TRADITIONAL METHODS

The traditional group-mean, statistical-inference approach to analyzing research results has faced consistent criticism for more than 4 decades (e.g., Bakan, 1966; Carver, 1978; Cohen, 1994; Gigerenzer, Krauss, & Vitouch, 2004; Loftus, 1991, 1996; Meehl, 1967, 1978; Nickerson, 2000; Rozeboom, 1960). Most of that criticism has focused on what those methods have to say about the reliability of research findings, which is appropriate because if findings are not reliable, there is no need to assess their generality. These methods, however, have also been criticized with respect to theory testing and development, issues that directly relate to generality. We treat these two categories of criticism separately.

Preparation of this chapter was supported by National Institute on Drug Abuse Grant DA004074.

Significance Testing and Reliability

After all of the carefully reasoned criticism of significance testing that has been published, one would hope that a clear understanding of its limits would exist among professional psychologists. That, however, appears not to be true, as noted by Cohen (1994), who lamented that

after 4 decades of severe criticism, the ritual of null hypothesis significance testing . . . still persists. [As does] near universal misinterpretation of p as the probability that H0 is false, [and] the misinterpretation that its complement is the probability of successful replication. (p. 997)

Cohen’s assertion is supported by survey evidence revealing that a substantial majority of academic research psychologists incorrectly interpret p values and statistical significance (Haller & Krauss, 2002; Kalinowski, Fidler, & Cumming, 2008; Oakes, 1986). That a significant proportion of professional psychologists do not appreciate what statistical significance and, especially, p values represent is apparent testimony to a weakness in the training of research psychologists, a failing that lies at the feet of those of us who are engaged in teaching them. In fact, Haller and Krauss (2002) included a sample of statistical methodology instructors in their study and found that 80% of them were mistaken in their understanding of p values, so it comes as less of a surprise that the misconceptions are widespread. The following discussion, therefore, is another attempt to make clear what a p value is and what it means.

A p value, which results from a significance test, is a conditional probability. Specifically, it is the probability, if the null hypothesis is true, of obtaining data of a particular sort. That is, in algebraic symbols, it is p = P(Data|H0). The important point is that p ≠ P(H0|Data), which is what a researcher would presumably really like to know. In other words, a p value does not provide quantitative information about whether the null hypothesis is true, which is apparently widely misunderstood. Because it does not provide the oft-assumed information about the likelihood of the null hypothesis being true, a p value of .01 does not mean that the probability of the null hypothesis being true is 1 in 100. In fact, it conveys nothing quantitative about the truth of the null hypothesis. To see why, note that changing the order of conditionality in a conditional probability is crucially important. Consider such examples as P(Dead|Electrocuted) versus P(Electrocuted|Dead) or P(Cloudy|Raining) versus P(Raining|Cloudy). The first probability in each pair tells nothing about the second, just as P(Data|H0) reveals nothing about P(H0|Data). A p value, therefore, has quantitative meaning only if the null hypothesis is true, but when performing statistical tests not only does one not know whether the null hypothesis is true, one probably assumes it is not. The important fact is that a finding of statistical significance, via a small p value, does not imply that the null hypothesis is unlikely to be true. The incorrect logic underlying the mistaken conclusion (cf. Falk & Greenbaum, 1995) apparently goes as follows: If the null hypothesis is true, data of a certain sort are unlikely. I obtained data of that sort, so therefore the null hypothesis is unlikely to be true. That so-called logic is precisely the same as the following: If the next person I meet is an American, he or she is unlikely to be the President. I just met the President. Therefore, he or she is unlikely to be an American.
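The asymmetry between P(Data|H0) and P(H0|Data) can be made concrete with a minimal sketch of Bayes' rule. Everything numerical here is hypothetical and chosen only for illustration: the function name `posterior_null`, the prior probabilities, and the probability of the data under the alternative are assumptions, and "Data" is simplified to a single event (a result at least as extreme as the one observed).

```python
def posterior_null(p_data_given_h0, p_data_given_h1, prior_h0):
    """Return P(H0 | Data) from Bayes' rule for two exhaustive hypotheses.

    All three inputs are hypothetical quantities; a significance test
    supplies only the first of them.
    """
    prior_h1 = 1.0 - prior_h0
    evidence = p_data_given_h0 * prior_h0 + p_data_given_h1 * prior_h1
    return p_data_given_h0 * prior_h0 / evidence


# The same P(Data | H0) = .01 paired with different (assumed) priors and
# different (assumed) probabilities of the data under the alternative.
same_p_value = 0.01
scenarios = [
    ("data likely under H1, even prior", 0.50, 0.5),
    ("data rare under H1, even prior", 0.02, 0.5),
    ("data rare under H1, H0 favored a priori", 0.02, 0.9),
]
for label, p_data_given_h1, prior_h0 in scenarios:
    result = posterior_null(same_p_value, p_data_given_h1, prior_h0)
    print(f"{label}: P(H0 | Data) = {result:.3f}")
```

Under these assumed inputs, the identical value of P(Data|H0) is compatible with very different values of P(H0|Data), which is the sense in which the first probability tells nothing about the second.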

The fundamental misunderstanding of what a p value is leads directly to the more serious problem of assuming that it indicates something quantitative about the reliability, that is, the likelihood of replication, of the finding. A common misunderstanding (see Haller & Krauss, 2002, and Oakes, 1986, for evidence) is that a p value, for example of .01, is the complement of the probability of replication should the experiment be repeated. That is, the mistaken assumption is that if one conducted the experiment 100 times, one should replicate the result on 99 of those occasions (at least on average). If one knew that the null hypothesis were true, then that would be a correct interpretation of the p value. Of course, though, one does not know whether H0 is true (again, one usually hopes it is not). In fact, one conducts the statistical test so that one can make what one (mistakenly) hopes is an educated guess about whether it is true. Thus, to say on the basis of a small p value that a result is statistically reliable is to strain the meaning of reliable beyond reasonable limits.
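A small simulation can illustrate why a p value says nothing direct about how often a repeated experiment would again reach significance. The sketch below is built on stated assumptions, none of which come from the chapter: scores are drawn from a normal distribution with a known standard deviation of 1, the test is a two-sided z test of a zero mean, and the true effect sizes and sample size are hypothetical.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_sided_p(sample, sigma=1.0):
    """p value of a two-sided z test of H0: population mean = 0 (sigma assumed known)."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return 2.0 * (1.0 - normal_cdf(abs(z)))


def replication_rate(true_effect, n, alpha=0.05, runs=4000, seed=1):
    """Proportion of simulated repetitions of the experiment that reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        if two_sided_p(sample) < alpha:
            hits += 1
    return hits / runs


# Hypothetical true effects, in standard deviation units, with 16 scores per experiment.
for effect in (0.0, 0.5, 1.0):
    rate = replication_rate(effect, n=16)
    print(f"true effect {effect}: proportion significant = {rate:.2f}")
```

In this sketch the long-run proportion of significant repetitions is governed by the true effect and the sample size, quantities that a single p value does not reveal, so no one p value can be read as the complement of a replication probability.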

This limitation of statistical significance is not based on technical details of the null hypothesis. That is, the problem does not lie with whether the underlying distribution is formally normal or near normal or whether the statistical test involved is demonstrably robust with respect to violations of assumptions about the underlying distribution. The limitation is based in the logic of the approach. All the assumptions about the distributional characteristics of the null hypothesis might in fact be true, but that is not relevant when one is speaking of what a p value indicates.

A major limitation of statistical significance, therefore, is that it does not provide direct information about the reliability of research findings. Without knowledge about reliability, no examination of generality can occur because repeatability is the most basic test of generality. Notwithstanding that limitation, however, significance testing based on group means may be seen, incorrectly, to have implications for generality of findings across subjects. Adherence to this view unfortunately gains strength as sample size increases. In fact, however, regardless of sample size, no information about intersubject generality can be extracted from a significance statement because no knowledge is afforded concerning the number of subjects for whom the effect actually occurred. We examine the implications of this fact in more detail below.
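The last point, that a group mean carries no information about how many subjects showed the effect, can be sketched with two invented data sets. The change scores below are hypothetical, as is the helper `subjects_showing_effect`; they are not data from any study cited in this chapter.

```python
from statistics import mean

# Hypothetical pre-to-post change scores for 10 subjects in each of two studies.
uniform_changes = [2.0] * 10                      # every subject improves by 2
concentrated_changes = [10.0, 10.0] + [0.0] * 8   # only two subjects change at all


def subjects_showing_effect(changes, threshold=0.0):
    """Count the subjects whose change score exceeds the threshold."""
    return sum(1 for change in changes if change > threshold)


uniform_mean = mean(uniform_changes)
concentrated_mean = mean(concentrated_changes)
uniform_count = subjects_showing_effect(uniform_changes)
concentrated_count = subjects_showing_effect(concentrated_changes)

print("mean change:", uniform_mean, "vs", concentrated_mean)
print("subjects showing the effect:", uniform_count, "vs", concentrated_count)
```

Both invented studies yield a mean change of 2, yet the effect occurs in all 10 subjects in one and in only 2 subjects in the other, a difference that is visible only when individual scores are inspected.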

Aside from the limits surrounding reliability just described, other characteristics of group-mean data warrant examination as we move into a discussion of generality. It is here that we show that psychology, presumably because of the widespread use of significance testing, has developed two distinguishable subject matters.

Significance Testing and Generality

Traditional significance testing approaches in psychology are generally based on data averaged across individuals. As is well known, the mean from a group of individuals (a sample) provides an estimate of the mean of the entire population from which the sample is drawn, and that estimate can be bounded by confidence intervals that provide information (not the probability, however, that the population mean falls within the interval; see Smithson, 2003) about how confident one can be that the population mean lies within such intervals. Thus, the sample mean provides information about a parameter that applies to the entire population. That fact appears to imply substantial generality; it applies to the entire population (however delimited), so generality appears maximized. This raises two important issues.

First is the question of representativeness of the means, both sample and population. That is, identical or similar means can result from substantially different distributions of scores. Two examples that illustrate this fact are given in Figures 7.1 and 7.2. In Figure 7.1, four distributions of 20 scores are arrayed horizontally in the upper panel. In the top row, the values are arithmetically separated, whereas in the other three, they are clustered in various ways. Note that none of the four is particularly “normal” in appearance, that is, clustered in the middle. The four plots in the lower panel show the means (solid points) and standard deviations (bars) of the four distributions, with the top plot corresponding to the top distribution in the upper panel, and so on. They are, as planned, identical. These data show that identical means and standard deviations, the stock in trade of inferential statistics, can be obtained from very different distributions of values. That is, in these cases the means and standard deviations do not provide a particularly informative or representative indication of what the individual values are, which implies that when dealing with averages of measures, or averages across individuals, attention must be paid to the representativeness of the mean, not just its value, or even its standard deviation.

FIGURE 7.1. Upper panel: Four distributions of values, with each symbol representing one value on the x-axis. Lower panel: The corresponding means and standard deviations of the four corresponding distributions from the upper panel. From The Elements of Graphing Data (rev. ed., p. 215), by W. S. Cleveland, 1994, Summit, NJ: Hobart Press. Copyright 1994 by AT&T Bell Laboratories. Reprinted with permission.

Figure 7.2, which contains what is known as Anscombe’s quartet (Anscombe, 1973), provides an even more dramatic illustration of how focusing only on the average of a set of numbers can lead one to miss important features of that set. The four graphs in Figure 7.2 plot 11 values in x–y coordinates and show the best-fitting (via the least-squares method) straight line to the data. Obviously, the distributions of points are quite different in the four sets. Nevertheless, the means for the x values are all the same, as are their standard deviations. The same is true for the y values (yielding eight instances of the sort shown in Figure 7.1). In addition, the slopes and intercepts of the straight lines are identical for all four sets, as are the sums of squared errors and sums of squared residuals. Thus, all four would yield the same correlation coefficient describing the relation between x and y.

FIGURE 7.2. Anscombe’s quartet. Each of the four graphs shows 11 x–y pairs and the best-fitting (least-squares estimate) straight line through the points. The slopes and intercepts of the lines are identical. From “Graphs in Statistical Analysis,” by F. J. Anscombe, 1973, American Statistician, 27, pp. 19–20. Copyright 1973 by the American Statistical Association. Adapted with permission. All rights reserved.
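Readers who wish to check the summary statistics of Anscombe's quartet for themselves can do so with a short script. The sketch below uses the values published by Anscombe (1973); the function name `describe` and the choice to compute the least-squares line by hand with the standard library are ours. The statistics agree across the four sets to about two decimal places rather than exactly.

```python
from statistics import mean, stdev

x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def describe(xs, ys):
    """Means, standard deviations, least-squares line, and correlation for one data set."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "sd_x": stdev(xs),
        "mean_y": mean_y,
        "sd_y": stdev(ys),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r": sxy / (sxx * syy) ** 0.5,
    }


summaries = {name: describe(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed summaries are nearly indistinguishable, whereas plots of the four sets are not, which is why the summaries alone are a poor guide to the individual values.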

The point of these illustrations is to indicate that a sample mean, even though a predictor of a population mean, is not necessarily a good description of individual values, so it is not necessarily a good indicator of the generality across individual measures. When the measures come from individual people (or nonhuman animals), it follows that the average of the group may not reveal, and may well conceal, much about individuals. It is important to remember, therefore, that sample means from a group of individuals permit inferences about the population average, but these means do not permit inferences to individuals unless it is demonstrated that the mean is, in fact, representative of individuals. Surprisingly, it is rare in psychology to see the issue of representativeness of an average even mentioned, although recently, in the domain of randomized clinical trials, the limitations attendant to group averages have been gaining increased mention (e.g., Penston, 2005; Williams, 2010).

Many experimental designs, nevertheless, involve comparison across groups with large numbers of subjects, which raises the question of the practicality of presenting the data of every individual. The concern is legitimate, but the problem is not solved by resorting to the study of group averages only. Excellent techniques for comparing distributions, like stem-and-leaf plots, box plots, and quantile–quantile plots, are available (Cleveland, 1994; Tukey, 1977). They provide a more complete description of measures from individuals, or a useful subset (as can be the case with quantile–quantile plots), than do simple means and standard errors or means and confidence intervals. We presume that as null-hypothesis significance-testing approaches become less prevalent, more effort will be directed toward developing new and better techniques for comparing distributions, methods that will include and make evident the measures from individuals.
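The numerical ingredients of a box plot and of a quantile–quantile plot are easy to compute, as the following sketch suggests. The two groups of scores are invented for illustration, the function names are ours, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is only one of several in common use.

```python
from statistics import mean, quantiles

# Hypothetical scores for two groups of nine subjects with the same mean.
group_a = [4, 5, 5, 6, 6, 6, 7, 7, 8]       # clustered near the middle
group_b = [2, 2, 2, 3, 6, 9, 10, 10, 10]    # clustered at the extremes


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum (the basis of a box plot)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def qq_pairs(first, second):
    """Pair the ordered values of two equal-sized samples (the points of a Q-Q plot)."""
    return list(zip(sorted(first), sorted(second)))


print("means:", mean(group_a), mean(group_b))
print("group A summary:", five_number_summary(group_a))
print("group B summary:", five_number_summary(group_b))
print("Q-Q pairs:", qq_pairs(group_a, group_b))
```

The means of the two invented groups coincide, but the five-number summaries and the quantile pairs expose how differently the individual scores are arranged.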

Two Separable Subject Matters for Psychology?

In some instances, the difference between a population parameter, such as the population average, and the activity of an individual is obvious. For example, consider the average rate of pregnancy in women between 20 and 30 years old. Suppose that rate is 7%. That, of course, is a useful statistic and can be used to predict how many women in that age category will be pregnant. More important for the present purposes, however, is that the value, 7%, applies to no individual woman. That is, no woman is 7% pregnant. A woman is either pregnant or she is not.

What of situations, however, in which an average is representative of the behavior of individuals? For example, suppose that a particular teaching technique is discovered to result in a 10% increase in performance on some examination and that the improvement is at or near 10% for every individual. Is that not a case in which a group average would permit estimation of a population mean that is, in fact, a good descriptor of the effect of the training for individuals and, because it applies to the population, has wide generality? The answer is yes and no.

The point to be made here is somewhat subtle, and so we elaborate on it with an example. Consider a situation in which a scientist is trying to determine the relation between amount of practice at solving five-letter anagrams and subsequent speed at solving six-letter anagrams. Suppose, specifically, that no practice and 10, 50, 100, and 200 anagrams of practice are to be compared. After the practice, subjects who have never previously solved anagrams, except for those seen in the practice phase, are given 50 new anagrams to solve, and the time to complete is recorded. Because total practice might be a determinant of speed, the scientist opts to use a between-groups design, with each group being exposed to one of the practice regimens. That is, the hope is to extract the seemingly pure relation between practice and later speed, uncontaminated by prior relevant practice. The scientist then averages the data from each group and uses those means to describe the function relating amount of practice to speed of solving the new, more difficult anagrams. In an actual case, variability would likely be found among individuals within each group, so one issue would be how representative the average is of each member of each group. For our example, however, assume that the average is representative, even perfectly so (i.e., every subject in a group gives exactly the same value). The scientist has generated a function, probably one that describes an increase in speed of solving anagrams as a function of amount of prior practice. In our example, that function allows us to predict exactly what an individual would do if exposed to a certain amount of practice. Even though the means for each group are representative and therefore permit prediction about individual behavior, the important point is that the function has no meaning for an individual. That is, the function does not describe something that would occur for an individual because no individual can be exposed to different amounts of practice for the first time. The function is an actuarial account, not a description of a behavioral process. It is, of course, to the extent that the means are representative, a useful finding. It is just not descriptive of a behavioral process in an individual. To examine the same issue at the level of an individual would require investigation of sequences of amounts of practice, and that examination would have to include experiments that factor in the role of repeated practice. Obviously, such an endeavor is considerably more complicated than the study that generated the actuarial curve, but it is the only way to develop a science of individual behavior. The ontogenetic roots of behavior cumulate over lifetimes. In later portions of this chapter, we discuss how the complications may be confronted.

The point is not to diminish the value of actuarial data, nor to suggest that psychologists abandon the collection and analysis of such data. If means are highly representative, such data can offer predictions at the individual subject level. Even if the means are not highly representative, organizations such as insurance companies and governments can and do make important use of such information in determining appropriate shared risk or regulatory policy, respectively. The point is, using insurance rates as an example, that just because you are in a particular group, for example, that of drivers between the ages of 16 and 25, for which the mean rate of accidents is higher than for another group, does not indicate that you personally are more likely to have an automobile accident. It does mean, however, that for the insurance company to remain profitable, insurance rates need to be higher for all members of the group. Similarly, with respect to health policy, even though most people who smoke cigarettes do not get lung cancer, the incidence of lung cancer, on a relative basis, is substantially greater, on average, in that group. Because the group is large, even a low incidence rate yields a substantial number of actual lung cancer cases, so it is in the government’s, and the population’s, interest to reduce the number of people who smoke cigarettes.

The crux of the matter is that actuarial and behavioral data, although related in that the former depend on the latter, are distinguishable and, therefore, should be distinguished. Psychology, to the extent that it relies solely on the methods of inferential statistics that use averages across individuals, becomes an actuarial science, not a science of behavioral processes. The methods described in this chapter are aimed at including in psychology its oft-stated goal of being a science of behavior (or of the mind). Behavioral and inferred mental processes really make sense only at the level of the individual. (The same is true of physiology, which has become a rather exact science in part because of the influence of Claude Bernard, 1865/1957.) A person’s behavior, including thinking, imagining, and so forth, is particular to that person. That is, people do not share their minds or their behavior with others, just as they do not share their physiology. A counterargument is that behavior and mental activity are too variable from individual to individual to permit a scientific analysis. We based this chapter on the more optimistic view that such activity is amenable to study at the level of the individual. Because a good deal of application of psychological knowledge involves dealing with individuals, for example, as in psychotherapy, understanding at the level of the individual remains a worthy goal. Support for the viewpoint that a science of individual behavior is possible, however, requires an elaboration of how an individual subject–based analysis can yield information that is more generally applicable to more than one or a few individuals.

Why Single-Case Designs Do Not Mean That N = 1

Traditional approaches, with the attendant limitations described thus far, likely arose, at least in part, because of a legitimate concern about focusing research on individual subjects who are studied repeatedly through time (more on this later). Such research is usually performed with relatively few subjects, leaving open the possibility that effects seen might be limited with respect to generality across other individuals. An example, modeled after one offered by Sidman (1960), provides a response to such misgivings. Suppose we were interested in whether listening to classical music while solving arithmetic problems improves accuracy. Using a single-case approach, we start the study with a single subject. We might first establish a baseline of accuracy (more on this later) by measuring it over several successive exposures. Next, we would test the subject with the music present and then with it absent. Suppose we find that accuracy is increased when music is present and reverts to normal when it is not. Suppose also that unbeknownst to us, the effect music will have depends on the baseline level of accuracy; if accuracy is initially low, it is enhanced by the presence of music, whereas if it is initially high, it is reduced when the music is on. We might mistakenly conclude, on the basis of the results from the one subject, that music increases accuracy of solving the kinds of arithmetic problems used.

Let us compare how a more traditional between-groups approach might fare in dealing with the issue. We apply music to one group and not to another. What will result will depend on the distribution of baseline accuracy across individuals. Figure 7.3 shows three possible population distributions. In B, most people have low accuracy, in C most have high accuracy, and in A people fall into two groups with respect to baseline accuracy. If one performed the experiment on groups and took the group mean to be the indicator of the effect of the independent variable, the conclusion would depend on the underlying distribution. In A, the conclusion might well be that music has no effect, with the lowered accuracy in people with high baseline accuracy canceling out the increases that result among those with low baseline accuracy. If the population is distributed as in B, the conclusion would be that music increases accuracy because the mean would move in the direction of improved accuracy. The important point is that simply considering the group average makes it less likely that the baseline dependency that underlies the effect will be seen.

FIGURE 7.3. Three hypothetical frequency distributions characterizing the number of people displaying different baseline rates. From Tactics of Scientific Research: Evaluating Experimental Data in Psychology (p. 149), by M. Sidman, 1960, New York, NY: Basic Books. Copyright 1988 by Murray Sidman. Reprinted with permission.
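The way a group mean can hide the baseline dependency in the music example can be sketched numerically. Every number below is hypothetical: the rule that music adds 10 points of accuracy below a baseline of 50 and subtracts 10 points otherwise, the baseline values of 30 and 70, and the mixes of low- and high-baseline people standing in for distributions A, B, and C of Figure 7.3.

```python
from statistics import mean


def music_effect(baseline):
    """Hypothetical rule: music helps low-baseline subjects and hurts high-baseline ones."""
    return 10 if baseline < 50 else -10


def make_population(n_low, n_high, low=30, high=70):
    """Baseline accuracies for a population with two kinds of people (assumed values)."""
    return [low] * n_low + [high] * n_high


populations = {
    "A (two groups)": make_population(50, 50),
    "B (mostly low)": make_population(90, 10),
    "C (mostly high)": make_population(10, 90),
}


def summarize(baselines):
    """Mean change in accuracy plus counts of individuals who improved or worsened."""
    changes = [music_effect(baseline) for baseline in baselines]
    return {
        "mean_change": mean(changes),
        "improved": sum(change > 0 for change in changes),
        "worsened": sum(change < 0 for change in changes),
    }


summaries = {name: summarize(baselines) for name, baselines in populations.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Under these assumptions the mean change is 0 for A, 8 for B, and -8 for C, even though in every population each individual's accuracy changes by 10 points in one direction or the other. Only the counts of individuals who improved and worsened reveal that two opposite effects are present.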

Let us now compare what might transpire with the single-case approach, an approach based on replication. Having seen the effect in the first subject, we recruit a second and do the experiment again. Suppose that the population distribution is as depicted in Figure 7.3B. The most likely scenario is that the second subject will also have low baseline accuracy because someone sampled from the population is most likely to manifest modal characteristics. We get the same result and could, mistakenly, conclude that music enhances arithmetic accuracy. That is, we make the same mistake as with the group-average approach. The difference between the two approaches is that the group-mean approach makes it more difficult to discover the underlying, real effect. The single-case approach, if enough replications are done, will eventually and inevitably reveal the problem because sooner or later someone with high baseline accuracy will be examined and show a decrease. A key phrase in the previous sentence is “if enough replications are done.” Whether that happens is likely to depend on the perceived importance of the effect. If it is deemed important, it is likely to be subjected to additional research, which will, in turn, lead to additional replications. Thus, the single-case approach is not some sort of panacea with respect to identifying such relations, but it offers a direct path to corrective action. Of course, it is possible to ferret out the baseline dependency using a group-mean approach, but that will happen only if attention is paid to the data of individual subjects in a group. In the single-case approach, those data are automatically scrutinized. A major point is that single case does not necessarily imply that only one or even only a few subjects be examined. Some research questions might involve examination of many subjects. (We discuss later how to decide how many subjects to test.) What the approach involves is studying each subject essentially as an independent experiment. Generality across subjects is therefore examined directly by seeing how often the experiment’s effects are replicated. A second major point is that the apparent virtues of studying many subjects, a standard aspect of traditional research designs in psychology, are realized only if the data from each subject are individually analyzed.
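The contrast between the group mean and the individual data can be made concrete with a small sketch. The effect function, the cutoff of 50, and the baseline values below are hypothetical choices made only for illustration; they are not taken from Figure 7.3.

```python
from statistics import mean

def music_effect(baseline):
    """Hypothetical baseline-dependent effect: accuracy rises for low
    baselines and falls for high ones (illustrative values only)."""
    return 10.0 if baseline < 50 else -10.0

# Hypothetical baseline accuracies (percentage correct) for two populations.
symmetric = [30, 40, 60, 70]        # as many high as low baselines
mostly_low = [30, 35, 40, 45, 70]   # low baselines are modal

def summarize(baselines):
    """Return the group-mean change and the list of individual changes."""
    changes = [music_effect(b) for b in baselines]
    return mean(changes), changes

mean_sym, changes_sym = summarize(symmetric)
mean_low, changes_low = summarize(mostly_low)

print(mean_sym, changes_sym)   # group mean hides two opposite effects
print(mean_low, changes_low)   # group mean suggests a uniform increase
```

In both hypothetical populations the individual changes contain increases and decreases, which is the baseline dependency; only the list of individual changes, not the mean, shows it.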

Null-Hypothesis Significance Testing and Theory Development

A major goal in any science is the development of theory, and there is a sense in which theory has clear relevance to generality. Effective theories are those that account for a wide array of research results. That is, they apply generally. The way in which significance testing is most commonly used in psychology, however, militates against the orderly development and testing of theories and against the analysis of competing theories. The problem was first identified as a paradox by Meehl (1967; see also Meehl, 1978). The problem is a logical one based largely on the choice of the null hypothesis as “no effect.” The logic of the common approach is as follows. An investigator has a hypothesis that imposition of a variable, X, will change another measure, Y. This hypothesis is sometimes called the alternative hypothesis. The null hypothesis is then chosen to be that X will not change Y, that is, that it will be without effect. Next, the X condition is imposed, and Y is measured. A comparison is then made of Y without X and Y with X. A statistic is then calculated that is generally a ratio of changes in Y as a result of X over changes in Y as a result of anything else. In more technical terms, the statistic is effect variance over error variance. The larger the statistic, the smaller the p value, and the more likely it is that statistical significance is achieved and the null hypothesis rejected. Standard teaching demands that even though one can decide to reject the null hypothesis, logic prevents one from accepting the alternative hypothesis. Instead, one would say that if the null hypothesis is rejected, the alternative hypothesis gains support.

The paradox noted by Meehl (1967) arises from the nature of the statistic itself. The size of the statistic is controlled by two values, the effect size and the error variance, so it can be increased in two ways. The way of interest for this discussion is via a decrease in error variance, the denominator. A major way of decreasing error variance is through increased experimental rigor (one avenue of which is to increase the number of subjects). To the degree that extraneous variables (the “anything else” mentioned earlier) can be eliminated or held constant, error variance should decrease, making it more likely that the statistic will be large enough to warrant a decision as to statistical significance. The paradox, therefore, is that as experimental rigor is increased, that is, as experimental techniques are refined and improved, statistical significance becomes more likely, with the consequence that the alternative hypothesis gains support, no matter what the alternative hypothesis is. That does not seem like a recipe for cumulative progress in science. Simple null-hypothesis significance testing with the null hypothesis set at no effect cannot, by itself, help to develop theory.
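A brief numerical sketch may help show how the statistic grows as rigor increases. It uses the two-sample t statistic for equal-sized groups as a stand-in for the ratio described above (for two groups, the square of t equals the variance ratio). The effect size, standard deviations, and group sizes are hypothetical.

```python
import math

def t_statistic(mean_difference, sd, n_per_group):
    """Two-sample t for equal-sized groups sharing a common standard deviation."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return mean_difference / standard_error

effect = 0.5  # hypothetical fixed raw difference in Y produced by X

# Tighter control (smaller sd) or more subjects (larger n) both shrink the
# denominator, so the same small effect yields a larger statistic.
for sd, n in [(10.0, 20), (10.0, 2000), (2.0, 20), (2.0, 2000)]:
    print(f"sd={sd:5.1f}  n={n:5d}  t={t_statistic(effect, sd, n):.2f}")
```

Nothing in the calculation refers to what the alternative hypothesis actually is; any theory predicting some nonzero difference would be "supported" by the same numbers.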

Meehl (1967) described one approach that can obviate this paradox, which is to use significance testing with a null hypothesis that is not “no effect.” Instead, the null hypothesis is what the theory (or alternative hypothesis) predicts. Consider how the logic operates when this tactic is used. As experimental rigor increases, error variance is decreased, making it more likely that the resulting statistic will reach a critical value. When that value is achieved, the null hypothesis is rejected, but in this case it is the investigator’s theory that is rejected. Rather than increased experimental rigor resulting in its being easier for one’s theory to gain support, it results in its being easier to reject one’s theory. Increasing experimental control puts the theory to a more rigorous test, not an easier one as is the case when using the no-effect, or no-difference, null hypothesis. The harder one works to reject a theory and fails to succeed, the more confidence one has in the theory.

Training in statistical inference, at least for psychologists, does not usually emphasize that the null hypothesis need not be no effect. It can, nevertheless, as just noted, be some particular effect. Note that it has to be some specific value other than zero. The use of a particular value as the null hypothesis therefore requires that one’s theory be quantitative enough to generate a specific value. This approach is what characterizes tests of goodness of fit (those that use significance tests) of quantitatively specified functions.

This approach of setting the null hypothesis at a value predicted by theory is nevertheless not immune to the previously described weaknesses of significance testing in general. If, however, significance testing is used to make decisions, at least this latter approach does not suffer from the weakness of making it easier to support a researcher’s theory, regardless of what it is, as methods improve.
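The difference between the two choices of null hypothesis can be sketched with a one-sample t statistic. The measurements and the two theoretical predictions (5.0 and 5.5) are hypothetical, and the one-sample t is our choice of illustration rather than a procedure prescribed by Meehl (1967).

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, null_value):
    """t statistic for the difference between the sample mean and null_value."""
    standard_error = stdev(sample) / math.sqrt(len(sample))
    return (mean(sample) - null_value) / standard_error

observed = [4.8, 5.1, 5.0, 4.9, 5.2]  # hypothetical measurements of Y

t_no_effect = one_sample_t(observed, 0.0)  # conventional "no effect" null
t_theory_a = one_sample_t(observed, 5.0)   # hypothetical Theory A predicts 5.0
t_theory_b = one_sample_t(observed, 5.5)   # hypothetical Theory B predicts 5.5

print(t_no_effect, t_theory_a, t_theory_b)
```

With the no-effect null, the large statistic lends support to both theories alike. With each theory's prediction as the null, the precise measurements count against Theory B and fail to reject Theory A, and further reductions in error variance would make that test more demanding, not less.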

In this section of the chapter, we have made the case, we hope, that commonly used psychology research methods have limitations in assessing reliability and generality of research findings. In addition, the methods have resulted in many areas of psychology being largely actuarial, group-average–focused science rather than science aimed at the behavior of individuals. In the next section, we describe the basics of an alternative approach that is based on replication rather than significance testing and group averages. It is useful to remember that important science was conducted before the invention of significance testing, and what follows is a description of the application of methods used to establish most of modern physics and chemistry (and physiology) to the study of behavior. The approach focuses on understanding behavioral processes, rather than actuarial ones, and has already yielded a good deal of success, as other chapters in Volume 2 of this handbook illustrate. We should note, nevertheless, that even if the goal is actuarial prediction and influence, the methods of statistical inference are limited in what they can achieve with respect to reliability of research findings. As we argue, the only sure way to examine the reliability of results is to reproduce them, so replication is the key strategy for both subject matters of psychology.

ASSESSING RELIABILITY AND GENERALITY VIA REPLICATION

The two distinguishable categories of replication are direct replication and systematic replication, although, as we show, the distinction is not a sharp one. Most researchers are familiar with the concept of direct replication, which refers to repeating an experiment as exactly as possible. If the results are the same or similar enough, the initial effect is said to be replicated. Direct replication, therefore, is mainly used to assess the reliability of a research finding, but as we show, there is a sense in which it also provides information about generality. Systematic replication is the designation for a repetition of the experiment with something altered to see whether the effect can be observed in changed circumstances. If the results are replicated, then the generality of the finding is extended to the new circumstances. Many varieties of systematic replication exist, and it is the strategy most relevant to examining the generality of research findings.

Direct Replication: Within-Subject Reliability and Baselines

In the first part of this section, we describe the methods and roles of direct replication with the same experimental subject (i.e., a truly single-case experiment). We open with this simplest case, and with an example, not only to illustrate how the strategy can be used, but also to deal more clearly with reservations about and limitations of the approach as well as how decisions about characteristics of the replicative process may be made.

For our example, suppose that we want to measure the amount of a certain kind of food eaten after some period without food. We let our subject eat after 12 hours of fasting; suppose that she eats 250 grams. Direct replication of this observation would require that we do the same test again. One possible, but unlikely, result would be that she would eat 250 grams again, providing an exact replication. The amount eaten would more likely be slightly different, say 245 grams. We might then conduct another replication to see whether the trend toward eating less was replicable. Suppose on that occasion our subject eats 257 grams, making it less likely that there is a trend toward less ingestion with successive tests. We could repeat the process again and again. By repeatedly observing the amount eaten after a 12-hour fast, we gain more confidence with each successive measurement about how much our subject will eat of that particular food after 12 hours of not eating.
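One way an investigator might formalize a judgment about a run of successive observations is sketched below. The criterion (the last five observations all within 5% of their own mean) is purely an assumption for illustration; no such rule is a convention of the field. The first three values come from the example; the rest are hypothetical.

```python
from statistics import mean

def is_stable(observations, window=5, tolerance=0.05):
    """Return True if the last `window` observations all lie within
    `tolerance` (a proportion) of their own mean. Hypothetical criterion."""
    if len(observations) < window:
        return False  # too few replications to judge
    recent = observations[-window:]
    center = mean(recent)
    return all(abs(x - center) <= tolerance * center for x in recent)

grams_eaten = [250, 245, 257, 251, 248, 253]  # grams eaten on successive tests

print(is_stable(grams_eaten[:3]))  # only three observations so far
print(is_stable(grams_eaten))      # six observations
```

The window and tolerance are parameters an investigator would set from prior knowledge of the preparation, and different investigators could defensibly choose different values.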

One thing that direct replication can provide, via a sequence of direct, intrasubject replications such as that just described, is a baseline. The left segment of Figure 7.4 shows that there appears to be a steady baseline amount of intake in our example. A question that might arise is how many observations are needed to establish a baseline, that is, to come up with a convincing assessment? The answer is that it depends. There is no rule or convention about how many replications are needed to render an outcome considered reliable in the eyes of the scientific community. One factor of importance is how much is already known. In some of the more advanced physical sciences, a single replication (usually by a different research team) might be adequate. In our example, the researcher might have conducted similar research previously, discovered that the baseline value does not change after 10 observations, and thus deemed 10 replications enough. The researcher who chooses replication as a strategy to determine reliability of findings, therefore, does not have the comfort of a set of conventions (akin to those available to investigators who use conventional levels of statistical significance) to decide whether an effect is reliable enough to warrant reporting to the scientific community. Instead, the investigator’s judgment plays a role, and his or her scientific reputation is dependent to some degree on that judgment. One of the comforts of a set of conventions is that if a researcher abides by them and results are later discovered, via failed attempts at replication, not to be reliable, that researcher’s reputation suffers little. In contrast, one can argue that there are both advantages and disadvantages to relying on replication. Important advantages are having the benefit of informed judgment, especially of a seasoned investigator, and the fact that social pressure rides more directly on the researcher’s reputation. The disadvantage comes from the lack of an agreed-on set of conventions. Principled arguments about which is better for science can be made for both positions, but we favor the view that science, as a social–behavioral activity, will fare better, or at least no worse, if researchers are held more accountable for their conclusions about reliability and generality than for their adherence to a set of arbitrary, often misunderstood conventions.

FIGURE 7.4. Hypothetical data from a series of observations of eating (x-axis: successive tests; y-axis: grams eaten; phases: Baseline - Food 1, Food 2, Food 1). The first 10 points and last six points are amounts eaten of Food 1. The middle six points are amounts eaten of Food 2.

Returning to the role of a baseline construed as a set of intrasubject replications, such baselines can serve as bases of comparison for effects of experimental changes. For example, after establishing a baseline of eating of the first food, we could change what the food is, perhaps to one that is less tasty or more or less calorie laden. The second set of points in Figure 7.4, which in essence depicts measures from a second set of replications, has been chosen to indicate a decrease. The reliability of the effect is illustrated by the successive similarity of values, and judgments about how many replications are needed would be based on the same sorts of considerations as involved in the original baseline. A usual check would involve return to the original food, and the third set of points indicates a likely result, once again with a series of replications. The overall experiment, therefore, is an example of the ubiquitous A-B-A design (see Chapter 1, this volume).
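The logic of the A-B-A comparison can be sketched as follows. The values are hypothetical stand-ins for the 10, 6, and 6 points of Figure 7.4, and the nonoverlap check is one simple summary we chose for illustration, not a required analysis.

```python
from statistics import mean

# Hypothetical grams eaten in each phase of an A-B-A sequence.
phases = [
    ("Food 1", [250, 245, 257, 251, 248, 253, 249, 252, 247, 250]),
    ("Food 2", [160, 152, 148, 150, 153, 149]),
    ("Food 1", [246, 251, 249, 253, 248, 250]),
]

def summarize(phases):
    """Return (label, mean, minimum, maximum) for each phase, in order."""
    return [(label, mean(v), min(v), max(v)) for label, v in phases]

summary = summarize(phases)
(_, a1_mean, a1_min, _), (_, b_mean, _, b_max), (_, a2_mean, a2_min, _) = summary

effect_shown = b_max < a1_min     # every Food 2 value is below every baseline value
effect_reversed = b_max < a2_min  # intake recovers when Food 1 returns

for row in summary:
    print(row)
print(effect_shown, effect_reversed)
```

Each value within a phase is treated as a replication, so the similarity of values within phases and the separation between phases are both visible in the minimum and maximum of each phase.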

Replication, of course, need not refer only to a series of successive measurements under identical conditions to produce a baseline. If the type of finding summarized in Figure 7.4 were especially counterintuitive or at considerable odds with existing knowledge, one might well repeat the entire project, Food 1 to Food 2 to Food 1, and that, too, would constitute a direct intrasubject replication. In fact, the entire project could be carried out multiple times if, in the investigator’s judgment, such confirmation was necessary. Each successful replication increases confidence that the independent variable, change of food type, is responsible for the change in eating.

Direct Replication: Between-Subjects Reliability and Generality

After all this work, an immediate limitation is that the findings, so far as we know, may well apply only to the one person studied. Our first result is based on intrasubject replication. If the goal of the research was to see whether the change in food can influence eating, then it may be the case that no further replication is needed. It is likely, however, that our interest extends beyond what is possible to what is common. In that case, additional replication is in order, which brings us to the next type of direct replication, replication with different subjects, or intersubject replication. Intersubject replication is used to examine generality, in this case across subjects, and in this single-case design N is extended to more than 1. Intersubject replication makes clear the fuzziness of the distinction between direct and systematic replication. The latter is generally defined as a replication with something changed (see below), and a new subject is certainly a change. We also suggest that systematic replication is a main strategy for assessing generality, and by studying a second subject, generality across individuals is on trial. It is even possible to suggest that most replications, even intrasubject replications, are, in fact, systematic. For example, in the intrasubject replication described above, time is different for successive observations, and the subject brings a different history to each observation period. It nevertheless has become standard to characterize replications in which the procedures are essentially the same as direct replications. As we outline shortly, systematic replications are characterized by changes in procedure or conditions that can be quite substantial.

As noted in the section Significance Testing and Generality earlier in this chapter, an emphasis on replication with individual subjects approaches the issue of subject generality by increasing the number of subjects studied. Suppose, for the sake of our example, we study a second subject, performing the entire experiment, the whole sequence of baseline, new food, and baseline, over again. There are two major classes of outcomes. One, we get the same effect. Two, we do not. Let us deal initially with the former possibility. The first issue is what we would accept as “same.” The second person’s baseline level would likely not be exactly the same, and in fact, it might be notably different, averaging, say, 200 grams. Should we count that as a failure to replicate? The answer is (again), it depends. If our major concern was the exact amount eaten and the factors contributing to that, then the result might well be considered a failure to replicate. We will hold off for a bit, however, on what to do in the face of such failures, and move forward on the assumption that we are not so much concerned with the exact amount eaten as with whether the change in food results in a change in amount eaten. In that case, we might replicate, with the second subject, the whole sequence of conditions, Food 1, Food 2, and back to Food 1. Two possibilities exist: The results are the same as for the first subject or they are not, and again, consequently, an important issue is what is meant by same. The results are unlikely, especially in behavioral science, to be identical quantitatively, and, in fact, if the baseline is different, the change in intake cannot be identical in both absolute and relative terms, so we are left to decide whether to focus on what is different or on what is similar. In this stage of the discussion, let us assume that intake decreased, as it had for the first subject. In that case, we might feel confident that an important feature of the data has been replicated. A next question, then, would be whether additional replication with other subjects is needed. In this particular example, the answer would most likely be yes, but as is generally the case, the real answer is that it depends on what the goals of the experiment are.
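The point that a change cannot match in both absolute and relative terms when baselines differ is simple arithmetic, sketched here. The baselines of 250 and 200 grams come from the example; the Food 2 intake levels are hypothetical.

```python
def changes(baseline, new_level):
    """Return the (absolute, relative) change from baseline to new_level."""
    absolute = new_level - baseline
    relative = absolute / baseline
    return absolute, relative

subject_1 = changes(250, 200)         # 50 g less, a 20% decrease

# Subject 2 starts from a 200 g baseline, so only one kind of match is possible.
same_absolute = changes(200, 150)     # 50 g less, but a 25% decrease
same_relative = changes(200, 160)     # a 20% decrease, but only 40 g less

print(subject_1, same_absolute, same_relative)
```

Whether the 50-gram match or the 20% match would count as "the same" result is exactly the kind of decision the investigator has to make and justify.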

Behavioral scientists, by and large, tend to focus on similarities rather than differences, so if features of data reveal similarity across individuals, those similarities are likely to be pursued. Consider, therefore, a situation in which the data for the second subject are dissimilar, not only in quantitative terms but in qualitative ones as well. For example, suppose that for the second subject the change from Food 1 to Food 2 results in an increase in amount eaten rather than a decrease. Here, there is no question that an important aspect of the first result has not been replicated. What is to be done then? The answer lies in the assumption of determinism that is at the core of behavioral science. If there is a difference observed between Subject 1 and Subject 2, that difference is the result of some other influence. That is, people do not differ for no reason. In fact, the failure to replicate the exact intake levels at baseline must also be a result of some factor. Failure to replicate, therefore, is an occasion on which to initiate a search for the variable or variables responsible for the differences in outcomes. Suppose, for example, that Subject 1 was female, and Subject 2 was male. Tests with other men and women (note the expansion of N) could reveal whether this factor was important in determining the outcome. Similarly, we have already assumed different baseline levels, so it might be the case that baseline level is related to the direction of change in intake, a hypothesis that can be examined by studying additional subjects. It is interesting that examination of this second possibility could be aided if the issue of different baselines between Subject 1 and Subject 2 had been assumed to be a failure to replicate. In that case, we would have focused on reasons for the difference and may have identified factors that determine baseline level. If that were so, it might be possible to control the baseline levels and to change them systematically, thus providing a direct method for studying the relation between baseline level and the effect of changing the food.

Another possible reason that disparate effects are observed between subjects is differing sensitivity to the particular value of the independent variable used. In the example just described, the independent variable was characterized qualitatively as a change in food type, making sensitivity to it difficult to assess. If, however, the independent variable can be characterized quantitatively, for instance by carbohydrate content in our example, the technique of systematic replication, elaborated below, can be used to examine the possibility.

An important issue in considering direct replication arises when intersubject replication succeeds but intrasubject replication does not. Taking our example, suppose that when the conditions were changed back to Food 1 with our first subject (cf. Figure 7.4), eating remained at the lower level, which would prevent replication of the effect in Subject 1. Such a result indicates either that some variable other than the change of food was responsible for the decrease in eating or that the exposure to Food 2 has produced a long-lasting change in eating. Support for the second view can come from attempts at intersubject replication. If experiments with subsequent subjects reveal that a shift from Food 1 to Food 2 results in a relatively permanent decrease in eating, the effect is verified.

When initial effects are not recaptured after intervening experience that produces a change, the change is said to be irreversible. Using replication to examine irreversible effects requires intersubject replication, so we have here another instance in which N = 1 does not mean that only one subject need be studied. Many effects in psychology are irreversible, for example, those that we call learning, so the individual subject approach requires that intersubject replication be used to assess the reliability of such effects, and in so doing the generality of the effect across subjects is automatically examined.

A focus on each subject individually, of course, does not prevent the use of traditional data analysis approaches, should an investigator be so inclined (for inferential statistical analyses appropriate to single-case research designs, see Chapters 11 and 12, this volume). Some, for example, might want to present group averages so that actuarial predictions can be made. Standard techniques can be used simply by engaging in the usual sorts of data manipulation. An emphasis on the data from individuals, nevertheless, can be used to enhance the presentation. For example, consider a study by Dunn, Sigmon, Thomas, Heil, and Higgins (2008), who compared two conditions aimed at reducing cigarette smoking. In one, vouchers were given contingent on breath samples that indicated that no smoking had occurred, whereas in the other the vouchers were given independently of whether the subject had smoked. Figure 7.5 shows some of the results. The bars show group means, and the dots show data from each individual, illustrating the degree to which effects were replicable across patients and the representativeness of the group averages. Such a display of data provides considerably more useful information than do presentations that include only means or results of tests of statistical significance.

Systematic Replication: Parametric Experiments

To this point, our emphasis has been on the intra- and intersubject generality and reliability of effects, and we have argued that individual subject approaches can be effectively used to assess them. Generality of effects, however, is not limited to generality across individuals, and it is to other forms of generality, culminating with scientific generality, that we now turn.

As noted earlier, systematic replication refers to replication with something changed, and, as also noted, a case can be made that replication with a new subject is a form of systematic replication in that it is an experiment with something changed, namely the experimental subject. From such replications come assessments of the across-subject generality of effects. In this section, we discuss other sorts of changes between experiments that constitute systematic replication. To do so, let us begin again with our example of effects of food type on eating. Suppose that after obtaining the data in Figure 7.4, we perform a systematic replication of the study rather than a direct repetition. For example, we might notice that Food 2’s carbohydrate content is higher than that of Food 1. We decide, therefore, to alter the carbohydrate content of Food 2 (and let us assume, likely impossible, without changing the taste) so that it matches that of Food 1, and repeat the experiment. Such an experiment would examine the generality of Food 2’s effect on eating to a new carbohydrate level. If adjusting Food 2’s carbohydrate amount to equal that of Food 1 resulted in the switch in foods having no effect on eating, two things can be concluded. One, the original result was not replicated. In such cases, it is often wise to replicate the original experiment to determine whether unknown variables might have been responsible. Two, carbohydrate amount is identified as a likely important variable. Thus, systematic replication is not only a method for discovering generality of effects, it is also an approach that can lead to finding controlling variables.

FIGURE 7.5. Number of days of continuous abstinence from smoking cigarettes in two groups of subjects. Circles are data from individuals. Open bars and brackets show the group means and standard errors of those means. Subjects represented by the left bar received vouchers contingent on abstinence, whereas those represented by the right bar received vouchers independent of their behavior. The top bracket and asterisk indicate that the mean difference was statistically significant at the .01 level. From “Voucher-Based Contingent Reinforcement of Smoking Abstinence Among Methadone-Maintained Patients: A Pilot Study,” by K. E. Dunn, S. C. Sigmon, C. S. Thomas, S. H. Heil, and S. T. Higgins, 2008, Journal of Applied Behavior Analysis, 41, p. 533. Copyright 2008 by the Society for the Experimental Analysis of Behavior, Inc. Reprinted with permission.

Continuing our description of types of systematic replication, let us assume we decide to examine more fully the role of carbohydrates in eating. Our original experiment may be conducted several times but with a different carbohydrate mix for Food 2 on each occasion. Each repetition of the experiment, then, constitutes a systematic replication because a new value of carbohydrate is used for each instance. Experiments that systematically vary the value of a variable are called parametric experiments, and they play an especially important role in assessing generality. Consider the data in Figure 7.6, which are constructed to emulate what might result if several intersubject replications of a parametric experiment were conducted.

Parametric examination provides a number of advantages when assessing the reliability and generality of results. First, had only a single value of the independent variable been assessed, we might have been less than impressed with the degree of intersubject replicability of the data. The results of parametric examination, however, reveal a good deal of similarity across the three subjects: All show the same basic relation. At low percentages, the amount eaten is roughly constant within each individual. As the percentage increases, the amount eaten decreases until the percentage reaches a value above which further increases are associated with no changes in amount eaten. Second, and this is a key characteristic of parametric evaluation, the data suggest that only a range of levels of the independent variable result in a change in behavior. That is, parametric experiments permit the identification of boundary conditions, or limiting conditions, outside of which a variable is relatively ineffective. As we show later when dealing with the issue of scientific generality, information about boundary conditions can be extremely important.

Figure 7.6 also illustrates how parametric experiments can help deal with the problem of lack of intersubject replicability when a single value of an independent variable is examined. Recalling our original example of comparison of food types, consider what could have happened if our first two subjects were Subjects 1 and 3 of Figure 7.6 and Food 1 had contained 20% carbohydrate and Food 2 had contained 25%. Changing the food type would have produced a change for Subject 1 but not for Subject 3, leading to a conclusion that we had failed to replicate the food change effect across subjects. The parametric examination, however, shows that both subjects are similar in how food intake was influenced by carbohydrate content, except that behavior of the two subjects was sensitive in a slightly different range. One of the most satisfying outcomes of parametric experiments is when they reveal similarities that cannot be judged when only a single value of an independent variable is tested.

FIGURE 7.6. Hypothetical data for three subjects showing the relationship between carbohydrate content and amount eaten.
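The comparison between a single-value test and a parametric examination can be sketched with a simple model. The piecewise-linear function and every parameter value below are assumptions made for illustration; they are not read from Figure 7.6.

```python
def intake(carb_percent, high, low, start, end):
    """Grams eaten as a piecewise-linear function of carbohydrate percentage.

    Intake stays at `high` up to `start`, declines linearly to `low` at
    `end`, and stays at `low` beyond that. All parameters are hypothetical.
    """
    if carb_percent <= start:
        return high
    if carb_percent >= end:
        return low
    fraction = (carb_percent - start) / (end - start)
    return high - fraction * (high - low)

# Same functional form for both subjects, but different sensitive ranges.
subjects = {
    "Subject 1": dict(high=250, low=150, start=15, end=30),
    "Subject 3": dict(high=220, low=120, start=25, end=40),
}

# Single-value test: Food 1 at 20% versus Food 2 at 25% carbohydrate.
single_value_change = {
    name: intake(25, **p) - intake(20, **p) for name, p in subjects.items()
}

# Parametric examination: change across the full 0% to 50% range.
full_range_change = {
    name: intake(50, **p) - intake(0, **p) for name, p in subjects.items()
}

print(single_value_change)
print(full_range_change)
```

The single-value test shows a decrease for one hypothetical subject and none for the other, whereas the full-range comparison shows the same overall relation in both, with the `start` and `end` parameters playing the role of boundary conditions.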

It is worth noting, too, that parametric experiments can reveal that apparent intersubject replicability can be misleading regarding how a variable influences behavior. It is possible that tests with a single value of an independent variable might lead to very similar quantitative results for several subjects, whereas a parametric analysis reveals that very different functions describing the relation between the independent variable and behavior happen to cross or come close together at the particular value of the independent variable evaluated.

Parametric experiments illustrate one of the strengths of being able to characterize independent variables quantitatively. Experiments that determine how much of this yields how much of that provide more information about generality than do experiments that simply test whether a particular value of an independent variable has an effect. They can identify similarity where none is evident with a single value of an independent variable, and they can also determine whether apparent similarity is unrepresentative.

We should note that parametric experiments are not limited in application to only primary independent variables, such as that shown in our fictitious example. Any variable associated with an experiment can be systematically varied. As an example, the experiment just described could be conducted under a range of temperatures, a range of degrees of hydration of the subjects, a range of times without food before the test, and any of several other variables. Those experiments, too, would provide information about the range of conditions under which the independent variable of carbohydrate content exerts its effects in the circumstances of the experiment.

Parametric experiments, although very important, are not the only kind of systematic replications. One other type involves using earlier findings as a starting point, or baseline, for examination of other variables. As an example, consider the phenomenon of false memory in the laboratory, produced by a procedure originally developed by Deese (1959) and later elaborated by Roediger and McDermott (1995). In these studies, subjects said they recalled or recognized words that were not presented. A great deal of research followed the original demonstrations, and these experiments varied procedural details, measurement techniques, subject characteristics, and so forth. In each instance, therefore, in which the false memory effect was reproduced, the reliability of the phenomenon was demonstrated and its generality extended. Using the reproduction of previous findings as a starting point for subsequent research, therefore, is a useful and productive technique for examining reliability and generality of research outcomes.

Sidman (1960), in his characterization of techniques of systematic replication, described a type he called “systematic replication by affirming the consequent” (p. 127). Essentially, this approach is very similar to the idea of hypothesis testing because the systematic replication is not based on simply changing some aspect of the experiment to see whether effects can still be reproduced but rather on what the investigator sees to be the implications of previous results. That is, the replication may be based on the investigator’s interpretation of what the data mean.

For example, consider our fictitious study of the effects of carbohydrate content on eating. That result, and perhaps those of other experiments, might suggest that the phenomenon is not specific to eating. Carbohydrate ingestion possibly leads to general lethargy or low motivation for voluntary behavior. If we suspect that, we might devise other experiments that could be viewed as systematic replications based on the possible implications of the previous findings. If the results were consistent with the lethargy interpretation, the view would gain in credence; if they were not, the view might well be abandoned. As Sidman (1960) noted, definite conclusions may not be drawn from successful replications by affirming the consequent, but, as he also noted, the approach is essential to science. The degree to which one’s confidence in an interpretation of data grows with successful replications depends on many things, not the least of which is how counterintuitive the predicted outcome is.

Types of Generality Assessed and Established by Systematic Replication

Johnston and Pennypacker (2009) offered a useful characterization of the dimensions along which generality can be examined. They initially suggested a dichotomy between “generality of” and “generality across.” Generality across is simple to understand. As we have already noted, replication can be used to determine generality across subjects or situations, a type of generality usually of considerable interest. Systematic replication comes to the fore in the assessment of generality across species and across settings. By definition, systematic replication is an attempt at replication with something different, so if the species is changed, or if something (or a lot) about the setting is altered, the replication attempt is a systematic one. In both cases, the issue of what constitutes a successful replication may arise. Consider, for example, if we decided to attempt a cross-species replication of our experiments with food types, and our new species was a mouse. Obviously, mice would eat considerably less, and therefore a precise, quantitative replication would not be possible. We might (actually, probably would), however, argue that the replication was successful if the relation between carbohydrate content and eating was replicated, that is, if at low concentrations there was little effect on eating, but as carbohydrate content increased, the amount eaten decreased until some level was reached above which further decreases were not seen (cf. Figure 7.6).

What if the content values at which the decreases begin and end differ between the species? For example, mice may begin to show a decline when the food reaches 15% carbohydrate, whereas with the humans, decreases are not evident until the food contains 25% carbohydrate. Is that a failure to replicate? Again, the answer is yes and no. The business of science is to find regularities in nature, so emphasis is properly placed on similarities. Differences virtually always exist, so they are easy to find. Nevertheless, they cannot be ignored entirely, but their main role is not to indicate that the similarities evident are somehow unimportant, but rather to promote further research into the origins of the differences if the differences are judged to be important. The scientist and the scientific community make judgments about the need for further investigation of the differences that are always present in replications.
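One way to make the "yes and no" answer concrete is to separate the qualitative form of the relation from its quantitative details. The following Python sketch does so under stated assumptions: the gram amounts are invented, only the 15% and 25% onset values come from the example above, and the helper names `relation_shape` and `decline_onset`, along with the 5% tolerance, are hypothetical choices made for illustration.

```python
def relation_shape(amounts, tol=0.05):
    """Collapse a series of amounts eaten into its qualitative shape.

    Each step between adjacent carbohydrate levels is labeled "flat",
    "down", or "up" according to its size relative to the first
    (baseline) amount; consecutive identical labels are merged.
    """
    baseline = amounts[0]
    shape = []
    for prev, cur in zip(amounts, amounts[1:]):
        change = (cur - prev) / baseline
        if abs(change) <= tol:
            label = "flat"
        else:
            label = "down" if change < 0 else "up"
        if not shape or shape[-1] != label:
            shape.append(label)
    return shape


def decline_onset(levels, amounts, tol=0.05):
    """Return the carbohydrate level at which the first decrease begins."""
    baseline = amounts[0]
    for level, prev, cur in zip(levels, amounts, amounts[1:]):
        if (cur - prev) / baseline < -tol:
            return level
    return None


carb_levels = [5, 15, 25, 35, 45, 55]          # percent carbohydrate
human_grams = [500, 495, 505, 400, 300, 295]   # invented data
mouse_grams = [5.0, 5.1, 4.0, 3.0, 2.9, 3.0]   # invented data

print(relation_shape(human_grams), decline_onset(carb_levels, human_grams))
print(relation_shape(mouse_grams), decline_onset(carb_levels, mouse_grams))
```

With these invented data the two species share the same flat, decreasing, flat form (the similarity), while the onset values differ (the difference that may or may not merit further research).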

Generality of also plays an essential role in science. Johnston and Pennypacker (2009) described several categories of generality of, but here we focus on one in hopes of making the concept clear: generality of process. Our example is a behavioral process familiar to most psychologists, specifically the process of reinforcement of operant (purposive) behavior. Reinforcement refers to the increase in likelihood of behavior as a result of earlier instances being followed by certain consequences, which is the process. Systematic replications across an immense range of behavioral activities and a very large range of consequences have been shown to provide instances of the process. For example, in addition to the traditional lever press and key peck, activities ranging from the electrical activity of an imperceptible movement of the thumb (Hefferline, Keenan, & Harford, 1959), to vocal responses of chicks (Lane, 1960), to generalized imitation in children with developmental delays (Baer, Peterson, & Sherman, 1967), to the extensive range of activities described in the use of reinforcement in the treatment of behavior disorders (e.g., Martin & Pear, 2007; Ullman & Krasner, 1966) have all been shown as instances of the process. Similarly, the range of events used as effective consequences to produce reinforcement is also broad. Consequences such as praise, food, intravenous drug administration, opening a window, reducing a loud noise, access to exercise, and many, many others have been effectively used to produce reinforcement. All the reports may be viewed as describing systematic replications of the earliest experiments on the process (e.g., Skinner, 1932; Thorndike, 1898).

This generality of process is what stands as the rationale for speaking of reinforcement theory. The argument is similar to that offered for the motion of objects. Whatever those objects are, and whether they are falling, floating, being ballistically projected, or orbiting in outer space, they can be subsumed under the notion of gravitational attraction, Newton’s theory of gravity. An even more dramatic example is provided by living things. All manner of plants and animals populate the earth, and their differences are obvious and virtually countless. What is less obvious but explains the variety is that all life can be considered to have developed from the operation of three processes: variation, selection, and retention (Darwin, 1859). The sameness of cellular architecture, including nuclear material (e.g., DNA and RNA), also attests to the similarity. Likewise, all the myriad instances of reinforcement suggest that considering them instances of a single process is reasonable. As noted earlier, an important goal of science is to discover uniformities. In fact, as Duhem (1954) noted, one of the key features of explanation is identification of the like in the unlike. Objects look different, are made of different substances, and may or may not be moving in a variety of ways, but they are similar in how they are affected by gravity. Behavioral activities take on many forms, and as just noted, so can the consequences of those activities. Nevertheless, they can (on many occasions) exhibit the phenomenon known theoretically as reinforcement, an instance of generality of process.

Scientific Generality

Another extremely important concept is scientific generality, a type of generality that has some counterintuitive characteristics. Scientific generality is important for at least two reasons. One, scientific generality speaks to scientists’ ability to reproduce their own findings and those of other scientists, as well. Two, scientific generality speaks directly to the possibility of effective application and translation of laboratory findings to the world at large, as discussed more fully in the last section of this chapter. Scientific generality is defined by knowledgeable reproducibility. That is, it is not characterized in terms of breadth of applicability, but instead in terms of identification of factors that are required for a phenomenon to occur. To illustrate the difference between scientific generality and, for example, generality across people, consider again the fictitious experiment on food types. Suppose that the original experiments were all performed with male subjects. On an attempt at replication with female subjects, it is discovered that food type, or carbohydrate composition, has no effect at all on eating. That, of course, would be clear indication of a limit to the across-subjects generality of the effect on eating. It would, however, represent an increase in scientific generality because it specifies more clearly the conditions required to produce the phenomenon of reduced food intake. As stated by Johnston and Pennypacker (2009), “A procedure can be quite valuable even though it is effective under a narrow range of conditions, as long as we know what those conditions are” (pp. 343–344). The vital role that systematic replication, and even failures of systematic replication, can play in establishing scientific generality therefore becomes evident. Scientific generality represents an understanding of the variables responsible for a phenomenon.

GENERALIZATION, TECHNOLOGY TRANSFER, AND TRANSLATIONAL RESEARCH

The function of any science is the acquisition of basic knowledge. A secondary benefit is often the possibility of applying that knowledge in ways that impart benefit to some element of the culture at large. For example, Galileo’s basic astronomic observations eventually led to improved navigation procedures with attendant benefits to the colonial powers of 17th-century Europe. Pasteur’s discovery in 1863 of the microorganisms that sour wine and turn it into vinegar, and the observation that heat would kill them, led eventually to the germ theory of disease and the development of vaccines.

In the case of behavior analysis, a relatively young science, sufficient basic knowledge has been acquired to permit vigorous attempts at application. A discipline known as applied behavior analysis, discussed extensively elsewhere in Volume 2 of this handbook, is the primary result of these efforts, although applications of the findings of behavior analysis are to be found in a variety of other disciplines including medicine, education, and management, to name but a few.

In this section, we describe issues surrounding attempts to apply laboratory research findings in the wider world. Specifically, we discuss topics related to applying research findings from controlled laboratory or therapeutic settings to new situations or less controlled environments. First, we describe the issue of generalization of behavioral treatment effects from treatment settings to real-world circumstances. Then we outline basic general strategies for effective transfer of technologies, taking into account the known scientific generality of behavioral processes. Finally, we offer comments on the notion of translational research, a matter of much contemporary interest.

Generalization of Applications

One of the earliest subjects of discussion that arose with the development of behavior therapy and behavior modification techniques was the issue referred to as generalization (e.g., Yates, 1970). Specifically, there was concern about whether improvements produced in a therapy setting would also appear in other, nontherapy (e.g., everyday life) situations. The term generalization was borrowed from a core behavioral process discovered by experimental psychologists, that after learning to respond in a particular way in the presence of a particular stimulus, say frequency of a tone, the same behavior may also occur in the presence of other more or less similar stimuli, say, other frequencies of the tone. It is an apparently simple logical step to suggest that behavior learned in a therapy environment might also appear in nontherapy, real-world environments, and when it does so, the result can be called generalization (but see Johnston, 1979, for problems with such a simple extrapolation). Because applied behavior analysis generally involves establishing conditions that alter behavior, the issue of whether those changes are restricted to the learning situations arranged or whether they become evident in other situations is usually important. For example, if a child who engages in aggressive behavior is exposed to a treatment program to reduce aggression, a goal would be to decrease aggression not only in the treatment setting but in all settings.

In a seminal article, Stokes and Baer (1977) discussed the issue of generalization of treatment effects. A key contribution of their article was to indicate that in general, if effects of a treatment are to be manifested in a variety of circumstances, achieving that outcome must be considered in designing the intervention intended to effect the change in behavior. That is, it is not always sufficient to simply arrange circumstances that produce a desired change in behavior in the circumscribed environment in which the treatment is undertaken. Instead, procedures should be used that increase the probability that the change will be enduring and manifested in those parts of a client’s environment in which the changes are useful. That insight has been followed by the development of general strategies to enhance the likelihood that behavior changes occur not only in the treatment environment but also in other appropriate ones.

For example, Miltenberger (2008) described several general strategies that can be used to promote generalization of treatment effects. The most direct strategy is to arrange for rewards to occur immediately after instances of generalization occur. Such an approach essentially entails taking treatment to the environments in which it is hoped the new behavioral patterns will occur. That is, the training environment is not explicitly delimited. Such an approach is now widespread in applied behavior analysis partly as a consequence of an emphasis on analyzing reinforcement functions before implementing treatment (see Iwata, Dorsey, Slifer, Bauman, & Richman, 1982). This approach to problem behavior entails discovering whether the behavior is maintained by reinforcement, and if it is, identifying what the reinforcers are in the environments in which the problem behavior occurs. Once the reinforcers responsible for the maintenance of the problem behavior are identified, then procedures based on that knowledge are implemented in the situations in which the behavior occurs.

A related second strategy identified by Miltenberger (2008) is consideration of the conditions operating in the environments in which the changed behavior would be occurring. The idea here is that behavior that is changed in the therapeutic setting, for example learning effective social skills, will lead, if performed, to more satisfying social interactions in the nontherapy environment, and those successes will help to solidify the gains made in the therapy sessions. In designing the therapeutic goals, therefore, consideration is given to what sorts of behavior are most likely to be successful in the nontherapy environment.

A less obvious strategy applies when the nontherapy environment appears to offer little or no support for the changed behavior. An example is when therapy is aimed at training an adolescent to walk away from aggressive provocation in a schoolyard. Behaving in such an aggression-thwarting manner is not likely to result in positive outcomes with peers, who are instead likely to provide taunts and jeers after such actions. In such a case, it may be prudent to try to change the normal consequences in the schoolyard by having teachers or other monitors provide positive consequences (perhaps in the form of privileges, praise, etc.) for such actions when they occur. That is, the strategy here involves altering the contingencies operating in the nontherapy environment.

A fourth general strategy is to try to make the therapy setting more like the nontherapy environment in which the changed behavior is to occur. A study by Poche, Brouwer, and Swearingen (1981) illustrated this approach. They taught abduction prevention skills to preschool children, but in so doing incorporated a relatively large number of abduction lures in the training. The intent was that by including a wide variety of possible lures that might be used by would-be kidnappers, the training would be more effective in real-world situations than if it had not involved those variations. The general strategy in this case was to train with as many likely relevant situations as possible. Another way to view this strategy is that it involves incorporating stimuli that are present in the nontherapy environment into the training.

A fifth approach is somewhat less intuitive, but research has suggested that it may be effective. The core idea is that if a variety of different forms of effective behavior are established by the therapy or training, the chance of effective behavior occurring in the nontherapy environment is better, and as a result the successful behavior will be supported and continue to occur. As a simple illustration, Miltenberger (2008) offered the example of teaching a shy person a variety of specific ways to ask for a date, which provides the person with several actions to try, some of which are likely to be successful outside of therapy.

In this section, we focused on particular strategies for ensuring that desired changes in behavior established through therapeutic methods occur and persist in nontraining or nontherapy environments, that is, in the everyday world. Employment of tactics emerging from the strategies described has yielded many successes, and the methods are part of the armamentarium of applied behavior analysts. These techniques to promote generalization of behavior changes have emerged from a consideration of fundamental behavioral processes that have been identified and analyzed in basic research and then subsequently validated as effective through applied research. They represent, consequently, what can be called successful transfer from basic science to effective technology, namely, an instance of what has come to be called technology transfer. In the next section, we discuss some general principles of effective technology transfer.

Technology Transfer

People often use the term technology to refer to the body of applied knowledge and practices that emanate from basic science. The term technology transfer refers to the process by which the discoveries or inventions of basic science actually make their way into the body of technology and become available for use outside the laboratory by any individual willing to undergo the expense of acquiring the technology. Technology transfer can occur with its basis in any science, so general principles exist that apply across scientific disciplines. The process is somewhat complex, and pausing to review some of the basic details that apply when a technology is brought to the commercial level will be helpful. The criteria, both legal and practical, that must be met for successful technology transfer set an upper limit against which applications of science can be evaluated.

A discovery or invention is an item of intellectual property. It belongs to the inventor or discoverer or whoever sponsored the research that led to its existence. The transfer process usually involves a second party taking ownership of the property and thus the right to exploit it for commercial gain. The property therefore must be protected, usually by means of a patent or copyright. Once ownership is secured, it can be transferred in exchange for a lump sum payment or, more often, a license agreement whereby the licensee acquires the exclusive right to produce and distribute the technology and the licensor is entitled to a royalty based on sales. Thus, for example, the Quaker Oats Company executed a licensing agreement with the University of Florida that allowed the company to produce and distribute Gatorade exclusively in exchange for a royalty in the form of a percentage of revenue.

Requirements for Technology Transfer

For a candidate technology to meet the eligibility requirements for transfer, it must meet three criteria: quantification, repetition, and verification. These terms arise from the engineering literature (e.g., Hench, 1990); their counterparts in the behavioral literature will become obvious in a moment. Let us consider these characteristics in more detail and see how they conform to the products of behavior-analytic research.

First, we discuss quantification. Behavior analysis has long used the measurement strategies of the natural sciences (Johnston & Pennypacker, 1980) with the result that, as Osborne (1995) stated,

Physical standards of measurement bind behavior analysis to the physical and natural sciences. Interpretation of dependent variables need not change from experiment to experiment. It is a feature of . . . idemnotic measures that response frequencies on a particular parameter of a fixed-ratio schedule of reinforcement can be compared validly within sessions and across sessions, within laboratories and across laboratories, within species and across species. (p. 249)

We are therefore able to state precisely and unambiguously the quantitative characteristics of behavior resulting from application of a particular procedure.

Repetition is the practical use of replication as discussed earlier. The phenomenon must be able to be reproduced at will for it to serve as a component of a transferrable technology. An early (late 1950s and early 1960s) example of this feature is the application by pharmaceutical companies of the reproducible effects of reinforcement schedules in evaluating drugs. A standard approach was to establish a known baseline of performance by an animal using a particular schedule of reinforcement, then evaluate the perturbation, if any, of an experimental compound on that performance. If the perturbation could be reliably reproduced (repeated), the relation between the compound and its effect on behavior was affirmed (cf. McKearney, 1975; Sudilovsky, Gershon, & Beer, 1975).

Verification is most akin to the concept of generality. In establishing the generality of a behavioral process or phenomenon, researchers seek to specify the range of conditions under which the phenomenon occurs while eliminating those that are irrelevant or extraneous. Similarly, when transferring a technology, the recipient must be afforded a complete specification of the necessary and sufficient conditions for reproduction of the effects of the technology. Extensive research to nail down the parameters of generality is the only way to achieve this objective. It cannot be obtained by appeal to the results of significance tests for all of the reasons detailed earlier.

A simple yet elegant example of a well-established behavioral technology that has easily transferred is the Invisible Fence for animal control, which is a direct application of the principles of signaled avoidance. A wire is buried underground around the perimeter of the enclosed area in which the animals are to remain. The animal wears a collar that receives auditory signals (beeps) and delivers electric shocks through two electrodes that contact the animal’s neck. As the animal comes within a few feet of the wire, a beep sounds. If the animal proceeds, it receives a shock. The owner teaches the animal to withdraw on hearing the beep by carrying (or leading) the animal into the proximity of the beep, then shouting “No!” and carrying or leading the animal away from the beep. After several such trials, the animal is released and will eventually receive the shock. Its escape response has been well learned, and it will very likely never contact the shock again. Rather, it will avoid it when it hears the beep.
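The signaled-avoidance contingency just described can be sketched as a short simulation. This is only an illustration under assumed values: the warning and shock distances, the step size, and the function names `collar_output` and `walk_toward_wire` are hypothetical and are not specifications of the commercial product.

```python
def collar_output(distance_ft, warning_ft=6.0, shock_ft=2.0):
    """Return what the collar delivers at a given distance from the wire.

    The two distances are assumed values; the description above says only
    that the beep sounds within a few feet of the wire and that the shock
    follows if the animal proceeds.
    """
    if distance_ft <= shock_ft:
        return "shock"
    if distance_ft <= warning_ft:
        return "beep"
    return "none"


def walk_toward_wire(start_ft, step_ft, withdraws_on_beep):
    """Simulate one approach to the wire in steps of `step_ft` (positive).

    An animal that has learned the avoidance response withdraws at the
    first beep; one that has not continues until it contacts the shock.
    """
    events = []
    distance = start_ft
    while distance > 0:
        output = collar_output(distance)
        events.append(output)
        if output == "shock" or (output == "beep" and withdraws_on_beep):
            break
        distance -= step_ft
    return events


untrained = walk_toward_wire(10.0, 2.0, withdraws_on_beep=False)
trained = walk_toward_wire(10.0, 2.0, withdraws_on_beep=True)
print("untrained:", untrained)
print("trained:  ", trained)
```

In the sketch, the beep reliably precedes the shock, so withdrawing at the beep prevents contact with the shock altogether, which is the defining feature of signaled avoidance.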

A more elaborate example of a behavioral technology that has been successfully transferred is the MammaCare Method of manual breast examination as a means of early detection of breast cancer (Pennypacker, 1986). This example is unusual because the basic research was conducted by the same individuals who eventually took the technology to market. A capsule history of the development of MammaCare and its subsequent transfer is available online (Pennypacker, 2008).

In brief, a high-fidelity simulation of human breast tissue was created with the help of materials science engineers. Patent protection for this device was obtained. Basic psychophysical research using this simulation was conducted to determine the limits of detectability by human fingers of lifelike simulations of breast tumors and other abnormalities. Once these limits were known, behavioral studies isolated the components of a technique that allowed an examiner to approach the established psychophysical limits. Early translational research established that practicing the resulting techniques on the simulation enabled examiners to detect real lumps in live breast tissue at about twice their baseline accuracy. Extensive research was then undertaken to establish procedures for teaching the new examination technique, which became known as MammaCare.

Technology transfer became possible when standards of performance were established and training methods were devised that could be readily repeated and whose results could be verified. As a result, individuals wanting to offer such training either to the public or to other medical professionals (e.g., people who routinely perform clinical breast examinations) may now become certified and can operate independently.

Technology transfer was greatly accelerated in the United States by the passage in 1980 of the Bayh–Dole Act, which made it possible for institutions conducting research with federal funds to retain the products of that research. Offices of licensing and technology soon appeared in all of the major research universities, which in turn began licensing to private organizations the products of their sponsored research. The resulting revenue in the form of fees and royalties has been of significant benefit to these institutions. Most of this activity, however, has taken place in the hard sciences, engineering and medicine. Fields such as behavior analysis were not ready to enjoy the stimulative effects of the Bayh–Dole Act at the time it became law. Analogous federal attention to the clinical and behavioral sciences has emerged in the form of the National Institutes of Health’s National Center for Research Resources, which makes Clinical and Translational Science Awards. Mace and Critchfield (2010) cited a statement by the acting director of the National Institutes of Health’s Office of Behavioral and Social Sciences to the effect that “its [the institute’s] mission is science in pursuit of fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to extend healthy life and reduce the burdens on illness and disability” (p. 307).

The aim is to accelerate the application of basic biomedical research to clinical problems. We now turn our attention to this type of research.

Translational Research

From our perspective, translational research is a node along the continuum from basic bench science to the sort of application that results from technology transfer. The continuum is abstractly defined by generality. The ultimate goal of translational research may be broadly seen as establishing the limits of generality of whatever variable, process, or procedure is of interest.

Translational research is therefore a somewhat less stringent endeavor than full technology transfer to bridge the gap from bench to bedside. A distinguishing feature of this approach is that the basic scientist and clinician often collaborate in a synergistic manner. This practice will likely accelerate the development of candidate technologies because the applied aspect is undergoing constant examination and refinement.

Lerman (2003) has provided an excellent overview of translational research in behavior analysis. She correctly observed that the bulk of the literature on applied behavior analysis consists of reports of translational research. She went on to describe a series of maturational stages of this endeavor, beginning with the early demonstrations of the generality of the process of reinforcement across species, individuals, and settings. From these emerged concerns with other basic processes such as extinction, stimulus generalization, discrimination, and the effects of basic contingencies of reinforcement.

As these types of demonstrations increasingly proved beneficial to clinical populations, a new dimension of this research emerged that focused on issues of training, maintenance of benefits, and even the implications of such practices for public policy. Concurrently, focus shifted from individual cases to larger entities such as schools, corporations, and even military units.

At the same time, a small but growing body of translational research will explicitly hasten the development of mature technologies that can be transferred with the usual attendant financial and cultural benefits. One such effort is a study by St. Peter Pipkin, Vollmer, and Sloman (2010), who examined the effects of decreasing adherence to a schedule of differential reinforcement of alternative behavior, first in a laboratory setting and then, using the exact same reinforcement schedule parameters, in an educational setting with two individuals with educational handicaps. They explored the generality of a procedure across settings and populations and further documented the effects of deliberate failure to impose the differential reinforcement of alternative behavior schedule as specified, either by not delivering reinforcement when required or by “accidentally” allowing undesirable behavior to contact reinforcement at various times. These manipulations constitute an attempt to demonstrate the consequences of breakdowns in treatment integrity, which in some cases can be highly destructive and in others may be negligible.

The type of translational research just described constitutes an important step toward the development of a technology that can be transferred in the sense discussed earlier. Treatment integrity is directly analogous to what the engineers call verification. St. Peter Pipkin et al. (2010) provided guidance for establishing a range of allowable treatment integrity failure within which effectiveness may be maintained, which is akin to specifying tolerances in a manufacturing process or allowable failure rates of components in complex equipment.
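The notion of a tolerance on treatment integrity can be illustrated with a brief Python sketch. The numbers are invented for illustration and are not data from St. Peter Pipkin et al. (2010); the function name `integrity_tolerance`, the outcome measure, and the 80% criterion are likewise assumptions.

```python
def integrity_tolerance(results, criterion):
    """Return the lowest integrity level that still meets `criterion`.

    `results` maps treatment integrity (percentage of opportunities on
    which the schedule was implemented as specified) to an outcome
    measure. Levels are scanned downward from the highest, and scanning
    stops at the first level that falls below the criterion, so the value
    returned bounds a continuous range of acceptable integrity. Returns
    None if even the highest level fails.
    """
    lowest_ok = None
    for integrity in sorted(results, reverse=True):
        if results[integrity] < criterion:
            break
        lowest_ok = integrity
    return lowest_ok


# Invented outcomes: percentage of responses that were appropriate
# at each level of treatment integrity.
hypothetical_outcomes = {100: 95, 80: 92, 60: 85, 40: 50, 20: 30}

tolerance = integrity_tolerance(hypothetical_outcomes, criterion=80)
print("Effectiveness maintained down to", tolerance, "percent integrity")
```

With these invented values, effectiveness is maintained down to 60% integrity, which would play the role of the stated tolerance in a specification of the procedure.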

General Considerations for Translational Research

Mace and Critchfield (2010) offered an informative perspective on the current role of translational research in behavior analysis. They stressed the importance of conducting research that can be of more or less immediate benefit to society if substantial societal support for such research is to occur. In our view, the likelihood of such research actually attaining that criterion would be augmented to the extent that researchers keep as their ultimate goal development of a technology that can be transferred as we have discussed.

In fact, very few examples of commercially transferrable technology have yet emerged from behavior analysis (Pennypacker & Hench, 1997). There is, however, sufficient promise in the replicability of the discipline’s basic findings to encourage development of transferrable technologies, and the availability of substantial research support is critical. More translational research aimed at identifying and isolating the conditions under which specified procedures can be assured to consistently generate measurable and desirable effects on the behavior of individuals will hasten the emergence of such technologies.

SUMMING UP

The subdiscipline known as behavior analysis offers an alternative approach to assessing reliability and generality of research findings, that is, an approach that is different from that used by most psychological researchers today. The methods that provide avenues to assessing reliability and generality may be of interest to psychologists who approach the field of behavior (or mind) from perspectives other than those shared by behavior analysts. At this juncture in the history of behavioral science, the methods might be especially attractive to researchers who are coming into contact with the substantial limitations of traditional methods that rely on group averages and null-hypothesis significance testing. In the early sections of this chapter, we reiterated those weaknesses because it appears that many behavioral researchers are not aware of them.

Our main thrust in this chapter has been to describe and characterize types of replication and the roles that they play in determining the reliability and generality of research outcomes. We have especially emphasized the role replication can play in assessing the generality of research findings, both across subjects and conditions and of theoretical assertions. We have argued, in fact, that replication, both direct and systematic, currently represents the only set of methods that can determine whether results are reliable and how general they are. Our claim is a strong one, and it is not that replication is an alternative set of methods but rather that it is the only way to determine reliability and generality given current knowledge. Replication has served the more developed sciences very effectively, and it is our contention that it can serve behavioral science, too. At the very least, we hope we have convinced the reader that paying greater attention to replication will advance behavioral science more surely and more rapidly than the methods currently in fashion.

In the final section of the chapter, we focused on issues of applying science to problems the world faces. Once reliability and generality of research findings have been established to an appropriate degree, it is sometimes possible to take advantage of that knowledge for the betterment of people and society. There are guideposts about how best to do that, and we have discussed some of them.

The other chapters in this handbook present a wide-ranging description of the research and application domains that constitute behavior analysis. We see those chapters as testament to the coherent science and technology that can be developed when the markers of reliability and generality have been established through research founded on direct and systematic replication.

References

Anscombe, F. J. (1973). Graphs in statistical analysis. American Statistician, 27, 17–21. doi:10.2307/2682899

Baer, D. M., Peterson, R. F., & Sherman, J. A. (1967). The development of imitation by reinforcing behavioral similarity to a model. Journal of the Experimental Analysis of Behavior, 10, 405–416. doi:10.1901/jeab.1967.10-405

Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423–437. doi:10.1037/h0020412

Bayh-Dole Act, Pub. L. 96-517, § 6(a), 94 Stat. 3018. (1980).

Bernard, C. (1957). An introduction to the study of experimental medicine. New York, NY: Dover. (Original work published 1865)

Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378–399.

Cleveland, W. (1994). The elements of graphing data. Summit, NJ: Hobart Press.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. doi:10.1037/0003-066X.49.12.997

Darwin, C. (1859). On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. London, England: John Murray.

Deese, J. (1959). On the prediction of occurrence of particular verbal intrusions in immediate recall. Journal of Experimental Psychology, 58, 17–22. doi:10.1037/h0046671

Hefferline, R. F., Keenan, B., & Harford, R. A. (1959). Escape and avoidance conditioning in human subjects without their observation of the response. Science, 130, 1338–1339.

Duhem, P. (1954). The aim and structure of physical theory (P. P. Wiener, Trans.). Princeton, NJ: Princeton University Press.

Dunn, K. E., Sigmon, S. C., Thomas, C. S., Heil, S. H., & Higgins, S. T. (2008). Voucher-based contingent reinforcement of smoking abstinence among methadone-maintained patients: A pilot study. Journal of Applied Behavior Analysis, 41, 527–538. doi:10.1901/jaba.2008.41-527

Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory and Psychology, 5, 75–98. doi:10.1177/0959354395051004

Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were afraid to ask. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 391–408). Thousand Oaks, CA: Sage.

Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research—Online, 7(1), 1–20.

Hench, L. L. (1990, August). From concept to commerce: The challenge of technology transfer in materials. MRS Bulletin, pp. 49–53.

Iwata, B. A., Dorsey, M. F., Slifer, K. J., Bauman, K. E., & Richman, G. S. (1982). Toward a functional analysis of self-injury. Analysis and Intervention in Developmental Disabilities, 2, 3–20.

Johnston, J. M. (1979). On the relation between generalization and generality. Behavior Analyst, 2, 1–6.

Johnston, J. M., & Pennypacker, H. S. (1980). Strategies and tactics of human behavioral research. Hillsdale, NJ: Erlbaum.

Johnston, J. M., & Pennypacker, H. S. (2009). Strategies and tactics of behavioral research (3rd ed.). New York, NY: Routledge.

Kalinowski, P., Fidler, F., & Cumming, G. (2008). Overcoming the inverse probability fallacy: A comparison of two teaching interventions. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 4, 152–158.

Lane, H. (1960). Control of vocal responding in chickens. Science, 132, 37–38. doi:10.1126/science.132.3418.37

Lerman, D. C. (2003). From the laboratory to community application: Translational research in behavior analysis. Journal of Applied Behavior Analysis, 36, 415–419. doi:10.1901/jaba.2003.36-415

Loftus, G. R. (1991). On the tyranny of hypothesis testing in the social sciences [Review of The empire of chance: How probability changed science and everyday life]. Contemporary Psychology, 36, 102–105.

Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161–171. doi:10.1111/1467-8721.ep11512376

Mace, F. C., & Critchfield, T. S. (2010). Translational research in behavior analysis: Historical traditions and imperative for the future. Journal of the Experimental Analysis of Behavior, 93, 293–312. doi:10.1901/jeab.2010.93-293

Martin, G., & Pear, J. (2007). Behavior modification: What it is and how to do it (8th ed.). Upper Saddle River, NJ: Pearson.

McKearney, J. W. (1975). Drug effects and the environmental control of behavior. Pharmacological Reviews, 27, 429–436.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103–115.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834. doi:10.1037/0022-006X.46.4.806

Miltenberger, R. G. (2008). Behavior modification: Principles and procedures (4th ed.). Belmont, CA: Thomson.

Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301. doi:10.1037/1082-989X.5.2.241

Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. Chichester, England: Wiley.

Osborne, J. G. (1995). Reading and writing about research methods in behavior analysis: A personal account of a review of Johnston and Pennypacker’s Strategies and Tactics of Behavioral Research (2nd ed.) and others. Journal of the Experimental Analysis of Behavior, 64, 247–255. doi:10.1901/jeab.1995.64-247

Pennypacker, H. S. (1986). The challenge of technology transfer: Buying in without selling out. Behavior Analyst, 9, 147–156.

Pennypacker, H. S. (2008). A funny thing happened on the way to the fortune, or lessons learned during 25 years of trying to transfer a behavioral technology. Behavioral Technology Today, 5, 1–31. Retrieved from http://www.behavior.org/resource.php?id=188

Pennypacker, H. S., & Hench, L. L. (1997). Making behavioral technology transferrable. Behavior Analyst, 20, 97–108.

Penston, J. (2005). Large-scale randomized trials: A misguided approach to clinical research. Medical Hypotheses, 64, 651–657. doi:10.1016/j.mehy.2004.09.006

Poche, C., Brouwer, R., & Swearingen, M. (1981). Teaching self-protection to young children. Journal of Applied Behavior Analysis, 14, 169–175. doi:10.1901/jaba.1981.14-169

Roediger, H., & McDermott, K. (1995). Creating false memories: Remembering words not presented in lists. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 803–814. doi:10.1037/0278-7393.21.4.803

Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416–428. doi:10.1037/h0042040

Sidman, M. (1960). Tactics of scientific research. New York, NY: Basic Books.

Skinner, B. F. (1932). On the rate of formation of a conditioned reflex. Journal of General Psychology, 7, 274–286. doi:10.1080/00221309.1932.9918467

Smithson, M. (2003). Confidence intervals. London, England: Sage.

Stokes, T. F., & Baer, D. M. (1977). An implicit technology of generalization. Journal of Applied Behavior Analysis, 10, 349–367. doi:10.1901/jaba.1977.10-349

St. Peter Pipkin, C., Vollmer, T. R., & Sloman, K. N. (2010). Effects of treatment integrity failures during differential reinforcement of alternative behavior: A translational model. Journal of Applied Behavior Analysis, 43, 47–70. doi:10.1901/jaba.2010.43-47

Sudilovsky, A., Gershon, S., & Beer, B. (Eds.). (1975). Predictability in psychopharmacology: Preclinical and clinical correlations. New York, NY: Raven Press.

Thorndike, E. L. (1898). Animal intelligence: An experimental study of the associative processes in animals. Psychological Review, 11(4, Whole No. 8).

Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83–91. doi:10.1037/h0027108

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

Ullman, L. P., & Krasner, L. (Eds.). (1966). Case studies in behavior modification. New York, NY: Holt, Rinehart & Winston.

Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604. doi:10.1037/0003-066X.54.8.594

Williams, B. A. (2010). Perils of evidence-based medicine. Perspectives in Biology and Medicine, 53, 106–120. doi:10.1353/pbm.0.0132

Yates, A. J. (1970). Behavior therapy. New York, NY: Wiley.