McCleary & McDowall (2012) Time-series Designs


  • CHAPTER 32

    TIME-SERIES DESIGNS Richard McCleary and David McDowall

    Time-series designs are distinguished from other designs by the properties of time-series data and by the necessary reliance on a statistical model to control threats to validity. In the long run, a time-series is a realization of a latent causal process. Representing the complete time-series as

    the observed series $(Y_1, \ldots, Y_N)$ is a probability sample of the complete realization. The probability sampling weights for $(Y_1, \ldots, Y_N)$ are specified in a statistical model, which, for present purposes, is written in a general linear form as

    (2)

    The $a_t$ term of this model is the $t$th observation of a strictly exogenous innovation series with the white noise property,

    (3)

    The $X_t$ term is the $t$th observation of a causal time-series. Although $X_t$ is ordinarily a binary variable coded for the presence or absence of an intervention, it can also be a purely stochastic series. In either case, the model is constructed by a set of rules that allow for the solution,

    (4)

    Because $a_t$ has white noise properties, the solved model satisfies the assumptions of all common tests of statistical significance.
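As a reading aid, the structure described in this passage can be sketched in LaTeX. The symbols $f(\cdot)$ and $h(\cdot)$ follow the notation that appears later in the chapter. The additive form, the Normal distribution, and the inverse-function notation are assumptions made for illustration and are not necessarily the authors' exact expressions for Equations 2 to 4.

```latex
% Hedged sketch: the additive form and Normality are assumptions.
\begin{align*}
Y_t &= f(a_t) + h(X_t) && \text{noise component plus causal component} \\
a_t &\sim \text{iid Normal}(0, \sigma_a^2) && \text{white noise innovations} \\
a_t &= f^{-1}\!\left[\, Y_t - h(X_t) \,\right] && \text{model solved for the innovations}
\end{align*}
```

Read this way, a significance test is applied to the recovered innovations $a_t$, which is why their white noise property matters.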

    DOI: 10.1037/13620-032

    THREE DESIGN CATEGORIES

    We elaborate on the specific forms of the general model and on the set of rules for building models at a later point. For present purposes, the general model allows three design variations: (a) descriptive time-series designs, (b) correlational time-series designs, and (c) experimental or quasi-experimental time-series designs.

    Descriptive Designs

    The earliest time-series designs used observed cycles or trends in a series to infer the nature of a latent causal mechanism. Historical examples include Wolf's (1848; Yule, 1927) investigation of sunspot activity and Elton's (1924; Elton & Nicholson, 1942) investigation of lynx populations. In both cases, time-series analyses revealed cycles or trends that corroborated substantive interpretations of the phenomenon.

    Kroeber's (1919; Richardson & Kroeber, 1940) analyses of cultural change illustrate the poor fit of descriptive time-series designs to many social and behavioral phenomena. Kroeber (1944) hypothesized that women's fashions change in response to political and economic variables. During stable periods of peace and prosperity, fashions changed slowly; during wars, revolutions, and depressions, fashions changed rapidly. Because political and economic cataclysms were thought to recur in long historical cycles, Kroeber tested his hypothesis by searching for the same cycles in women's fashions.

    APA Handbook of Research Methods in Psychology: Vol. 2. Research Designs, H. Cooper (Editor-in-Chief). Copyright 2012 by the American Psychological Association. All rights reserved.

    FIGURE 32.1. Annual skirt widths, 1787 to 1936. (Vertical axis: skirt width index; horizontal axis: year.)

    Figure 32.1 plots one of Richardson and Kroeber's (1940) annual fashion time series. Although Kroeber believed that the long cycles in this series corroborated his cultural unsettlement theory, wholly random processes can generate identical patterns. Whereas most time-series designs treat $f(a_t)$ as a nuisance function whose sole purpose is to control the threats to statistical conclusion validity posed by cycles and trends, the descriptive time-series design infers substantive explanations from $f(a_t)$. Although the statistical models and methods developed for the analysis of descriptive designs are applied for exploratory purposes (see Mills, 1991), they are currently not widely used for null hypothesis tests.
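The claim that wholly random processes can mimic long cycles can be illustrated with a small simulation. The sketch below is not an analysis of the fashion data. It assumes Gaussian shocks, a series length of 150 (roughly the span of Figure 32.1), and a hypothetical helper `mean_crossings` that counts how often a series crosses its own mean. A random walk built from the same shocks crosses its mean far less often than white noise does, which is what produces long, cycle-like swings.

```python
import random

def mean_crossings(series):
    """Count how often a series crosses its own mean."""
    mean = sum(series) / len(series)
    above = [x > mean for x in series]
    return sum(1 for prev, cur in zip(above, above[1:]) if prev != cur)

def simulate(n_obs=150, n_reps=500, seed=1):
    """Average mean-crossings for white noise and for a random walk."""
    rng = random.Random(seed)
    noise_total = 0
    walk_total = 0
    for _ in range(n_reps):
        shocks = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        walk = []
        level = 0.0
        for a in shocks:
            level += a  # random walk: cumulative sum of the shocks
            walk.append(level)
        noise_total += mean_crossings(shocks)
        walk_total += mean_crossings(walk)
    return noise_total / n_reps, walk_total / n_reps

noise_avg, walk_avg = simulate()
print(f"white noise: {noise_avg:.1f} crossings per series")
print(f"random walk: {walk_avg:.1f} crossings per series")
```

Few crossings mean long excursions above and below the mean, so a purely random series can look as if it were driven by slow historical cycles.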

    Correlational Designs

    A second type of time-series design attempts to infer a causal relationship between two series from their covariance. Historical examples include Chree's (1913) analyses of the temporal correlation between sunspot activity and terrestrial magnetism and Beveridge's (1922) analyses of the temporal correlation between rainfall and wheat prices. The validity of correlational inferences rests heavily on theory. When theory can specify a single causal effect operating at discrete lags, as in these natural science examples, correlational designs support unambiguous causal interpretations. Lacking theoretical specification, however, correlational designs do not allow strong causal inferences.

    Analyses of the temporal correlation between lynchings and cotton prices by Hovland and Sears (1940) illustrate the inferential problem. To test the frustration-aggression hypothesis of Dollard et al. (1939), Hovland and Sears estimated a Pearson product-moment correlation coefficient from the annual time-series plotted in Figure 32.2. Assuming that the correlation would be zero in the absence of a causal relationship, Hovland and Sears interpreted the statistically significant estimate as corroborating evidence. Because of common stochastic time-series properties, however, especially trend, causally independent series will be correlated. Controlling for trend, Hepworth and West (1988) reported a small, significant correlation between the series but warned against causal interpretations. The correlation is an artifact of the war years 1914 to 1918, when the demand for cotton and the civilian population moved in opposite directions (McCleary, 2000). If the war years are excluded, the correlation is not statistically significant.
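The point that causally independent series will be correlated when they share stochastic properties such as trend can be checked by simulation. The sketch below uses simulated data only, not the lynching or cotton-price series. It assumes two independent Gaussian random walks of 45 observations each (the length of an 1886 to 1930 annual series) and compares the average absolute Pearson correlation in levels with the same quantity after first-differencing.

```python
import random

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def random_walk(rng, n_obs):
    """Cumulative sum of independent standard Normal shocks."""
    level, out = 0.0, []
    for _ in range(n_obs):
        level += rng.gauss(0.0, 1.0)
        out.append(level)
    return out

def first_difference(series):
    """Return Y_t - Y_{t-1} for consecutive observations."""
    return [b - a for a, b in zip(series, series[1:])]

def mean_abs_correlation(n_obs=45, n_reps=1000, seed=2):
    """Average |r| between independent walks, in levels and in differences."""
    rng = random.Random(seed)
    levels, diffs = [], []
    for _ in range(n_reps):
        x = random_walk(rng, n_obs)
        y = random_walk(rng, n_obs)  # generated independently of x
        levels.append(abs(pearson_r(x, y)))
        diffs.append(abs(pearson_r(first_difference(x), first_difference(y))))
    return sum(levels) / n_reps, sum(diffs) / n_reps

levels_r, diffs_r = mean_abs_correlation()
print(f"mean |r| in levels:      {levels_r:.2f}")
print(f"mean |r| in differences: {diffs_r:.2f}")
```

The level correlations are typically sizable even though the two series are unrelated by construction, whereas the differenced series show only the small correlations expected from sampling error.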

    Where theory supports strong specification, correlational time-series designs continue to be used. Other than limited areas in economics and psychology, however, social theories will not support the required specification. Even in these areas, causal inferences require the narrow definition of Granger causality (Granger, 1969) to rule out plausible alternative interpretations.

    FIGURE 32.2. Annual cotton prices and lynchings, 1886 to 1930. (Plotted series: cotton prices and lynchings; horizontal axis: year.)

    Quasi-Experimental Designs

    The third type of time-series design infers the latent causal effect of a temporally discrete intervention or treatment from discontinuities or interruptions in a time series. Campbell and Stanley (1963, pp. 37-43) called this design the "time-series experiment," and its use is currently the major application of time-series data for causal inference. Historical examples of the general approach include investigations of workplace interventions on health and productivity by the British Industrial Fatigue Research Board (Florence, 1923) and by the Hawthorne experiments (Roethlisberger & Dickson, 1939). Fisher's (1921) analyses of agricultural interventions on crop yields also relied on variants of the time-series quasi-experiment.

    In the simplest case of the design, a discrete intervention breaks a time-series into pre- and postintervention segments of $N_{pre}$ and $N_{post}$ observations. For pre- and postintervention means, $\mu_{pre}$ and $\mu_{post}$, analysis of the quasi-experiment tests the null hypothesis

    $$H_0\colon \omega = 0, \quad \text{where } \omega = \mu_{post} - \mu_{pre}; \qquad H_A\colon \omega \neq 0. \qquad (5)$$

    Rejecting $H_0$, $H_A$ attributes $\omega$ to the intervention. In practice, however, treatment effects are almost always more complex than the simple change in level implied by this null hypothesis.

    Figure 32.3 illustrates a typical example of the time-series quasi-experimental design. The data are 50 daily self-injurious behavior counts for an institutionalized patient (McCleary, Touchette, Taylor, & Barron, 1999). Beginning on the 26th day, the patient is treated with Naltrexone, an opiate-blocker. The plotted time series leaves the visual impression that the opiate-blocker has reduced the incidence of self-injurious behavior. Indeed, the difference in means for the 25 pre- and 25 postintervention days amounts to a 42% reduction. The value of $F = 32.56$ associated with this difference occurs by chance with $p < .0001$, and the null hypothesis can be rejected in favor of the alternative: The medication is an effective treatment for self-injurious behavior.

    FIGURE 32.3. Self-injurious behavior incidents for a single institutionalized patient before and after an opiate-blocker regimen. (Vertical axis: self-injurious behavior incidents; horizontal axis: 25 preintervention days followed by 25 postintervention days.)

    This conclusion ignores a serious threat to validity. Whereas the null hypothesis test assumes that the daily counts are independent, in fact, the count on any given day is predictable from the count on the preceding day. The visual evidence of Figure 32.3 leaves the unambiguous impression that the opiate-blocker reduced the rate of self-injurious behavior for this patient. Visual evidence can be deceiving, of course, and that is why statistical hypothesis tests are conducted. Because of the day-to-day dependence of these data, however, the value of $F = 32.56$ cannot be interpreted.
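Why day-to-day dependence makes the test statistic uninterpretable can be illustrated by simulation. The sketch below uses simulated data, not the patient's counts. It assumes a first-order autoregressive process with a hypothetical coefficient of 0.5, no intervention effect at all, and the same 25 + 25 design. For two groups the $F$ statistic is the square of the pooled $t$ statistic, so the sketch works with $t$ and hard-codes the two-sided 5% critical value for 48 degrees of freedom.

```python
import random
import statistics

T_CRIT = 2.0106  # two-sided 5% critical value of Student's t with 48 df

def ar1_series(rng, n_obs, phi):
    """Generate y_t = phi * y_{t-1} + a_t with standard Normal a_t."""
    # Draw the starting value from the stationary distribution.
    y = rng.gauss(0.0, 1.0) / (1.0 - phi ** 2) ** 0.5
    out = []
    for _ in range(n_obs):
        y = phi * y + rng.gauss(0.0, 1.0)
        out.append(y)
    return out

def pooled_t(pre, post):
    """Two-sample t statistic that assumes independent observations."""
    n1, n2 = len(pre), len(post)
    pooled_var = ((n1 - 1) * statistics.variance(pre)
                  + (n2 - 1) * statistics.variance(post)) / (n1 + n2 - 2)
    se = (pooled_var * (1.0 / n1 + 1.0 / n2)) ** 0.5
    return (statistics.mean(post) - statistics.mean(pre)) / se

def false_positive_rate(phi, n_reps=2000, seed=3):
    """Share of no-effect series in which the naive test rejects H0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        y = ar1_series(rng, 50, phi)
        if abs(pooled_t(y[:25], y[25:])) > T_CRIT:
            rejections += 1
    return rejections / n_reps

rate_white = false_positive_rate(phi=0.0)  # independent observations
rate_ar1 = false_positive_rate(phi=0.5)    # day-to-day dependence
print(f"rejection rate, white noise: {rate_white:.3f}")
print(f"rejection rate, AR(1):       {rate_ar1:.3f}")
```

With independent observations the rejection rate stays near the nominal 5%. With positive autocorrelation it is several times larger, so a large $F$ by itself says little about the treatment.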

    More generally, time-series experiments and quasi-experiments present many challenges to valid inferences about causal effects. A solid rationale for the design is mostly due to the work of Donald T. Campbell and his collaborators (Campbell & Stanley, 1963; Cook & Campbell, 1979; Shadish, Cook, & Campbell, 2002). Campbell and associates extensively considered threats to the design's validity and proposed ways to address them when they were plausible. A general conclusion from Campbell's work is that experimental and quasi-experimental time-series designs face fewer threats to validity than do most other nonexperimental designs. This conclusion is largely responsible for the current popularity of time-series research.


    FOUR TYPES OF VALIDITY

    Campbell and Stanley (1963; Campbell, 1963) divided the empirical threats to valid inference into two categories. Threats to internal validity addressed the question, "Did in fact the experimental treatments make a difference in this specific experimental instance?" Threats to external validity addressed the question, "To what populations, settings, treatment variables, and measurement variables can this effect be generalized?"

    Recognizing the incompleteness of the dichotomy, Cook and Campbell (1979) added two additional categories. Threats to statistical conclusion validity addressed questions of confidence and power that had previously been included implicitly as threats to internal validity. Threats to construct validity addressed questions of confounding that had previously been included implicitly as threats to external validity. Shadish et al. (2002) used the same four categories but expanded the list of threats to valid inference in each category.

    Table 32.1 lists the threats to validity that are relevant to time-series studies. Time-series designs differ from other approaches in that common threats to validity are controlled by a statistical model. This applies not only to quasi-experimental designs but also to designs that would be considered true experiments. When a treatment or intervention can be manipulated experimentally, presumably to control threats to internal validity, the manipulation raises threats to construct and external validity that can be controlled by the statistical model.

    TABLE 32.1
    Four Types of Validity From Shadish, Cook, and Campbell (2002)

    Type of validity                  Threat to validity
    Statistical conclusion validity   Low statistical power
                                      Violated assumptions of the test
    Internal validity                 History
                                      Maturation
                                      Regression artifacts
                                      Instrumentation
    Construct validity                Reactivity
                                      Novelty and disruption
    External validity                 Interaction over treatment variations
                                      Interaction with settings

    Trade-offs among the four validities are implicit in our tradition. The salient flaw in the Campbell and Stanley (1963) two-category system was its implication that all eight threats to internal validity and all four threats to external validity could be controlled by design. Campbell and his colleagues proposed the four-validity system in large part to correct this misconception. The trade-off among validities is a crucial consideration for time-series studies. Although threats to internal validity can be controlled, in principle, by experimental manipulation of the treatment, in practice, experimental manipulation raises near-fatal threats to construct and external validity. Accordingly, we analyze time-series experiments as if they were quasi-experiments.

    Statistical Conclusion Validity

    Shadish et al. (2002) identified nine threats to statistical conclusion validity or "reasons why researchers may be wrong in drawing valid inferences about the existence and size of covariation between two variables" (p. 45). Although the consequences of any particular threat will vary across settings, the threats to statistical conclusion validity fall neatly into categories involving misstatements of the Type I and Type II error rates.1

    Type I errors (also known as $\alpha$-errors or false-positive errors) occur when a true $H_0$ is mistakenly rejected; that is, when the intervention has no effect but the test statistic suggests otherwise. Under a convention established by Fisher (1925), the Type I error rate is fixed at $\alpha \le 0.05$, corresponding to a confidence level of at least 0.95 (or 95% confidence).

    1The most comprehensive authority on this topic is Kendall and Stuart (1979, Chapter 22). Cohen (1988) and Lipsey (1990) provided more accessible

    Type II errors (also known as $\beta$-errors or false-negative errors) occur when a false $H_0$ is mistakenly accepted; that is, when the intervention has an effect but the test statistic suggests otherwise. Following Neyman and Pearson (1928), the conventional Type II error rate is fixed at $\beta \le 0.2$, corresponding to a statistical power level of at least 0.8 (or 80% statistical power). Whereas the Type I error rate is set a priori, the Type II error rate is conditioned on the Type I error rate and a likely effect size.2
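How the Type II error rate depends on the Type I error rate and a likely effect size can be made concrete with a small calculation. The sketch below is a textbook Normal approximation, not a procedure taken from the chapter. It assumes independent observations with known variance, equal pre- and postintervention segment lengths, and a hypothetical function name `two_sided_power`. The effect size is the mean difference in standard deviation units.

```python
from statistics import NormalDist

def two_sided_power(effect_size, n_per_segment, alpha=0.05):
    """Approximate power of a two-sided z test for a difference in means.

    effect_size is the mean difference in standard deviation units.
    Observations are assumed independent with known variance.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    # Standardized difference divided by its standard error.
    shift = effect_size * (n_per_segment / 2.0) ** 0.5
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

type_i_rate = two_sided_power(0.0, 25)  # equals alpha when H0 is true
power = two_sided_power(0.8, 25)        # large effect, 25 + 25 observations
type_ii_rate = 1.0 - power
print(f"Type I rate:  {type_i_rate:.3f}")
print(f"power:        {power:.3f}")
print(f"Type II rate: {type_ii_rate:.3f}")
```

Under these assumptions a large effect of 0.8 standard deviations just reaches the conventional 80% power with 25 observations per segment. Smaller effects or autocorrelated data would require longer series.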

    For both Type I and Type II errors, uncontrolled threats to statistical conclusion validity distort the nominal values of $\alpha$ and $\beta$, leading to invalid inferences. The threats are controlled by a statistical model. The most widely used models for that purpose are the AutoRegressive Integrated Moving Average (ARIMA) models of Box and Jenkins (1970). Under $H_0$, an ARIMA model is written as

    FIGURE 32.4. Annual property crimes for cities with commercial television broadcast service in 1951 and for cities with television in 1955. (Vertical axis: property crimes; horizontal axis: year, 1945 to 1960.)

    Internal Validity

    Under some circumstances, all nine of the threats to internal validity identified by Shadish et al. (2002, p. 55) might apply to the time-series quasi-experiment. Typically, however, only four threats are plausible enough to pose serious difficulties. History and maturation, which arise from the use of multiple temporal observations, can threaten virtually any application of the design. Instrumentation and regression can also be problems, but only for interventions that involve advance planning. For unplanned interventions, which Campbell (1963, pp. 229-230) called "natural experiments," these threats are much less realistic.

    The largest, most obvious, and most frequent threats to internal validity involve the operation of history. Historical threats come from changes in a time series that occur coincidently with an intervention but are due to other causes. A standard design-based approach to making these threats less plausible is to analyze one or more comparison series. The comparisons can take many forms, and a careful choice can substantially narrow the scope within which history can operate. An analysis might consider no-treatment control series that the intervention should not have influenced, for example, or study multiple periods during which the intervention was and was not in operation. A consistent pattern of results across different variables or time periods reduces the plausibility of historical threats and helps support the existence of an intervention impact.


    An illustration of the effective use of comparison series comes from Hennigan et al. (1982), who studied changes in property crime following the introduction of commercial television. In 34 early cities, broadcasting began in 1950, whereas in 34 late cities, television was not available until 1954. The time series in Figure 32.4 show the annual log-transformed levels of property crimes for both groups of cities.

    History is a plausible threat to inferences about the effects of television when studying the early or late cities alone. Other variables also changed during the years in which each group adopted television, and many of these might explain a change in crime. The Korean War was under way in 1951, for example, and an economic recession began in 1955. More generally, criminological theories suggest multiple factors that might influence property crimes and that could have changed around the time that broadcasting began in either of the groups.

    Considering both time series together makes the changes in crime much more difficult to dismiss as artifacts of history. To be plausible, a historical explanation would have to account for increases that occurred at two different time points but affected only one group of cities at each. Although not impossible, constructing such an explanation would be a difficult enterprise.

    In contrast to history, methods for addressing the other three threats to internal validity do not heavily rely on design variations. Maturation, which like history is plausible in all applications of the time-series quasi-experiment, requires a statistical modeling approach. Maturation threats appear as trends in the data and are due to developmental processes that are independent of the intervention. Time-series data often display such trends, and trending patterns are especially common in the long series that are most desirable for analysis. Maturational trends are a problem for inference because they can easily produce false evidence of an intervention effect.

    FIGURE 32.5. Weekly productivity measures for the first bank wiring room (Hawthorne) experiment. Estimated interventions are plotted against the series. (Vertical axis: productivity; marked events: first, second, and third "punishment" effects.)

    Figure 32.5 illustrates one of the best-known examples of the maturation threat, the so-called Hawthorne effect. The data consist of weekly productivity levels for a group of five machine operators in the bank wiring room of the Western Electric Company's Hawthorne Works. Researchers manipulated daily rest breaks during the study period and claimed after a visual inspection of the series that the breaks helped increase productivity. Questioning this conclusion, Franke and Kaul (1978) argued that any productivity increases were instead due solely to fear generated by the imposition of three "punishment" regimes. Their statistical analysis, which included interventions at the beginning of each regime, supported this hypothesis.

    Maturation provides an explanation that challenges the interpretations of both Franke and Kaul (1978) and the original researchers. Figure 32.5 shows the presence of a systematic trend in productivity that could easily have resulted from increases in worker experience. The trend closely follows the patterns that each set of researchers observed, and this makes maturation a plausible alternative explanation of their findings. A reanalysis that controlled for the trend in fact found a small effect of the rest breaks and no effect at all for the punishment regimes (McCleary, 2000).

    Unlike other threats to internal validity, which are ordinarily handled by design, maturation threats are controlled by the statistical model. Under $H_0$, the causal effect of $X_t$ vanishes, leaving a simple model that represents the time series as a weighted sum of past and present white noise innovations:

    (11)

    Proper solutions of this model are guaranteed by constraining the parameters of $\phi(B)$ and $\theta(B)$ to the bounds of stationarity-invertibility.3
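A model of this kind is conventionally written in backshift notation. The sketch below uses the standard Box and Jenkins (1970) form, which may differ in detail from the authors' Equation 11. The $\psi$ weights and the first-order bounds are included only as the simplest illustration.

```latex
% Hedged sketch in conventional Box-Jenkins notation (an assumption).
\begin{align*}
\phi(B)\, Y_t &= \theta(B)\, a_t, \qquad B^k Y_t = Y_{t-k} \\
\phi(B) &= 1 - \phi_1 B - \cdots - \phi_p B^p, \qquad
\theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q \\
Y_t &= \phi(B)^{-1}\theta(B)\, a_t
     = a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots \\
\text{first-order bounds:}\quad
& -1 < \phi_1 < 1 \ \ \text{(stationarity)}, \qquad
  -1 < \theta_1 < 1 \ \ \text{(invertibility)}
\end{align*}
```

The third line is the "weighted sum of past and present white noise innovations" described in the text. The weights $\psi_j$ die out only when the parameters respect the bounds.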

    This assumes a stationary time series, however, and this assumption is unwarranted in many instances. Kroeber's fashion time series (Figure 32.1), for example, shows the drifting pattern characteristic of a nonstationary random walk process. Segments of Kroeber's series are indistinguishable from the steady secular trend that poses the maturation threat in the Hawthorne experiment (Figure 32.5).

    3Although the bounds of stationarity and invertibility are identical, they are distinct properties of a time series. See Box and Jenkins (1970, pp. 53-54). All modern time-series software packages report parameter estimates that are constrained to the stationarity-invertibility bounds.

    FIGURE 32.6. Weekly productivity measures from Figure 32.5, differenced. (Vertical axis: productivity differences.)

    Although nonstationary time series are commonly encountered in the social sciences, most are stationary in first-differences. Using $\nabla$ to represent the differencing operation

    $$\nabla Y_t = Y_t - Y_{t-1} \qquad (12)$$

    a nonstationary time series can be modeled as

    (13)

    or

    (14)

    Figure 32.6 plots the first-differences of the Hawthorne experiment time series. The differenced series fluctuates around a constant level, and maturation is no longer a plausible threat to internal validity.4
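The effect of differencing on a trending series can be sketched with simulated data. The example below does not use the Hawthorne series. It assumes a random walk with a hypothetical drift of 0.5 per period and Gaussian shocks. In levels, the second half of the series sits far above the first half, which could be mistaken for an intervention effect at the midpoint. In first-differences the two halves have nearly the same mean, and the overall mean of the differences estimates the trend.

```python
import random
import statistics

def trending_series(n_obs=200, drift=0.5, seed=4):
    """Random walk with drift: Y_t = Y_{t-1} + drift + a_t."""
    rng = random.Random(seed)
    level, out = 50.0, []
    for _ in range(n_obs):
        level += drift + rng.gauss(0.0, 1.0)
        out.append(level)
    return out

def first_difference(series):
    """Return Y_t - Y_{t-1} for t = 2, ..., N."""
    return [b - a for a, b in zip(series, series[1:])]

y = trending_series()
dy = first_difference(y)

# Compare halves in levels: the trend looks like a large shift.
half = len(y) // 2
level_shift = statistics.mean(y[half:]) - statistics.mean(y[:half])

# Compare halves in differences: the apparent shift disappears.
dhalf = len(dy) // 2
diff_shift = statistics.mean(dy[dhalf:]) - statistics.mean(dy[:dhalf])

# The mean of the differenced series estimates the per-period trend.
trend_estimate = statistics.mean(dy)

print(f"shift between halves, levels:      {level_shift:.2f}")
print(f"shift between halves, differences: {diff_shift:.2f}")
print(f"estimated trend per period:        {trend_estimate:.2f}")
```

The mean of the differences equals the total change in the series divided by the number of differences, which is why it is read as the secular trend.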

    In addition to controlling maturation threats, differencing removes the confounding effects of cross-sectional fixed causes of $Y_t$. To illustrate, suppose that $Y_t$ is the U.S. unemployment rate and that $W$ represents the causes of $Y_t$ that vary cross-sectionally but that are constant over short periods of time. When consecutive observations are differenced,

    $$Y_t - Y_{t-1} = f(a_t) - f(a_{t-1}) + W - W \qquad (15)$$

    $$\nabla Y_t = f(a_t) - f(a_{t-1}), \qquad (16)$$

    the confounding effects of $W$ vanish from the model. This property of the difference equation model is the motivation for the use of fixed-effects panel models in economics (Greene, 2000).

    Like the maturation threat to internal validity, the regression threat is controlled by the statistical model. Whereas the maturation threat is plausible in both time-series experiments and quasi-experiments, however, the regression threat is plausible only in quasi-experiments involving planned interventions. The regression threat arises whenever the intervention is a reaction to an unusually high or low level of the time series. Regardless of the intervention's effects, regression to the mean is likely to produce an increase or decrease in the series level.

    In one of the earliest formal applications of the time-series quasi-experiment, Campbell and Ross (1968) studied the impact on highway fatalities of a 1955 speeding crackdown in Connecticut. Traffic deaths dropped significantly after the crackdown began, but Campbell and Ross showed that the decrease was largely attributable to a regression artifact. Fatalities were unusually high in 1955, and the crackdown was a response intended to reduce them. A drop was then predictable as deaths regressed back toward their historically average levels.
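The regression artifact can be reproduced with simulated data in which the intervention has no effect whatsoever. The sketch below is not a reanalysis of the Connecticut data. It assumes a stationary annual series with a hypothetical mean of 300 and standard deviation of 25, and it triggers a "crackdown" only when the most recent year is the worst in a 10-year record.

```python
import random
import statistics

def regression_artifact(n_series=5000, n_years=10, seed=5):
    """Mean change after an unusually high final year, with no true effect."""
    rng = random.Random(seed)
    changes = []
    for _ in range(n_series):
        # Stationary series: values fluctuate around a constant level.
        history = [rng.gauss(300.0, 25.0) for _ in range(n_years)]
        following_year = rng.gauss(300.0, 25.0)
        # The intervention is a reaction to the worst year on record.
        if history[-1] == max(history):
            changes.append(following_year - history[-1])
    return statistics.mean(changes), len(changes)

mean_change, n_triggered = regression_artifact()
print(f"series with a triggered intervention: {n_triggered}")
print(f"mean change in the following year:    {mean_change:.1f}")
```

The following year is lower on average even though nothing was done, because an extreme year is usually followed by a more typical one.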

    4The mean of the first-differenced time series is interpreted as the secular trend of $Y_t$.

    Regression becomes a less plausible threat to internal validity as the length of a time series increases. Introducing the intervention at an unusually high (or low) point in the series will create a transient bias in estimates of the pre- and postintervention means. As the pre- and postintervention series grow longer, the bias becomes proportionately smaller, and eventually it reaches zero. The recommendation to use a total series length of 50 or more observations for the time-series quasi-experiment is in part intended to reduce the plausibility of regression threats (McCleary & Hay, 1980; McDowall et al., 1980).

    FIGURE 32.7. Monthly burglaries for Tucson, 1975 to 1981. During a 24-month period, burglaries are assigned to detectives for investigation. (Vertical axis: burglaries; horizontal axis: year.)

    Instrumentation is also a plausible threat to planned interventions because new methods for measuring the outcome variable often accompany the introduction of other changes. Figure 32.7 (from McCleary, Nienstedt, & Erven, 1982) presents a monthly plot of Tucson burglary counts. For 2 years beginning in 1979, detectives replaced uniformed officers in performing burglary investigations. In 1981, the investigative responsibility was returned to the uniformed officers. Consistent with the notion that detectives are more proficient in preventing burglaries, the counts were lower when they handled the cases.

    FIGURE 32.8. Daily disruptions caused by talking out before and after a behavioral intervention. (Vertical axis: daily "talking out" incidents; horizontal axis: days before and after the intervention.)

    Although the switching intervention feature of the design effectively rules out history, all other internal validity threats are still plausible. Additional analysis showed that detectives and uniformed officers did not keep records in the same way, and this difference reduced the number of burglaries that the detectives recorded. Allowing for the influence of the instrumentation change, burglary counts did not vary significantly with the type of officer responsible for the investigations.

    Construct Validity

    Shadish et al. (2002) identified 14 "reasons why inferences about the constructs that characterize study operations may be incorrect" (p. 73). One of the 14 threats to construct validity, novelty and disruption, is relevant to experimental and quasi-experimental time-series designs. Regardless of whether an intervention has its intended effect, the time series is likely to react to the novelty or disruption associated with it. If the general form of the artifact is known, it can be incorporated into the statistical model.

    Figure 32.8 illustrates one aspect of this threat to construct validity. Hall et al. (1971) counted the number of talking out disruptions in a classroom for 20 consecutive days. When a behavioral intervention is implemented on the 21st day, the time series changes gradually, falling to a lower daily level of disruption. If the gradual nature of the response is not taken into account, the effectiveness of the intervention is underestimated. Because gradual responses to interventions are a common feature of behavioral research, the uncontrolled threat to construct validity can have serious consequences.

    Figure 32.9 illustrates the complementary aspect of this threat to construct validity. Similar to Figure 32.3, these data are daily counts of self-injurious behavior incidents but for a different patient (McCleary et al., 1999). Instead of receiving an opiate-blocker, beginning on the 26th day, this patient received a placebo. The level of the time series dropped immediately but then, within a few days, returned to its preintervention level. Placebo effects of this sort are common in time-series experiments. Given a well-behaved time-series process and a sufficient number of postintervention observations, the threat to construct validity may be ignored. Under more realistic circumstances, however, the placebo effect must be incorporated into the analytical model.

    FIGURE 32.9. Self-injurious behavior incidents for a single institutionalized patient before and after a placebo regimen. (Vertical axis: self-injurious behavior incidents; horizontal axis: 25 preintervention days followed by 25 postintervention days.)

    Campbell and Stanley (1963, p. 43) recognized the threat to validity posed by a dynamic response to an intervention; still, their external validity assessment of the time-series quasi-experiment seemed to leave the issue open. Addressing the same threat from a modeling perspective, Box and Jenkins (1970; Box & Tiao, 1975) proposed a lagged polynomial parameterization of $h(X_t)$ that allows for hypothesis testing. The polynomial lag makes the ARIMA model inherently nonlinear, complicating the interpretation of analytic results. The polynomial lag provides a straightforward method of controlling novelty and disruption threats to construct validity, however, and has become widely accepted in the social sciences.

    Figure 32.10 shows four variations of the same general polynomial lag model. The model variations in the top row depict permanent responses to the intervention. The series may respond to the intervention instantaneously or gradually but, in either case, the response is permanent. A gradual, permanent response model seems to capture the effect of the behavioral intervention on the daily time series of disruptive talking out incidents (Figure 32.8).


    The model variations in the bottom row of Figure 32.10 depict temporary responses to the intervention. Both responses model placebo artifacts, spiking at the onset of the intervention but then decaying over time to reveal the long-run effect of the intervention or treatment. The gradual, temporary response model seems to capture the effect of a placebo intervention on the daily time series of self-injurious behavior (Figure 32.9).
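The permanent and temporary response shapes described above can be generated from a single first-order recursion. The sketch below assumes the simplest lagged form in the style of Box and Tiao (1975), $y_t = \delta y_{t-1} + \omega x_t$, where $x_t$ is either a step or a pulse. The function name `impact_response` and the particular values of $\omega$ and $\delta$ are hypothetical choices for illustration, not the parameters behind Figure 32.10.

```python
def impact_response(omega, delta, n_periods, kind):
    """Response of y_t = delta * y_{t-1} + omega * x_t to a step or pulse.

    kind="step":  x_t = 1 from the intervention onward (permanent input).
    kind="pulse": x_t = 1 in the first intervention period only.
    """
    if kind not in ("step", "pulse"):
        raise ValueError("kind must be 'step' or 'pulse'")
    response, previous = [], 0.0
    for t in range(n_periods):
        x = 1.0 if (kind == "step" or t == 0) else 0.0
        previous = delta * previous + omega * x
        response.append(previous)
    return response

# Permanent responses: immediate (delta = 0) and gradual (0 < delta < 1).
abrupt_permanent = impact_response(1.0, 0.0, 6, "step")
gradual_permanent = impact_response(1.0, 0.5, 6, "step")

# Temporary responses: a spike that decays quickly or slowly.
fast_decay = impact_response(1.0, 0.25, 6, "pulse")
slow_decay = impact_response(1.0, 0.5, 6, "pulse")

print(gradual_permanent)
print(slow_decay)
```

With a step input the response approaches $\omega / (1 - \delta)$, so ignoring a gradual onset understates the long-run effect. With a pulse input the response returns to zero at a rate set by $\delta$.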

    Permanent and temporary responses can be combined in a model. Figure 32.11 shows a time series of divorce rates for Australia before and after the 1975 Family Law Act, which allowed for no-fault divorce. Opponents of the act argued that its no-fault provisions would lead to an increase in divorces. An evaluation of the act by the Australian government found that although divorces did rise following the act, the divorce rate fell back to its pre-1975 level after 3 years. The fact that post-1975 divorce rates were higher was attributed to secular trend.

    Reanalyzing these data, McCleary and Riggs (1982) hypothesized a complex response to the act, realized as the sum of a permanent and a temporary increase in divorce. The temporary spike in divorces decayed rapidly in the years immediately following 1975. Divorces never returned to their pre-1975 rates, however, and instead stabilized at a new higher level.
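A compound response of this kind can be sketched by adding a step term and a decaying pulse term. The symbols $\omega_1$, $\omega_2$, $\delta$, $S_t$, $P_t$, and the intervention time $T$ are illustrative assumptions in the style of Box and Tiao (1975), not the parameterization reported by McCleary and Riggs (1982).

```latex
% Hedged sketch: symbols are illustrative assumptions.
\begin{align*}
h(X_t) &= \omega_1 S_t + \frac{\omega_2}{1 - \delta B}\, P_t,
  \qquad 0 \le \delta < 1 \\
S_t &= \begin{cases} 0 & t < T \\ 1 & t \ge T \end{cases}
\qquad
P_t = \begin{cases} 1 & t = T \\ 0 & t \ne T \end{cases} \\
h(X_{T+k}) &= \omega_1 + \omega_2\, \delta^{k}, \quad k \ge 0
\qquad\Longrightarrow\qquad
\lim_{k \to \infty} h(X_{T+k}) = \omega_1
\end{align*}
```

Under this form the series jumps by $\omega_1 + \omega_2$ at the intervention and then settles at the permanent level $\omega_1$ as the temporary component decays.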

    FIGURE 32.10. Model responses to an intervention. The top row illustrates permanent response patterns. The bottom row illustrates temporary response patterns. (Vertical axes: percentage of effect.)

    FIGURE 32.11. Australian divorces before and after the 1975 Family Law Act. (Vertical axis: divorces; horizontal axis: year.)

    Whether or not they have permanent effects, new laws often have temporary effects that are well modeled as decaying spikes. Failure to allow for these temporary effects can lead to invalid inferences. At the individual level, placebo artifacts pose an analogous threat to construct validity. Although these threats are easily controlled with an explicit complex response model, by allowing the possibility of several responses, the model raises a potential threat to statistical conclusion validity: fishing. To control the fishing threat, the complex response model must be fully specified before any hypothesis test.

    Finally, we return to the trade-off implicit in the four-validity system. In principle, all nine threats to internal validity can be controlled by manipulating the intervention or treatment experimentally. Internal validity is bought at the expense of construct validity, however, which may be more threatening in single-subject designs. Although the opiate-blocker regimen appears to reduce self-injurious behavior, implementation of the regimen provokes a week-long reaction to the novelty and disruption. Because none of the common threats to internal validity seem plausible, trading construct for internal validity may be unwarranted.

    External Validity

    A time-series quasi-experiment typically considers an intervention's influence on only one series, and this makes it highly vulnerable to external validity threats. External validity considers whether findings hold "over variations in persons, settings, treatments, and outcomes" (Shadish et al., 2002, p. 83), and threats to it are always plausible when analyzing a single series. Ruling out external validity threats necessarily requires replicating the quasi-experiment over a diverse set of conditions.

    The evaluation research literature shows many cases in which effect estimates exist to support every possible conclusion about a program's impact. Evaluations of gun control policies, for example, include numerous instances of positive, negative, and null effect estimates (Reiss & Roth, 1993, pp. 255-287). In these situations, the variance of the effects across replications can be more informative than is a single point estimate of the average effect.

    If several quasi-experimental replications exist, they allow external validity to be assessed in one of two ways. First, the set of individual time series can be assembled to form a single vector series, $Y_t$:

    (17)

    A statistical analysis can then take advantage of the variation in the replications to obtain estimates of both the average impact and its expected variability (e.g., McGaw & McCleary, 1985). Second, an analysis can decompose the set of individual impact estimates,

    (18)

    into components associated with various external validity threats.

    The second approach is a restricted case of the first, and in theory, it is less desirable. Still, the first and more general model makes highly demanding assumptions, and time series often do not conform to them. Because the second approach places fewer requirements on the data, it is generally more practical to apply. Statistical models for the second approach come from meta-analysis and divide the effect variance into components caused by the setting, intervention, and other potential threats to external validity. McDowall, Loftin, and Wiersema (1992) used this approach to estimate the overall impact and variability of sentencing laws, and McCleary (2000) used it to combine estimates from the Hawthorne experiments.
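    The idea of separating an average impact from the variability of impacts across replications can be sketched with a simple random-effects calculation. The sketch below assumes the DerSimonian-Laird moment estimator, which is one common meta-analytic choice and not necessarily the model used in the studies cited above. The function name, the impact estimates, and their sampling variances are hypothetical and invented for illustration.

```python
def random_effects_summary(estimates, variances):
    """Pool replicated impact estimates with a DerSimonian-Laird random-effects model.

    estimates: impact estimate from each replication
    variances: sampling variance of each estimate
    Returns (pooled_mean, pooled_se, tau2, q), where tau2 is the estimated
    between-replication variance and q is the homogeneity statistic.
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    fixed_mean = sum(wi * e for wi, e in zip(w, estimates)) / sum(w)
    q = sum(wi * (e - fixed_mean) ** 2 for wi, e in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                    # truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled_mean = sum(wi * e for wi, e in zip(w_star, estimates)) / sum(w_star)
    pooled_se = (1.0 / sum(w_star)) ** 0.5
    return pooled_mean, pooled_se, tau2, q


# Hypothetical impact estimates from four replications, each with sampling variance 1.
estimates = [-4.0, -1.0, 0.5, -2.5]
variances = [1.0, 1.0, 1.0, 1.0]
pooled_mean, pooled_se, tau2, q = random_effects_summary(estimates, variances)
print(f"average impact = {pooled_mean:.2f} (SE {pooled_se:.2f})")
print(f"between-replication variance = {tau2:.2f}, Q = {q:.2f}")
```

    A between-replication variance well above zero indicates that the impact differs across settings, which is the situation in which the spread of effects can be more informative than the average.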

    DYNAMIC INTERVENTION ANALYSES

    A salient advantage of time-series designs over other before-after designs is the capability of modeling dynamic responses to the intervention. Proper specification of a dynamic response model, such as those shown in Figure 32.10, requires a parsimonious theory of the response. One theory that has proved useful in psychological research restricts the response to one of four types defined by dichotomizing onset and duration. The response may be abrupt or gradual in onset and permanent or temporary in duration. Although general transfer function specifications (Box & Jenkins, 1970; Box & Tiao, 1975) allow a wider range of responses, the four-type theory is more realistic for psychological interventions and more appropriate for testing intervention null hypotheses.

    The talking out intervention plotted in Figure 32.8 produces a typical gradual, permanent response. Though implemented on the 21st day, the full effect of the intervention is not realized immediately but, rather, accumulates gradually over several days. If a time-series model is written as the sum of stochastic and intervention components,

    $$Y_t = f(a_t) + g(X_t), \tag{19}$$

    the stochastic component plays no meaningful role in our explication of the intervention analysis. Procedures for building statistically adequate ARIMA models of $f(a_t)$ are described elsewhere (McCleary & Hay, 1980) but are of little interest here. Subtracting the stochastic component from the series,

    $$Z_t = Y_t - f(a_t), \tag{20}$$

    leaves the dynamic intervention component:

    $$Z_t = g(X_t). \tag{21}$$

    The simplest dynamic model of a gradual, permanent response to $X_t$ is

    $$Z_t = X_t\,\omega\,(1 - \delta B)^{-1}, \tag{22}$$

    where $B$ is a backward lag operator defined such that

    $$B^k X_t = X_{t-k} \quad \text{for any integer } k. \tag{23}$$

    A Taylor series expansion of the right-hand side yields the more useful series identity:

    $$Z_t = X_t\,\omega\,(1 + \delta B + \delta^2 B^2 + \delta^3 B^3 + \cdots) \tag{24}$$

    $$\phantom{Z_t} = X_t\,\omega + \delta X_{t-1}\,\omega + \delta^2 X_{t-2}\,\omega + \delta^3 X_{t-3}\,\omega + \cdots \tag{25}$$

    $X_t$ is defined as a binary variable such that

    $$X_t = 0 \quad \text{for preintervention days } t = 1, \ldots, 20 \tag{26}$$

    $$X_t = 1 \quad \text{for postintervention days } t = 21, \ldots, 40. \tag{27}$$

    Before the intervention, $X_{t \le 20} = 0$, so

    $$Z_{t \le 20} = 0. \tag{28}$$

    Thereafter, $X_{t > 20} = 1$. On the $j$th postintervention day,

    $$Z_{20+j} = \omega + \delta\omega + \delta^2\omega + \cdots + \delta^{j-1}\omega. \tag{29}$$

    Because $\delta$ is a fraction, $\delta^{j-1}$ becomes a very small number as $j$ grows.

    TABLE 32.2

    Parameter Estimates for Dynamic Intervention Models

                         Gradual response,                       Abrupt response,
                         permanent change                        temporary change
    Model                $g(X_t) = X_t\,\omega\,(1-\delta B)^{-1}$    $g(X_t) = (1-B)X_t\,\omega\,(1-\delta B)^{-1}$
    Series               Talking out (Figure 32.8)               Self-injurious behavior (Figure 32.9)
    $\hat{\omega}$       -6.79                                   -3.65
    $t(\hat{\omega})$    -6.81                                   -3.04
    $\hat{\delta}$       0.56                                    0.57
    $t(\hat{\delta})$    8.48                                    3.02

    Parameter estimates for the talking out time series are reported in the left-hand panel of Table 32.2.⁵ Substituting the estimates of $\omega$ and $\delta$ into the series identity,

    $$Z_{21} = -6.79(1) = -6.79 \tag{30}$$

    $$Z_{22} = -6.79(1 + .56) = -10.59 \tag{31}$$

    $$Z_{23} = -6.79(1 + .56 + .31) = -12.72 \tag{32}$$

    $$Z_{24} = -6.79(1 + .56 + .31 + .18) = -13.91 \tag{33}$$

    $$Z_{25} = -6.79(1 + .56 + .31 + .18 + .10) = -14.58. \tag{34}$$

    Daily changes in talking out continue throughout the postintervention segment, but reductions become smaller and smaller. Eventually, the effect will converge on

    $$-6.79 / (1 - .56) = -15.43, \tag{35}$$

    but by the end of the fifth postintervention day, 95% of this effect has been realized.

    ⁵ Parameters estimated with the SCA Statistical System (Liu, 1999).
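    The arithmetic in Equations 30 through 35 can be checked with a short script. The sketch below uses the recursion $Z_t = \delta Z_{t-1} + \omega X_t$, which is algebraically equivalent to the series identity for this model. The function name is hypothetical, and the values of `OMEGA` and `DELTA` are the estimates quoted in the text for the talking out series.

```python
OMEGA = -6.79   # estimate of omega quoted in the text for the talking out series
DELTA = 0.56    # estimate of delta quoted in the text for the talking out series


def gradual_permanent_response(omega, delta, n_pre, n_post):
    """Return Z_1..Z_N for a step input using Z_t = delta * Z_{t-1} + omega * X_t."""
    z_prev = 0.0
    series = []
    for t in range(1, n_pre + n_post + 1):
        x_t = 1.0 if t > n_pre else 0.0      # binary step: 0 before, 1 after
        z_prev = delta * z_prev + omega * x_t
        series.append(z_prev)
    return series


z = gradual_permanent_response(OMEGA, DELTA, n_pre=20, n_post=20)
asymptote = OMEGA / (1.0 - DELTA)            # limit of the geometric series

for j in range(1, 6):
    z_j = z[20 + j - 1]                      # list index of day 20 + j
    print(f"Z_{20 + j} = {z_j:7.2f}  ({z_j / asymptote:5.1%} of the asymptote)")
print(f"asymptote = {asymptote:.2f}")
```

    The share of the asymptotic effect realized by the $j$th postintervention day is $1 - \delta^j$, so with $\delta = .56$ roughly 94.5% of the effect is in place by the fifth day, in line with the 95% figure cited above.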

    The self-injurious behavior time series in Figure 32.9 presents a typical abrupt, temporary response to an intervention that, in this case, is a placebo. On the first postintervention day, this patient's rate of self-injurious behavior drops abruptly but, in subsequent days, returns to its preintervention level. The simplest dynamic model of an abrupt, temporary effect is

    $$Z_t = \nabla X_t\,\omega\,(1 - \delta B)^{-1}. \tag{36}$$

    This model has the series identity:

    $$Z_t = \nabla X_t\,\omega + \delta \nabla X_{t-1}\,\omega + \delta^2 \nabla X_{t-2}\,\omega + \delta^3 \nabla X_{t-3}\,\omega + \cdots \tag{37}$$

    Whereas $X_t$ remains on throughout the postintervention period, the first difference of $X_t$, $\nabla X_t = (1 - B)X_t$, turns on on the first postintervention day and then turns off again:

    $$\nabla X_t = 0 \quad \text{for days } t = 1, \ldots, 25 \tag{38}$$

    $$\nabla X_t = 1 \quad \text{for } t = 26 \tag{39}$$

    $$\nabla X_t = 0 \quad \text{for days } t = 27, \ldots, 50. \tag{40}$$

    Before the intervention, $\nabla X_{t \le 25} = 0$ and

    $$Z_{t \le 25} = 0. \tag{41}$$

    On the first day of the intervention, $\nabla X_{26} = 1$, so

    $$Z_{26} = \omega. \tag{42}$$

    But thereafter, $\nabla X_{t > 26} = 0$ again, and

    $$Z_{26+j} = \delta^{j}\omega. \tag{43}$$

    Again, as $\delta^{j}\omega$ approaches 0, the placebo effect decays.

    Parameter estimates for the placebo self-injurious behavior time series are reported in the right-hand column of Table 32.2. Substituting the estimates of $\omega$ and $\delta$ into the series identity,

    $$Z_{26} = -3.04, \tag{44}$$

    $$Z_{27} = -3.04(.57) = -1.73, \tag{45}$$

    $$Z_{28} = -3.04(.57)^2 = -0.99, \tag{46}$$

    $$Z_{29} = -3.04(.57)^3 = -0.56, \text{ and} \tag{47}$$

    $$Z_{30} = -3.04(.57)^4 = -0.32. \tag{48}$$

    By the end of the fifth postintervention day, 90% of the placebo effect has dissipated.
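    The decay in Equations 44 through 48 can be reproduced with the same kind of recursion, applied to the first difference of the step. The sketch below is illustrative only. The function name is hypothetical, and `OMEGA` and `DELTA` are set to the values substituted in Equations 44 through 48.

```python
OMEGA = -3.04   # value substituted for omega in Equations 44-48
DELTA = 0.57    # value substituted for delta in Equations 44-48


def abrupt_temporary_response(omega, delta, n_pre, n_post):
    """Return Z_1..Z_N for a pulse input using Z_t = delta * Z_{t-1} + omega * (X_t - X_{t-1})."""
    z_prev = 0.0
    x_prev = 0.0
    series = []
    for t in range(1, n_pre + n_post + 1):
        x_t = 1.0 if t > n_pre else 0.0
        pulse = x_t - x_prev                 # equals 1 only on the first postintervention day
        z_prev = delta * z_prev + omega * pulse
        series.append(z_prev)
        x_prev = x_t
    return series


z = abrupt_temporary_response(OMEGA, DELTA, n_pre=25, n_post=25)

for day in range(26, 31):
    z_day = z[day - 1]                       # list index of the given day
    dissipated = 1.0 - z_day / OMEGA         # share of the initial drop that has decayed
    print(f"Z_{day} = {z_day:6.2f}  ({dissipated:5.1%} dissipated)")
```

    On day 30 the remaining effect is $\delta^4 \approx .106$ of the initial drop, so about 89% has dissipated, which rounds to the 90% figure given above.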

    In either of these two dynamic models, the parameter $\delta$ determines the rate of postintervention change in the time series. Intervention null hypotheses can be devised around the value of $\delta$ to test properties of the response.

    CONCLUSION

    Time-series data and time-series designs have a long history in psychological research. Of the many uses of time-series data, causal inferences from experiments and quasi-experiments are currently their widest application. Given a reasonably long time series, balanced data, and an adequate ARIMA model, the time-series quasi-experiment is among the most useful and valid quasi-experimental designs.

    The advantages of the time-series quasi-experiment are especially apparent in the absence of naturally defined control groups. To emphasize this property, Campbell and Stanley (1963) cited the hypothetical example of a chemist who, dipping an iron bar into nitric acid, attributes the bar's loss of weight to the acid bath: "There may well have been 'control groups' of iron bars remaining on the shelf that lost no weight but the measurement and reporting of these weights would typically not be thought necessary or relevant" (p. 37).

    The design is also vulnerable to multiple threats to validity, of course, and one would normally not use it in situations in which randomized controlled trials are possible. These cases aside, the time-series quasi-experiment is a feasible and relatively strong design across a wide range of circumstances.

    References

    Beveridge, W. H. (1922). Wheat prices and rainfall in western Europe. Journal of the Royal Statistical Society, 85, 412-475. doi:10.2307/2341183

    Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. San Francisco, CA: Holden-Day.

    Box, G. E. P., & Tiao, G. C. (1975). Intervention analysis with applications to economic and environmental problems. Journal of the American Statistical Association, 70, 70-79. doi:10.2307/2285379

    Campbell, D. T. (1963). From description to experimentation: Interpreting trends as quasi-experiments. In C. W. Harris (Ed.), Problems in measuring change (pp. 212-243). Madison: University of Wisconsin Press.

    Campbell, D. T., & Ross, H. L. (1968). The Connecticut crackdown on speeding: Time series data in quasi-experimental analysis. Law and Society Review, 3, 33-53. doi:10.2307/3052794

    Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand-McNally.

    Chree, C. (1913). Some phenomena of sunspots and of terrestrial magnetism at Kew Observatory. Philosophical Transactions of the Royal Society of London, Series A, 212, 75-116. doi:10.1098/rsta.1913.0003

    Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Englewood Cliffs, NJ: Erlbaum.

    Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton-Mifflin.

    Dollard, J., Doob, L. W., Miller, N. E., Mowrer, O. H., & Sears, R. R. (1939). Frustration and aggression. New Haven, CT: Yale University Press.

    Elton, C. S. (1924). Fluctuations in the numbers of animals: Their causes and effects. Journal of Experimental Biology, 2, 119-163.

    Elton, C., & Nicholson, M. (1942). The ten-year cycle in numbers of the lynx in Canada. Journal of Animal Ecology, 11, 215-244. doi:10.2307/1358

    Fisher, R. A. (1921). Studies in crop variation: An examination of the yield of dressed grain from Broadbalk. Journal of Agricultural Science, 11, 107-135. doi:10.1017/S0021859600003750

    Fisher, R. A. (1925). Statistical methods for research workers. London, England: Oliver & Boyd.

    Florence, P. S. (1923). Recent researches in industrial fatigue. Economic Journal, 33, 185-197. doi:10.2307/2222844

    Franke, H. F., & Kaul, J. D. (1978). The Hawthorne experiments: First statistical interpretation. American Sociological Review, 43, 623-643. doi:10.2307/2094540

    Glass, G. V., Willson, V. L., & Gottman, J. M. (1975). Design and analysis of time series experiments. Boulder: Colorado Associated University Press.

    Granger, C. W. J. (1969). Investigating causal relationships by econometric models and cross-spectral methods. Econometrica, 37, 424-438. doi:10.2307/1912791

    Greene, W. H. (2000). Econometric analysis (4th ed.). Englewood Cliffs, NJ: Prentice-Hall.

    Hall, R. V., Fox, R., Willard, D., Goldsmith, L., Emerson, M., Owen, M., ... Porcia, E. (1971). The teacher as observer and experimenter in the modification of disputing and talking-out behaviors. Journal of Applied Behavior Analysis, 4, 141-149. doi:10.1901/jaba.1971.4-141

    Hennigan, K. M., Del Rosario, M. L., Heath, L., Cook, T. D., Wharton, J. D., & Calder, B. J. (1982). Impact of the introduction of television on crime in the United States: Empirical findings and theoretical implications. Journal of Personality and Social Psychology, 42, 461-477. doi:10.1037/0022-3514.42.3.461

    Hepworth, J. T., & West, S. G. (1988). Lynchings and the economy: A time-series reanalysis of Hovland and Sears. Journal of Personality and Social Psychology, 55, 239-247. doi:10.1037/0022-3514.55.2.239

    Hovland, C. I., & Sears, R. R. (1940). Minor studies of aggression IV. Correlation of lynchings with economic indices. Journal of Psychology, 9, 301-310. doi:10.1080/00223980.1940.9917696

    Kendall, M., & Stuart, A. (1979). The advanced theory of statistics (4th ed., Vol. 2). London, England: Charles Griffin.

    Kroeber, A. L. (1919). On the principle of order in civilization as exemplified by changes of fashion. American Anthropologist, 21, 235-263. doi:10.1525/aa.1919.21.3.02a00010

    Kroeber, A. L. (1944). Configurations of cultural growth. Berkeley: University of California Press.

    Lipsey, M. (1990). Design sensitivity: Statistical power for experimental research. Thousand Oaks, CA: Sage.

    Liu, L.-M. (1999). Forecasting and time series analysis using the SCA statistical system. Villa Park, IL: Scientific Computing Associates.

    McCleary, R. (2000). Evolution of the time series experiment. In L. Bickman (Ed.), Research design: Donald Campbell's legacy (pp. 215-234). Thousand Oaks, CA: Sage.

    McCleary, R., & Hay, R. A., Jr. (1980). Applied time series analysis for the social sciences. Beverly Hills, CA: Sage.

    McCleary, R., Nienstedt, B. C., & Erven, J. M. (1982). Uniform crime reports and organizational outcomes: Three time series quasi-experiments. Social Problems, 29, 361-372. doi:10.1525/sp.1982.29.4.03a00030

    McCleary, R., & Riggs, J. E. (1982). The 1975 Australian Family Law Act: A model for assessing legal impacts. New Directions for Program Evaluation, 16, 7-18.

    McCleary, R., Touchette, P., Taylor, D. V., & Barron, J. L. (1999, March). Contagious models for self-injurious behavior. Poster presentation, 32nd Annual Gatlinburg Conference on Research and Theory in Mental Retardation, Charleston, SC.

    McDowall, D., Loftin, C., & Wiersema, B. (1992). A comparative study of the preventive effects of mandatory sentencing laws for gun crimes. Journal of Criminal Law and Criminology, 83, 378-394. doi:10.2307/1143862

    McDowall, D., McCleary, R., Meidinger, E. E., & Hay, R. A., Jr. (1980). Interrupted time series analysis. Beverly Hills, CA: Sage.

    McGaw, D. B., & McCleary, R. (1985). PAC spending, electioneering, and lobbying: A vector ARIMA time series analysis. Polity, 17, 574-585. doi:10.2307/3234659

    Mills, T. C. (1991). Time series techniques for economists. New York, NY: Cambridge University Press.

    Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20A, 175-240.

    Reiss, A. J., & Roth, J. A. (1993). Understanding and preventing violence. Washington, DC: National Academies Press.

    Richardson, J., & Kroeber, A. L. (1940). Three centuries of women's dress fashions: A quantitative analysis. Anthropological Records, 5, 111-153.

    Roethlisberger, F., & Dickson, W. J. (1939). Management and the worker. Cambridge, MA: Harvard University Press.

    Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. New York, NY: Houghton Mifflin.

    Wolf, J. R. (1848). Nachrichten uber die Sternwarte in Bern [News from the observatory in Berne]. Mittheilungen der naturforschenden Gesellschaft in Bern, Nr. 114-115. ETH-Bibliothek Zurich, Rar 4201. doi:10.3931/e-rara-2007

    Yule, G. U. (1927). On a method of investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philosophical Transactions of the Royal Society of London, Series A, 226, 267-298.