Influence of Irrelevant Speech on Standardized Test Performance

Research Reports, Report 68

April 2002

Test of English as a Foreign Language™


Donald E. Powers Wendy Albertson Thomas Florek Kathy Johnson

John Malak Bill Nemceff Mark Porzuc

Donna Silvester Minhwei Wang Richard Weston Edward Winner

Aleksander Zelazny

Educational Testing Service, Princeton, New Jersey

RR-02-06

Educational Testing Service is an Equal Opportunity/Affirmative Action Employer.

Copyright © 2002 by Educational Testing Service. All rights reserved.

No part of this report may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Violators will be prosecuted in accordance with both U.S. and international copyright laws.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logos, Graduate Record Examinations, GRE, TOEFL, and the TOEFL logo are registered trademarks of Educational Testing Service. The Test of English as a Foreign Language and The Praxis Series are trademarks of Educational Testing Service.

College Board is a registered trademark of the College Entrance Examination Board.

Graduate Management Admission Test and GMAT are registered trademarks of the Graduate Management Admission Council.

Abstract

The aims of this study were to (1) estimate the likely degree of distraction caused by examinees taking a speaking test in the same testing center, as

well as its impact on test performance and (2) evaluate the prospects of reducing the distraction

to a level that is acceptable to test takers.

Study participants were volunteers (N = 171) who had previously taken the Graduate

Management Admission Test® (GMAT®), the Graduate Record Examinations® (GRE®) General

Test, or the Test of English as a Foreign Language™ (TOEFL®). They were invited to retake a

different form of the same test under either distracting conditions or standard, distraction-free

conditions. To reduce distraction, some participants used either headsets or headsets plus

masking noise.

Attempts to reduce distraction to an acceptable level were largely unsuccessful. The

impact on actual test performance, however, was slight in the GMAT sample and negligible in

both the GRE and TOEFL samples.

The conclusion was that intermingling examinees with others who are taking a speaking

test remains a concern, primarily because of strong negative perceptions by test takers. More

effective means need to be devised to reduce or control distraction.

Key words: distraction, standardized testing, speaking tests, test validity

The Test of English as a Foreign Language™ (TOEFL®) was developed in 1963 by the National Council on the Testing of English as a Foreign Language. The Council was formed through the cooperative effort of more than 30 public and private organizations concerned with testing the English proficiency of nonnative speakers of the language applying for admission to institutions in the United States. In 1965, Educational Testing Service® (ETS®) and the College Board® assumed joint responsibility for the program. In 1973, a cooperative arrangement for the operation of the program was entered into by ETS, the College Board, and the Graduate Record Examinations® (GRE®) Board. The membership of the College Board is composed of schools, colleges, school systems, and educational associations; GRE Board members are associated with graduate education.

ETS administers the TOEFL program under the general direction of a policy board that was established by, and is affiliated with, the sponsoring organizations. Members of the TOEFL Board (previously the Policy Council) represent the College Board, the GRE Board, and such institutions and agencies as graduate schools of business, junior and community colleges, nonprofit educational exchange agencies, and agencies of the United States government.

✥ ✥ ✥

A continuing program of research related to the TOEFL test is carried out under the direction of the TOEFL Committee of Examiners. Its 13 members include representatives of the TOEFL Board, and distinguished English as a second language specialists from the academic community. The Committee meets twice yearly to oversee the review and approval of proposals for test-related research and to set guidelines for the entire scope of the TOEFL research program. Members of the Committee of Examiners serve three-year terms at the invitation of the Board; the chair of the committee serves on the Board.

Because the studies are specific to the TOEFL test and the testing program, most of the actual research is conducted by ETS staff rather than by outside researchers. Many projects require the cooperation of other institutions, however, particularly those with programs in the teaching of English as a foreign or second language and applied linguistics. Representatives of such programs who are interested in participating in or conducting TOEFL-related research are invited to contact the TOEFL program office. All TOEFL research projects must undergo appropriate ETS review to ascertain that data confidentiality will be protected.

Current (2001-2002) members of the TOEFL Committee of Examiners are:

Lyle Bachman, University of California, Los Angeles
Deena Boraie, The American University of Cairo
Micheline Chalhoub-Deville (Chair), University of Iowa
Jodi Crandall (Ex Officio), University of Maryland, Baltimore
Catherine Elder, University of Auckland
Glenn Fulcher, University of Surrey
William Grabe, Northern Arizona University
Stan Jones, Carleton University
Keiko Koda, Carnegie Mellon University
Richard Luecht, University of North Carolina at Greensboro
Terry Santos, Humboldt State University
Merrill Swain, The University of Toronto
Richard Young, University of Wisconsin-Madison

To obtain more information about TOEFL programs and services, use one of the following:

Email: [email protected]

Web site: http://www.toefl.org

Acknowledgments

Many people contributed in various ways to the implementation of the study reported here.

Contributions included offering advice on the design of the study, recruiting participants,

collecting data, processing data, retrieving data from test score files, administering examinations,

preparing and reviewing the report, and many other tasks. We are grateful to all of the following

people for the parts they played:

Tedi Adams, Jane Burtis, Clara Bowman, Brent Bridgeman, Jim Butera,

Pat Carey, Regina Carola, Amy Chen, Rob Colson, Sandy Cool, Karen Copper,

Lisa Costas, Kevin Cureton, Carol Dwyer, Mike Ecker, Dan Eignor, Cindy Evans,

Lauren Florczak, Mark Grant, Fanmin Guo, Erin Herbert, Bob Hill, Rick Kling,

Anna Kubiak, Maria Leszczyszyn, Jack McDonald, Steven Nelson,

Denise Nevrincean, Cindy Nguyen, Roxanna Paez, Liane Patsula, Gary Payne,

Robert Peterla, Karen Potsko, Meg Powers, Diane Rein, Alan Rushforth,

Aleta Sclan, Ed Shea, Dave Vale, Jeff Wright, and Ruth Yoder.

Table of Contents

Introduction
    Background
    Relevant Literature
    Purpose of the Study
    Objectives

Method
    Participants
    Testing Facility
    Tasks
    Study Instruments
    Procedure
    Design
    Analysis

Results
    Reliability of Study Measures
    Study Sample
    Effect of the Experimental Manipulation
    Effort/Motivation and Test Anxiety
    Effects on Test Performance
    Examinee Reactions to Method Used to Block Sound
    Observations and Debriefings

Discussion
    Limitations

Conclusions

References

Notes

Appendixes

List of Tables

Table 1. Reliability Estimates for Self-Report Scales

Table 2. Distribution of Study Sample by Test Sample and Testing Condition

Table 3. Degree of Distraction by Test-Taker Sample and Testing Condition

Table 4. Effect of Testing Condition on Degree of Distraction

Table 5. Means and SDs of Anxiety Scale Scores by Sample and Testing Condition

Table 6. Means and SDs of Effort Scores by Sample and Test Occasion

Table 7. Test Performance by Testing Condition

Table 8. Summary of Degree of Influence of Several Extraneous Sources of Variation in Test Performance

Introduction

Background

Currently, some examinees who take ETS-administered computer-based tests do so at the

same time that other test takers take different tests in the same testing center. In the future, some

of these tests will contain speaking components. Under these conditions, it is possible that the

speech generated by some test takers will be distracting to other test takers and also, possibly,

detrimental to the test performance of the latter. It is this concern that motivated the study

reported here.

Relevant Literature

There is substantial literature on the effects of noise on both attitudes and performance.

Much of this research, however, is only partially relevant to our interests. Nonetheless, even a

cursory review suggests that our concern is a legitimate one.

Context of Prior Research. A preponderance of the research on noise has focused on

hearing and auditory performance in occupational settings, where the effects (interference and

annoyance) are well known. The effects of noise on cognitive performance, however, remain

“controversial and far from conclusive” (Mital, McGlothlin, & Faard, 1992, p. 70). Moreover,

decades of research have produced “only fragmentary support” for the belief that distracting

noises can seriously affect performance on intellectually demanding tasks (Weinstein, 1977,

p. 104). Consistent with these views is the fact that efforts to reduce noise have been mostly in

production workplaces, where researchers have shown that, even at low levels, noise can have

negative effects on concentration, productivity, working capacity, and the likelihood of accidents

(Sailer & Hassenzahl, 2000). Perhaps less focus has been on other settings, such as school and

office workplaces, because low sound levels are seldom harmful — at least in the same sense as

they are in occupational settings.

Furthermore, not all of the research on cognitive performance is entirely relevant to our

concern, in large part because much of it has taken place in laboratory settings using noise

generated by percussion-type equipment. Some studies have also been conducted in more

naturalistic settings, however. For example, Ng (2000) studied the effects of building

construction on dormitory residents, finding that behavior and attitudes (about studying) were

most affected for residents nearest the construction. Although the impact of such naturally

occurring distractions has been reasonably well studied, the effects of the kind of distraction of

most interest to us (i.e., conversational noise, or speech) have not been as well documented

(Mital, McGlothlin, & Faard, 1992).

Tasks Studied in Previous Research. Neither has the research on noise focused on exactly

the kinds of standardized test tasks that are of most interest to us. Although this research has

employed a variety of different types of cognitive tasks, few if any of these tasks correspond

closely with the kinds of test questions that comprise the large-scale national tests administered

by ETS. Much of this research has focused on quite simple tasks because, as Weinstein (1974)

pointed out, earlier research on complex activities — like reading comprehension and

performance on intelligence tests — yielded mostly inconclusive results. Research using

“realistic noise levels and meaningful tasks” has been “uncommon” (Weinstein, 1974, p. 553).

Several investigations have focused on relatively simple tasks, such as proofreading. For

example, Weinstein (1974) studied the effects of noise on a proofreading task, finding effects on

subjects’ ability to detect grammatical, but not spelling, errors. Later, the same researcher

concluded that the skills required to detect spelling errors were “relatively immune” (Weinstein,

1977, p. 106) to disruption (a tape recording of radio news) but that the more complex processes

involved in detecting contextual errors (e.g., grammatical errors) were affected.

Proofreading has also been used in research on the effects of irrelevant speech on task

performance. In a series of studies, Jones, Miles, and Page (1990) noted negative effects, largely

when irrelevant speech was meaningful. Contrary to Weinstein’s (1974, 1977) findings, the

effects were greater when participants were asked to detect noncontextual errors (e.g., spelling)

than when they were asked to locate contextual ones (e.g., grammatical errors). In their summary

of research on the effects of irrelevant speech, Jones and Morris (1992) focused mainly on visual

tasks, rather than auditory ones. Asking what types of sound interfere with different kinds of

tasks, they concluded that interference is “quite considerable” in some circumstances, but

essentially nonexistent in others (p. 30).

With regard to other kinds of tasks, Kjellberg (1990) noted that reaction time and

vigilance tasks have been studied quite often. In contrast, research on the effects of noise on

“counting” is relatively rare, but seems to indicate that irrelevant speech has “no effect” (Martin,

Wogalter, & Forlano, 1988).

Smith (1985) studied a semantic processing task (requiring participants to indicate

whether sentences were correct or not) and a syntactic reasoning task (requiring participants to

verify statements about the order of letters presented). The results showed that continuous noise

had no effect on either task, but that intermittent, unpredictable noise did have an effect on

semantic processing. Finally, Kjellberg and Skoldstrom (1991) studied several different kinds of

tasks involving proofreading, finger dexterity, and complex reaction time and concluded that

“task differences probably explain only a small part of the widely differing noise tolerance levels

at different work places” (p. 39).

Research on Individual Differences. Besides research on different kinds of tasks, some

(but relatively little) research has focused on differences among individuals in their response to

noise. In one such effort, Weinstein (1978) used a self-report noise sensitivity scale (e.g., “At

movies, whispering and crinkling candy wrappers disturb me” and “Even music I normally like

will bother me if I’m trying to concentrate”). Because the scale predicted the reactions of noise-

sensitive and noise-insensitive students, Weinstein concluded that sensitivity to noise is a

personal trait “of sufficient power and generality to permit predictions of reactions to

environments” (p. 464).

In other related research, striking differences have been noted among individuals with

regard to the extent to which they find noise to be annoying and the degree to which it affects

their performance. However, in general this research has not been very successful in identifying

the particular characteristics of people most annoyed by noise (Jones & Davies, 1984). An

exception is that people with hearing impairments may express more annoyance in response to

noise than others. Thus, the effects of noise may depend both on the nature of the task being

performed and on the characteristics of the person performing the task.

Types of Noise Studied. In addition, as intimated earlier, the effects of noise may also

depend on various features of the noise itself. For example, Kjellberg, Landstrom, Tesarz,

Soderberg, and Akerlund (1996) concluded that it is easier to habituate to constant noise than to

variable noise, and therefore constant noise is generally less annoying than variable noise.

Kjellberg et al. also found that distraction was most closely related to the predictability of noise

and the degree to which listeners had control over it. Jones (1984) found that a major

determinant of the effect of noise on performance was the magnitude of the level of change in

noise — in either direction.

Jones, Miles, and Page (1990) found that the negative effects of irrelevant speech

depended on the speech being meaningful. With studies of meaningless noise, the main interest

has been the effects of intensity. For instance, the intensity of continuous white noise has been

studied widely for its influence on environmental stress and information processing (Smith,

1985). Intensity has been shown not to be a factor in studies of irrelevant speech, however

(Jones, Miles, & Page, 1990). Also, speech may be more disruptive when it is spoken in the

listener’s native language than in a language that the listener does not understand (Martin,

Wogalter, & Forlano, 1988).

Other Factors and Issues. A number of other factors have also been shown to moderate

the effects of noise. Studying the effects of building construction on dormitory residents, Ng

(2000) cited research showing that attitudes toward the source of the noise may moderate the

effects of noise. Those affected may be less negative about some sources of noise. For example,

in his study Ng speculated that because students stood to benefit, they would be less annoyed by

the disturbance resulting from the construction of a new dorm. In other research, Kjellberg et al.

(1996) found that distraction was most closely related to the predictability of the noise and the

degree of control that listeners could exert over it.

A final issue that has been addressed in research on noise is its measurement. Sailer and

Hassenzahl (2000) pointed out that the subjective experience of annoyance depends on factors

other than the absolute level of noise. Although there are objective measurements of unwanted

sound, “the relationship between objective sound level and subjective annoyance is rather poor”

(p. 1921). Furthermore, Kjellberg et al. (1996) noted that technical measures of noise explain

only a small portion of variation in annoyance. For example, sound level has been shown to be

only weakly related to annoyance when the noise is irrelevant speech.

Purpose of the Study

The main objective of our study was to determine the influence that distractions from

fellow test takers may have on test performance. The situation of particular interest was one in

which some test takers are taking a speaking test at the same time that others are taking tests that

involve reading, listening, calculating, reasoning, and writing.

Objectives

The two main objectives of the study were the following:

(1) to determine the extent to which test takers may be distracted by other test takers

in a computer-based testing environment in which some test takers are trying

tasks that involve speaking

(2) to assess the effectiveness of two methods of reducing distraction due to other test

takers¹

Method

Participants

The target sample for the study was test takers who had previously taken either the

GMAT, the GRE General Test, or the TOEFL test. To constitute this sample, we searched the

ETS “data warehouse” files to identify test takers who had tested previously at the University of

Maryland (College Park) computer-based testing center. For GMAT and GRE test takers,

participation was restricted to examinees who had tested within the previous 12 months. For

TOEFL test takers, the restriction was to the previous 6 months, as TOEFL scores are more

likely to change (as a result of experience with the English language) than are GMAT and GRE

scores.

This search yielded approximately 1,200 GRE test takers, 500 GMAT test takers, and 300

TOEFL test takers. “First come, first served” invitations were sent to each of these test takers,

who were asked to return to the test center to retake the test they had taken previously. Each

participant was promised an honorarium of $75. Using a recruitment flyer that we developed for

their use, the CBT test center staff at the university also solicited test takers for the study. In

addition to these efforts, information about the study was posted at George Mason University in

Fairfax, VA, the University of Maryland, Baltimore County, and both the graduate school and

the graduate business management school at the University of Maryland, College Park. These

efforts resulted in a total of 171 study participants, whose operational test scores were very

similar on average to those of the three test-taking populations.

Testing Facility

The testing site was a standard ETS computer-based testing center, centrally located on

the campus of the University of Maryland in College Park. This testing center, which houses 10

computer workstations, has been in operation since May 1999 and administers computer-based

tests for the GMAT, GRE, TOEFL, and The Praxis Series™ testing programs.

The center meets all operational specifications for ETS computer-based testing centers,

and as such it is deemed to be fully compliant with all technical and administrative procedures

required for the administration of “high-stakes” computer-based tests. These procedures include

all security practices as defined in the ETS “Policies, Practices, and Procedures” manual. Staff

are fully trained and certified by ETS.

The 10 workstations are partitioned, and each is equipped with a computer, a monitor, a

mouse, headphones, earplugs, and an adjustable chair. The administrative office is equipped with

a viewing window that enables a proctor to monitor test takers at all times during testing. The

center has a 20-foot ceiling and windows around the top of the room.

A picture showing the arrangement of the testing room is included as Appendix A.

Tasks

The tests used in the study were the following:

• GMAT

• GRE General Test

• TOEFL (computer-based version)

According to test information bulletins, these tests measure (for the GMAT) basic verbal,

mathematical, and analytical writing skills; (for the GRE General Test) verbal, quantitative, and

analytical skills; and (for the TOEFL) the English language proficiency of people whose native

language is not English. Each test comprises several different question formats. The verbal

portion of the GMAT includes reading comprehension, critical reasoning, and sentence

completion formats. The GMAT quantitative section includes problem solving and data

sufficiency formats, and the analytical writing section requires test takers to write two 30-minute

essays.

The verbal portion of the GRE General Test contains analogy, antonym, sentence

completion, and reading comprehension item formats. The quantitative portion contains

quantitative comparison questions and problem solving questions involving arithmetic, algebra,

and geometry. The analytical section comprises two item types: analytical reasoning and

logical reasoning.

The TOEFL listening section requires examinees to answer questions about conversations

they have heard. The reading portion contains short passages and questions that test examinees’

understanding of the passages. The structure portion contains multiple-choice questions that

require the recognition of language that is appropriate for standard written English. The writing

section requires test takers to compose a 30-minute essay.

Study Instruments

To devise a measure of distraction during test taking, 42 statements were written on the

basis of a review of the literature on the effects of noise/distraction on performance. These

statements were then reviewed, and the 21 most “face valid” were selected for inclusion in the

study measure. For these “distraction” questions, participants were asked to indicate the extent to

which they agreed or disagreed with such statements as

“I found myself thinking more about the surroundings than the test itself.”

“It was annoying when other people talked during the test.”

Responses were on a 5-point scale (strongly agree, agree, neither agree nor disagree, disagree,

and strongly disagree). A score of 1 to 5 was assigned to each statement, and scores for each

statement were summed to get a total distraction score. Some items were reverse scored as

appropriate so that the greater the distraction, the higher the total score.
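
The scoring rule just described can be sketched in a few lines of Python. This is an illustration only, not the scoring program used in the study: the mapping of "strongly agree" to 5, the function name, and the choice of which items are reverse keyed are all assumptions made for the example.

```python
# Illustrative sketch of summed Likert scoring with reverse-keyed items.
# ASSUMPTION: "strongly agree" is scored 5; the report states only that each
# statement received a score of 1 to 5.
RESPONSE_POINTS = {
    "strongly disagree": 1,
    "disagree": 2,
    "neither agree nor disagree": 3,
    "agree": 4,
    "strongly agree": 5,
}


def score_scale(responses, reverse_keyed):
    """Sum 1-5 item scores, flipping reverse-keyed items (1<->5, 2<->4)."""
    total = 0
    for index, label in enumerate(responses):
        points = RESPONSE_POINTS[label]
        if index in reverse_keyed:
            points = 6 - points  # reverse scoring on a 5-point scale
        total += points
    return total


# HYPOTHETICAL: which of the 21 items are reverse keyed is not given in the
# report; positions 2 and 7 are placeholders.
HYPOTHETICAL_REVERSE_KEYED = {2, 7}

# A respondent who answers "agree" to all 21 statements.
example_total = score_scale(["agree"] * 21, HYPOTHETICAL_REVERSE_KEYED)
```

With 21 items, totals under this scheme range from 21 to 105, and reverse keying ensures that a higher total always indicates greater reported distraction.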

Because there was no opportunity to pretest the items, we relied on data collected during

the study proper to document the suitability of the distraction measure. The study data showed

that, when items referred to distraction during the test that students had just taken, all items

correlated at least .50 in absolute value with the total distraction scale score, and 17 of 21

correlated greater than .70. When the statements referred to the previous test that examinees had

taken, 18 of 21 items correlated greater than .50 in absolute value with the total scale score. The

correlation of the weakest item was .27 with the total scale. Thus, the items comprising the

distraction scale were reasonably homogeneous.
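
For readers unfamiliar with item-total correlations, the following sketch shows one common way to compute them. It assumes the uncorrected form, in which each item is included in the total it is correlated with; the report does not say whether the item was removed from the total first. The function names and the toy data are invented for illustration and are not study data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def item_total_correlations(score_matrix):
    """Correlate each item with the total scale score.

    score_matrix[r][i] is the (already reverse-scored) 1-5 score of
    respondent r on item i. ASSUMPTION: the total includes the item itself.
    """
    totals = [sum(row) for row in score_matrix]
    n_items = len(score_matrix[0])
    return [
        pearson([row[i] for row in score_matrix], totals)
        for i in range(n_items)
    ]


# Made-up responses from four respondents to three items (not study data).
toy_scores = [
    [1, 2, 1],
    [2, 2, 3],
    [4, 3, 4],
    [5, 5, 4],
]
toy_correlations = item_total_correlations(toy_scores)
```

Items whose correlation with the total is low, such as the .27 reported for the weakest item, contribute less to what the scale as a whole measures.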

A second scale (eight statements) was developed to allow participants to indicate (again,

agreeing or disagreeing) how anxious they were about test taking, both for the test they had just

taken and for the operational test they had taken previously. Statements were modeled after those

in a variety of test anxiety inventories (e.g., Sarason, 1984; Spielberger, 1980) and included ones

such as

“I felt tense and unsure.”

“Thinking about how I was doing interfered with my work.”

The correlations of individual anxiety items with the total scale were all greater than .50.

For a third category of statements, participants were asked to report the amount of effort

they made (or the motivation they had) for each testing by agreeing or disagreeing, again on a

5-point scale, with six statements, such as

“I really tried to do my very best.”

“Getting a good score was not important to me.”

For the effort scale, item-total scale correlations were all greater than .40. The relevance of each

of these scales to our study objectives is discussed below.

Procedure

During the last three weeks in July 2001, test takers were scheduled in groups of up to 10

to retake a different form of the same operational test they had taken previously at the center.

GRE, GMAT, and TOEFL test takers were intermingled within the testing sessions, which are

described below. In preparation for the study, test takers were told that, in contrast to the test

they had taken previously, test scores would not count, nor would scores be reported to

institutions. Immediately before taking the test, participants were told only that

the purpose of the study is to compare how test takers perform under standard

operational testing conditions (when test scores count) vs. how they perform

under non-standard conditions when test scores do not count. So, we’d like you

to try your best when you take the test today.

Study participants who were assigned to the distracting conditions (described below) were also

told that

one major difference between today’s testing and the test that you took “for

real” previously is that you will hear other test takers taking a speaking test.

We have provided you with headsets to wear when you take the test. These

should help to block out the sound created by those taking the speaking test.

If they are not completely effective, please try to ignore any distractions as

best you can.

To simulate distraction from other examinees taking a speaking test, two additional

workstations were set up within the center to accommodate two compact disk (CD) players.

These CD players were used to play recordings of TOEFL candidates who had participated in the

pilot testing of new item types for a new TOEFL speaking test. The recordings included several

different speaking tasks (reading/speaking, listening/speaking, and independent speaking) so as

to generate a variety of speech as well as a relatively continuous stream of distraction during the

entire testing session. These responses were stored on four CDs created for the study, each

containing the responses of a single respondent from the earlier TOEFL pilot study. Two of the

CDs contained the responses of male test takers; two others contained the responses of female

test takers. Pauses of 5 to 12 minutes were inserted between the individual responses to simulate

the silence that would be observed for someone actually taking the new TOEFL speaking test.

The duration of each pause was determined at random to simulate the kind of intermittent speech

that might be expected during the testing. This intermittent speech was played during the entire

test session to enable a worst-case test of the effects of irrelevant speech on GMAT, GRE, and

TOEFL test taking. The volume, which was standardized for each test session, was set at a level

that project staff judged to approximate that of a typical test taker.

Design

In total, 28 testing sessions were conducted. During approximately a third of the sessions

(randomly designated), examinees were tested under standard, distraction-free conditions. For the

remaining two thirds of the sessions (again, randomly designated), examinees were distracted by

the two simulated “examinees.” For half of these “distracting” sessions, test takers were issued

headsets as a means of blocking noise. For the other half, they were provided with headsets that

also enabled them to hear (and adjust the level of) masking noise that was delivered through the


headsets. Before the test began, study participants in the masking noise condition were given

time to try different sound levels in order to determine their preferred level during testing.
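The random designation of sessions described above can be sketched as follows. This is only an illustration: the report gives the total of 28 sessions and the approximate proportions, but not the exact number of sessions per condition, so the 9/9/10 split produced here, the function name `assign_sessions`, and the seed are all assumptions.

```python
import random

def assign_sessions(n_sessions=28, seed=None):
    """Randomly designate each session as one of three testing conditions.

    The exact per-condition counts are hypothetical; the report states only
    "approximately a third" standard sessions and an even split of the rest.
    """
    rng = random.Random(seed)
    n_standard = round(n_sessions / 3)          # about a third: no distraction
    n_distracting = n_sessions - n_standard     # remaining two thirds
    n_headsets = n_distracting // 2             # half: headsets only
    n_masking = n_distracting - n_headsets      # half: headsets plus masking noise
    labels = (["no distraction"] * n_standard
              + ["headsets"] * n_headsets
              + ["headsets and masking noise"] * n_masking)
    rng.shuffle(labels)                         # random designation of sessions
    return {session: label for session, label in enumerate(labels, start=1)}

schedule = assign_sessions(seed=2002)
counts = {}
for label in schedule.values():
    counts[label] = counts.get(label, 0) + 1
```

Shuffling a fixed list of labels keeps the condition counts balanced while leaving the order of sessions to chance.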

After each testing session (both the standard sessions and the “distracting” sessions),

participants completed a questionnaire, which included the items that comprised each of the three

scales — distraction, anxiety, and effort — discussed above. Each of the scales sought

perceptions regarding both the test taken for our study (“today’s test”) and the test taken

previously (“previous test”).

Participants were also asked to give their opinions of the suitability and effectiveness of

each of the alternative methods of blocking distractions. Finally, in addition to informal

observations of each testing session by test proctors, we formally observed and documented test

takers’ activities and reactions during one of the testing sessions. Upon completion of the

session, the observer also talked individually with the test takers to gather their impressions.

Analysis

The main analysis was directed toward determining whether distraction had an undue

influence on test performance. More specifically, our interest was the extent to which distraction

could explain variation in participants’ test performance, above and beyond what could be

explained by their performance on the same test taken under standard, distraction-free

conditions. To help gauge the size of any “distraction effect,” we also ran the same analysis

using as explanatory variables (instead of distraction) each of two other variables often believed

to contribute to unwanted, extraneous variation among test takers’ performances. The first was

test-taking effort/motivation; the second was test anxiety. The rationale for using these variables

was that, like distraction, both are potential sources of invalid variation among test scores. Also,

as for distraction, we assumed that each of these variables would have a differential influence on

previous, operational test performance and performance on the test taken for our study. For

example, we assumed that, because there were no consequences for performing poorly, both

anxiety and effort/motivation would, on average, be less for our research study test

administration than for the previously taken operational test. For some of our study participants,

distraction would presumably be greater for our study test than for the operational test, which

was supposedly taken under relatively distraction-free conditions. The inclusion of these


variables (i.e., effort/motivation and anxiety) thus served as a baseline, enabling a comparison of

the influence of distraction with the influence of two other common but undesirable potential

sources of test score variance.

Second, we also assessed the effect of distraction on examinee perceptions. This

determination was made by comparing the reactions of participants who were distracted in our

study with those who were not. We also compared participants’ reactions to our distracting

conditions with their reactions to conditions experienced during the previous operational testing.

The reactions of interest were participants’ agreement with such statements as the following:

“Because of noise or other interruptions, my test performance suffered.”

“Any distractions probably didn’t affect my performance much.”

A final analysis focused on the possible differential effectiveness of the two methods of

reducing or controlling distraction.

These analysis objectives were accomplished by using, in the hierarchical fashion

suggested by Cohen and Cohen (1977), ordinary least squares linear regression analysis. For

each test section, test scores earned under distracting conditions were regressed first on test

scores earned previously under standard, nondistracting conditions. Next, the degree of

distraction reported by participants was added to the regression equation to determine the degree

to which it explained variation in test performance above and beyond that explained by

performance on the previous, operational test taken under standard, distraction-free conditions.

The same analysis was repeated using, in turn, effort/motivation and anxiety (instead of

distraction) as the explanatory variables. In addition, because anxiety was expected to exert a

greater influence for the previous, operational test taken “for real” than for our research test, we

reversed the analysis, this time regressing previous test performance on (1) test performance

during our study and (2) reported anxiety on the previous test. This enabled an assessment of the

extraneous influence of test anxiety in the operational test.
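The hierarchical procedure described above can be illustrated with a minimal sketch. It is not the authors' code: the data are simulated, the names `pearson` and `hierarchical_r2` are hypothetical, and the two-predictor R² is obtained from the standard correlation formula rather than from a regression package. Step 1 enters the previous test score; step 2 adds the distraction score and reports the increase in R² together with its F statistic.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def hierarchical_r2(y, x1, x2):
    """Return (R2 at step 1, R2 at step 2, increase in R2, F for the increase).

    Step 1 regresses y on x1 alone; step 2 regresses y on x1 and x2.
    """
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    r2_step1 = r_y1 ** 2
    # Squared multiple correlation for two predictors
    r2_step2 = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    increase = r2_step2 - r2_step1
    df_residual = len(y) - 3   # n minus two predictors minus the intercept
    f_increase = increase / ((1 - r2_step2) / df_residual)  # one added predictor
    return r2_step1, r2_step2, increase, f_increase

# Simulated (hypothetical) data: previous score, distraction score, research score
random.seed(0)
n = 100
previous = [random.gauss(500, 100) for _ in range(n)]
distraction = [random.gauss(60, 15) for _ in range(n)]
research = [0.9 * p - 0.5 * d + random.gauss(0, 40)
            for p, d in zip(previous, distraction)]

r2_step1, r2_step2, increase, f_increase = hierarchical_r2(
    research, previous, distraction)
```

The quantity of interest is `increase`: the share of variance in research-test scores that distraction explains above and beyond the previous test score. Swapping in effort/motivation or anxiety as `x2` gives the baseline comparisons.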

An analysis similar to those described above was undertaken to determine the effect of

testing condition on participants’ perceptions of distraction. In this analysis, the perception of

distraction during our study was regressed on perceptions of distraction during the previous test.

Next, testing condition was added to assess its power to explain perceptions of distraction above

and beyond what was explainable from participants’ perceptions of distraction on the previous,

operational test taken under, presumably, distraction-free conditions.


Finally, the degree to which distraction was an influence — both on test performance and

on examinee perceptions — was compared for each method of blocking distractions, that is,

headsets only versus headsets and masking noise.

Results

Reliability of Study Measures

Table 1 provides estimates of the reliability (coefficient alpha) for each of the scales

developed to measure three sources of construct-irrelevant (unwanted) test variance: differential

effort/motivation, test anxiety, and distraction. The estimates have been computed both for

participants’ responses concerning the previously taken operational test, and for the test taken for

our research study. As is clear, the 21-item distraction scale was quite reliable. The 8-item

anxiety scale was somewhat less reliable, with estimates around .80. The estimated reliability of

the 6-item effort/motivation scale was reasonably good with regard to reports of

effort/motivation for the research study test, but somewhat less for the previous, operational test.

Table 1 Reliability Estimates for Self-Report Scales

                              Test Occasion
Measure         Previous Operational Test    Research Test
Distraction               .92                     .96
Test anxiety              .81                     .76
Effort                    .65                     .81

Note: Table entries are Cronbach’s coefficient alpha.
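For readers unfamiliar with coefficient alpha, the sketch below shows how such an estimate is computed from item-level responses. The response matrix is made up for illustration (the study's item data are not reproduced in this report), and the function name `cronbach_alpha` is an assumption.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's coefficient alpha.

    responses: list of rows, one per respondent, each a list of item scores.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses: 5 respondents, 3 items on a 1-5 agreement scale
example = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [1, 2, 1],
    [3, 3, 4],
]
alpha = cronbach_alpha(example)   # roughly 0.96 for these made-up data
```

Alpha rises when items covary strongly relative to their individual variances, and it generally rises with the number of items.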

Study Sample

Table 2 shows the distribution of GMAT, GRE, and TOEFL test takers by study testing

condition. Because we were unable to assign equal numbers of GMAT, GRE, and TOEFL test

takers to each testing session, the samples were not equally distributed across testing conditions.


However, a chi-square test of the frequencies did not detect a statistically significant imbalance,

χ²(4) = 5.70, n.s.

Table 2 Distribution of Study Sample by Test Sample and Testing Condition

                                Testing Condition
            Distraction      Distraction With Headsets    No
Sample      With Headsets    and Masking Noise            Distraction    Total
GMAT              8                    13                      12          33
GRE              25                    28                      17          70
TOEFL            14                    27                      27          68
Total            47                    68                      56         171

χ²(4) = 5.70, n.s.
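The chi-square statistic reported for Table 2 can be checked directly from the cell counts. The sketch below is an illustration rather than the authors' computation; the function names are hypothetical, and the p-value helper uses a closed form that holds only for an even number of degrees of freedom (here, 4).

```python
import math

# Cell counts from Table 2: headsets, headsets and masking noise, no distraction
observed = {
    "GMAT": [8, 13, 12],
    "GRE": [25, 28, 17],
    "TOEFL": [14, 27, 27],
}

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(rows, row_totals):
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            chi2 += (obs - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, df

def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; valid only when df is even."""
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

chi2, df = chi_square_independence(observed)
p_value = chi2_upper_tail_even_df(chi2, df)
```

With these counts the statistic comes to about 5.70 on 4 degrees of freedom, with a p-value of roughly .22, which agrees with the reported nonsignificant result.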

Effect of the Experimental Manipulation

Table 3 displays, by testing condition and test-taker sample, study participants’ scores on

the distraction scale (higher numbers indicating greater perceptions of distraction). Several

aspects of this table are noteworthy. First of all, the degree of distraction reported by study

participants who were assigned to the no-distraction condition was, on average, comparable to

that reported for the previously taken operational test. There was no statistically significant

difference between these two conditions — that is, the nondistracting condition in our study

versus conditions during the operational test — within any of the three test-taker samples. The

implication is that our standard, nondistracting condition was a good approximation of

conditions during an operational test administration.


Table 3 Degree of Distraction by Test-Taker Sample and Testing Condition

                                                  Sample
Test/Condition                          GMAT     GRE   TOEFL   Total

Research Test
  Distraction With Headsets
    M                                   78.8    80.7    70.1    77.2
    SD                                  10.9    13.7    14.7    14.4
    n                                      8      25      14      47
  Distraction With Headsets and Masking Noise
    M                                   79.5    70.1    69.3    71.6
    SD                                  19.9    23.3    17.6    20.9
    n                                     13      28      27      68
  No Distraction
    M                                   42.5    47.7    44.9    45.2
    SD                                  12.0    18.2    14.3    15.3
    n                                     12      17      27      56

Previous Operational Test
    M                                   50.9    47.4    48.7    48.6
    SD                                  17.0    14.3    14.9    15.1
    n                                     32      69      66     167

Note: Entries are scores on the distraction scale.

Secondly, the table suggests few noteworthy differences among the three test-taking

samples with respect to the degree of distraction that they reported.

Thirdly, the reactions of participants in the two distracting conditions differed

dramatically from those in the nondistracting condition. These differences vary somewhat across

conditions and samples, but all can be considered by most standards (e.g., Cohen & Cohen,

1977) to be large, ranging from at least one to nearly three full standard deviation units. The

implication is that our treatment conditions were very successful in generating a noticeable

degree of distraction.

Finally, comparing the two distracting conditions reveals that, over all samples combined,

using masking noise in conjunction with headsets (instead of headsets alone) reduced the amount

of distraction slightly — by about a third of a standard deviation unit on average — an effect that

can be considered “small to medium” (Cohen & Cohen, 1977). This reduction can be considered


relatively small also in relation to the large difference remaining between distracted and

nondistracted study participants. Moreover, the effect does not appear to be consistent across

test-taking samples, with virtually all of it occurring within the GRE sample.
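The effect sizes quoted above can be approximated from the summary statistics in Table 3. The report does not say how the differences were standardized, so the sketch below assumes a pooled standard deviation (Cohen's d); it also takes the headsets-only total SD to be 14.4. The function name `cohens_d` is hypothetical.

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference using a pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Totals over all samples, from Table 3 (M, SD, n)
# Headsets only vs. headsets and masking noise (headsets-only SD assumed 14.4)
d_masking = cohens_d(77.2, 14.4, 47, 71.6, 20.9, 68)

# Headsets and masking noise vs. the no-distraction condition
d_vs_no_distraction = cohens_d(71.6, 20.9, 68, 45.2, 15.3, 56)
```

Under these assumptions the masking-noise reduction is about 0.30 standard deviation units, while the gap between the masking-noise and no-distraction groups remains about 1.42 units.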

Appendix B provides participants’ responses to each of the individual statements

concerning distraction. As is clear, there were noticeably higher reports of distraction on

virtually every item for the two distracted groups when compared to reports for both the

nondistracted study group and the operational test. Chi-square tests revealed statistically

significant differences among groups for each item. Perhaps most telling were the reactions to

the following statement:

“Due to noise or other interruptions, my test performance suffered.”

Only 10% of all study participants agreed or strongly agreed that this statement applied to

their operational testing experience, and only 9% of those assigned to the nondistracting study

condition agreed or strongly agreed that the statement applied to their experience in our study.

However, this statement was endorsed far more frequently by those who were distracted during

our study — by a slight majority (55%) of study participants who used headsets and by a near

majority (45%) of those who also heard masking noise.

Table 4 contains the results of regressing participants’ perceptions of distraction

(regarding the research study test administration) on their perceptions of distraction regarding the

previous, operational test administration. Of interest here is the effect of testing condition (no

distraction, distraction with headset, and distraction with headset and masking noise) over and

above the degree to which distraction can be explained as a consistent individual difference

among test takers (i.e., as indexed by their reports of distraction from the operational test taken

under relatively distraction-free conditions). The results of this analysis show that testing

condition was clearly a significant explanatory variable, above and beyond that reported for the

earlier operational test. This indicates that the treatment conditions did have a statistically

significant effect on participants’ perceptions of distraction for each of the three test-taker

samples. The regression weights indicate that, as suggested by the descriptive statistics presented

earlier, the main differences are between the two distracting conditions and the nondistracting

condition. Only in the GRE sample was there a significant difference between the two distracting

conditions, with the use of masking noise (instead of headsets alone) reducing the degree of

distraction reported.


Table 4 Effect of Testing Condition on Degree of Distraction

                                               Cumulative   Increase   F for increase
Explanatory Variable(s)                            R2         in R2        in R2          df

GMAT Sample
  1. Distraction Score (Nonoperational Test)      .10          .10          3.5          1, 31
  2. Distraction Score, Testing Condition         .61          .50         18.5***       3, 29

GRE Sample
  1. Distraction Score (Nonoperational Test)      .18          .18         14.9***       1, 68
  2. Distraction Score, Testing Condition         .44          .26         14.9***       3, 66

TOEFL Sample
  1. Distraction Score (Nonoperational Test)      .15          .15         11.3**        1, 66
  2. Distraction Score, Testing Condition         .51          .36         23.9***       3, 64

** p<.01, *** p<.001

The conclusion from this analysis is that our experimental simulation of distraction was

successful. The bad news, however, is that participants did react negatively to the resulting

distraction, feeling not only that they were distracted but also that their test performance was

negatively affected.
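The F statistics for the step 2 rows of Table 4 can be roughly recovered from the rounded R² values. The sketch below assumes that testing condition entered the equation as two dummy-coded variables (so 2 degrees of freedom were added) and that the second number in each df column is the residual degrees of freedom; both are inferences from the table, and the function name is hypothetical.

```python
def f_for_increase(r2_reduced, r2_full, df_added, df_residual):
    """F statistic for the increase in R2 when predictors are added.

    F = ((R2_full - R2_reduced) / df_added) / ((1 - R2_full) / df_residual)
    """
    return ((r2_full - r2_reduced) / df_added) / ((1 - r2_full) / df_residual)

# Cumulative R2 at step 1 and step 2, residual df at step 2 (from Table 4)
f_gmat = f_for_increase(0.10, 0.61, df_added=2, df_residual=29)
f_gre = f_for_increase(0.18, 0.44, df_added=2, df_residual=66)
f_toefl = f_for_increase(0.15, 0.51, df_added=2, df_residual=64)
```

These come to about 19.0, 15.3, and 23.5, against reported values of 18.5, 14.9, and 23.9. Differences of this size are what one would expect from working with R² values rounded to two decimals.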

Effort/Motivation and Test Anxiety

As stated earlier, part of our strategy was to compare the relative influence of distraction

with two other potential sources of unwanted test score variation — test anxiety and test-taking

effort/motivation. Table 5 shows mainly that, as expected, test anxiety ran higher, according to

examinee reports, during the previous, operational test than during the experimental testing that

we conducted. Over all samples, the difference (effect size) was nearly a full standard deviation

unit, a difference that can be considered “large.” Thus, on average, study participants clearly felt

that they experienced more anxiety when they took operational tests than when they took our

research tests. Responses to each of the test-anxiety statements are given in Appendix C. For


example, whereas 67% of the total sample agreed or strongly agreed that they “worried a lot”

before taking a previous, operational test, far fewer (9-11%) agreed or strongly agreed that they

did so before taking the tests given during our research study.

Table 5 Means and SDs of Anxiety Scale Scores by Sample and Testing Condition

                                                  Sample
Testing Condition                       GMAT     GRE   TOEFL   Total

Distraction With Headsets
    M                                   23.1    19.4    25.1    20.3
    SD                                   3.2     5.2     6.9     4.9
    n                                      8      25      14      47
Distraction With Headsets and Masking Noise
    M                                   20.7    21.4    21.1    21.1
    SD                                   6.4     6.1     6.2     6.2
    n                                     13      28      27      68
No Distraction
    M                                   18.3    23.9    18.7    18.6
Previous Operational Test
    M                                   26.8    25.4    24.2    25.2
    SD                                   7.4     6.3     7.1     6.9
    n                                     32      69      66     167

With respect to effort/motivation, participants said that they expended more effort (or

were more highly motivated) when they took an operational test than when they tested for our

study (Table 6). In terms of effect sizes, the difference was “large” within each of the three test-

taking samples. Appendix D shows responses to each question comprising the effort/motivation

scale. For example, whereas 90% of all study participants agreed or strongly agreed that they

were “concerned about doing well” on their previous, operational test, fewer (44-59%) of the

participants in each study condition agreed or strongly agreed with this statement. It should be

noted, however, that despite the generally lower effort/motivation for our research testing,

participants’ responses suggest that, in general, they made a reasonably good effort for our study.


For instance, a significant majority (70-87%) agreed or strongly agreed that they “tried to do

[my] best” for our study (compared with 88% for the operational test).

Table 6 Means and SDs of Effort Scores by Sample and Test Occasion

                                                  Sample
Test Occasion                           GMAT     GRE   TOEFL   Total

Research Study
    M                                   19.8    21.1    22.4    21.4
Previous Operational Test
    M                                   25.8    26.7    25.8    26.2
    SD                                   4.1     3.2     3.5     3.5
    n                                     32      69      66     167

Effects on Test Performance

Table 7 shows the mean test scores for each test-taker sample according to testing

condition for the tests taken for our study. Also shown are the scores that each group received on

the operational versions of the tests that participants took prior to our study. As is clear,

operational scores for each sample were somewhat higher on average than were the scores

obtained on the tests administered for our study. Scores obtained under distracting conditions

also were generally (but not always) lower than were those obtained under nondistracting

conditions in our study.


Table 7 Test Performance by Testing Condition

                                      Research Test
                       Distraction   Distraction With                 Operational
                       With          Headsets and       No            Test          Scale
Test Score             Headsets      Masking Noise      Distraction   Score         Range

GMAT Sample
  GMAT V         M        21             21                30            28         0-60
                 SD        5              7                 6             8
  GMAT Q         M        32             34                37            37         0-60
                 SD       14             11                14            11
  GMAT W         M       3.3            3.3               3.7           3.9         0-6
                 SD      0.8            0.7               0.5           0.7
  GMAT Total     M       457            473               574           548         200-800
                 SD      103            114               114           114

GRE Sample
  GRE V          M       468            497               460           494         200-800
                 SD       88            134               124           127
  GRE Q          M       560            578               602           582         200-800
                 SD      181            160               160           164
  GRE A          M       590            544               528           549         200-800
                 SD      167            163               173           163
  GRE Total      M      1617           1620              1590          1625         600-2400
                 SD      371            388               369           364

TOEFL Sample
  TOEFL R        M        23             21                22            23         0-30
                 SD        2              4                 5             4
  TOEFL L        M        23             21                23            22         0-30
                 SD        3              4                 3             5
  TOEFL W        M        23             21                23            23         0-30
                 SD        4              5                 3             3
  TOEFL Essay    M       4.1            4.1               4.1           4.3         0-6
                 SD      0.4            0.7               0.6           0.8
  TOEFL Total    M       222            208               227           227         0-300
                 SD       31             32                31            32

Note: N for GMAT sample was 26, N for GRE sample was 65, N for TOEFL sample was 51.


Regressing test scores obtained in our study on test scores obtained earlier under

standard, operational conditions revealed that the later scores were highly explainable from

earlier ones. Almost without exception, previous test performance explained what can be

considered to be a “large” proportion of variance (defined by Cohen and Cohen as 26%). When

we assessed the influence of distraction by adding this variable to the regression equation, we

found that it explained no more than 1% of test variance for any test taken by either the GRE or

the TOEFL sample. For the smaller GMAT sample, only for the total score did distraction

account for a statistically significant portion of test-score variance (8%). A proportion of this size

has been described by Cohen and Cohen as “small” to “medium.” Detailed results of these

regression analyses are shown in Appendix E. The general overall conclusion is that the variation

due to distraction, above and beyond that explainable by performance under standard,

nondistracting conditions, was neither practically nor (with one exception) statistically

significant.

In a similar vein, testing condition was substituted for distraction in the regression

equations to assess its explanatory power above and beyond that due to previous test

performance under standard conditions. The results of this analysis (Appendix F) were consistent

with the results using distraction score as the explanatory variable. Only for GMAT verbal score

was the effect of testing condition statistically significant (p < .05). As indicated by the

regression weights (not shown), this effect was approximately 7 points on the 0 to 60 score scale

(standard error of approximately 3 points) in favor of the nondistracting condition over each of

the distracting ones. The effect on GMAT total score barely failed to reach statistical

significance (p = .07).

Analyses that were parallel to those just described were run when test anxiety and test-

taking effort/motivation were substituted for distraction as an explanatory variable. The detailed

results of these regression analyses are shown in Appendixes G and H. Test anxiety was a

significant explanatory variable for several of the GMAT scores only. Like distraction,

differential effort/motivation had little explanatory power.

A summary of the proportion of test score variance that was explainable by each

extraneous source of variance — distraction, test anxiety, and test-taker effort/motivation — is

shown in Table 8. Also shown (in the column labeled “test constructs”) is the proportion of

variance explainable from performance on the previous, operational test. This table reveals that,


as suggested above, distraction had little if any role in explaining differences among study

participants’ test scores. In relation to the influence of the two other extraneous factors we

investigated, the role of distraction was minimal. In addition, the role of distraction was always

much less than the role of the skills and abilities measured by the tests (test constructs).

Table 8 Summary of Degree of Influence of Several Extraneous Sources of Variation in Test Performance

                              Source of Variation
Test Score       Distraction    Anxiety    Effort    Test Constructs

GMAT V              .11          .36***      .03       .33 (.34)
GMAT Q              .02          .02         .00       .86 (.87)
GMAT W              .09          .21**       .10       .23 (.28)
GMAT Total          .08*         .17***      .01       .68 (.69)
GRE V               .01          .00         .01       .72 (.75)
GRE Q               .01          .00         .00       .80 (.80)
GRE A               .00          .00         .01       .71 (.60)
GRE Total           .00          .00         .01       .85 (.83)
TOEFL L             .01          .00         .00       .77 (.73)
TOEFL R             .00          .01         .02       .50 (.59)
TOEFL W             .00          .01         .03       .50 (.55)
TOEFL Essay         .00          .03         .05       .21 (.25)
TOEFL Total         .01          .00         .01       .76 (.80)

* p<.05, ** p<.01, *** p<.001

Note: Table entries are proportions of variance explained by each source, above and beyond that explained by test performance on another occasion. The right-most numbers in parentheses are the proportions of test-score variation in previous, operational test scores that were explained by performance on the research study test.


Examinee Reactions to Methods Used to Block Sound

Opinion was mixed regarding the helpfulness of the headsets in blocking out

distraction. Although a majority of the study participants (57%) regarded the headsets as either

very or somewhat effective in blocking distractions, approximately a third (34%) rated the

headsets as either somewhat or very ineffective. When asked whether their headsets were

comfortable, 43% (of 143 participants who responded) said that the headsets were either very

comfortable or somewhat comfortable. On the other hand, a slightly greater proportion (48%)

found them to be either somewhat uncomfortable or very uncomfortable.

Masking noise was rated as slightly less effective than the headsets. Fewer than half

(44%) of the study participants said that masking noise was either very effective or somewhat

effective. Nearly a third (31%) rated masking noise as either very or somewhat ineffective. A

slight majority found it to be either very or somewhat helpful to be able to adjust the volume of

masking noise.

Observations and Debriefings

The observations of study participants and the informal debriefing of them after one of

the testing sessions provided information that was generally consistent with the results of the

questionnaire survey. The debriefings also yielded further insights or suggested additional

considerations, such as the following:

Distraction may be more noticeable and possibly more detrimental to performance on an

operational test than on our research test.

Masking noise may itself be a source of distraction for some test takers.

Some test takers may be able to accommodate to distraction after a relatively brief period

of exposure.


Discussion

The Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999)

assert that

[Test] validation involves careful attention to possible distortions in [test score] meaning

arising from … aspects of measurement such as test format, administration conditions

[emphasis added], or language level that may materially limit or qualify the interpretation

of test scores (p. 10).

The study reported here was designed to assess the degree of test-score distortion that

might result from one specific administration condition. The condition was one in which test

takers from various testing programs are commingled in the same computer-based testing center

with examinees taking a test containing a speaking component. In the current study, the

distraction expected from fellow test takers under these conditions was simulated by playing

recordings of test takers who had completed a number of tasks being tried out for a new TOEFL

speaking test. Of primary interest was the degree to which distraction was found to be an undue

or disproportionate influence on test performance, especially relative to the influence of other

extraneous factors. This influence was assessed by comparing the proportion of variation in test

takers’ performance that could be attributed to (1) test takers’ abilities, as determined by their

performance on tests taken earlier under standard, distraction-free testing conditions, and (2) the

degree of distraction that examinees experienced during the testing conducted for purposes of

our study. The study also provides information regarding the degree to which distraction was

perceived to have affected test performance. Finally, the study results provide some, albeit not

very encouraging, information about how unwanted distraction from fellow test takers might be

reduced.

The results revealed that distraction had, on average, a very large impact on examinees’

perceptions. Study participants reported that they were annoyed, disrupted, distracted, and so

forth by the noise that we generated. More importantly, examinees felt that our distraction had

negative effects on their test performances. The data also suggested that anxiety levels,

which were reported to have been lower generally during our study testing than during previous,

operational testing, were slightly, but not consistently, elevated when test takers were distracted

during our study.


The two methods of sound blocking that were studied — the use of headsets and the use

of headsets plus masking noise — received mixed reviews from examinees with respect to their

effectiveness. Furthermore, headsets were regarded as being uncomfortable by a significant

portion of the study sample. More importantly, neither method came close to reducing the

perceived level of distraction to the much lower level that was perceived by examinees when

tested under standard, relatively distraction-free conditions.

With respect to effects on actual test performance, distraction was estimated to be a

negligible influence for each of the tests taken by two of the test-taker samples (GRE and

TOEFL), accounting for, at most, 1% of test score variation above and beyond what could be

explained by examinee performance under relatively distraction-free conditions. Distraction was

also a relatively minor influence on GMAT test takers, accounting for, at most, a

“small to medium” (2 to 11%) proportion of variance for any of the GMAT sections. Moreover,

in each sample, the estimated influence of distraction was less than the influence noted for two

other potential sources of test score invalidity — test anxiety and differential examinee

motivation/effort.

Limitations

Like most research studies, this one had its limitations. First of all, although we were

interested in the effects of distraction for all ETS computer-based tests, it was not possible within

the time frame designated for the study to include test takers from every testing program. We

did, however, include three of the major (i.e., largest volume) ETS-administered testing

programs, whose tests contain a relatively wide variety of test question types that measure

diverse cognitive skills and abilities. One of the specific skills the study did not consider was

speaking proficiency.

Secondly, because the study sample sizes were relatively small, it was not possible to

assess the extent to which effects of distraction on test performance might depend on certain

individual differences among examinees — for example, their sensitivity to noise. Other

individual differences that might be equally important in this regard were not considered either.

Moreover, because one of the study samples (GMAT) was very small, the results based on this

sample should be interpreted very cautiously.


Thirdly, although the data suggested that we provided a rigorous test of examinees’

ability to cope with distraction, we wonder if we established conditions that were unduly severe.

Even though our simulated distraction seemed both realistic and reasonable to us, it may have

resulted in distraction that could be regarded as excessive. On the other hand, an extensive

observation of one of the testing sessions by an ETS observer led us to question whether we had

created enough distraction, as the observer questioned whether the level of noise, which had been

predetermined in the pilot testing, was loud enough.

Finally, we were able to implement our sound-blocking methods only partially in some

cases, as we could merely encourage, not enforce, the use of headsets during the distracting

conditions. Our observer noted several instances when test takers removed headsets, sometimes

repeatedly.

Conclusions

Our main conclusion is that intermingling test takers who are taking a speaking test with

examinees who are taking other tests will pose a challenge. More effective methods need to be

devised to reduce distraction to a level that is acceptable to test takers.

From our analyses of test performance, we conclude that, in general, the kind of

distraction that we studied will not pose a major threat to the actual validity of test scores. We

must temper this conclusion by our finding that, in one of the test-taker samples (GMAT), we did

detect a small but statistically significant impact on test performance, mainly on the verbal

portion of the test.

We also conclude that the instruments and design developed for the study reported here

proved effective for detecting the effects of distraction and for evaluating the effectiveness of

procedures for reducing it. These resources should prove useful for studying other alternatives

for reducing or controlling distraction.


References

American Educational Research Association (AERA), American Psychological Association

(APA), & National Council on Measurement in Education (NCME). (1999). Standards for

educational and psychological testing. Washington, DC: American Educational Research

Association.

Cohen, J. (1977). Statistical power analysis for the behavioral sciences (rev. ed.). New York:

Academic Press.

Cohen, J., & Cohen, P. (1977). Applied multiple regression/correlation analysis for the

behavioral sciences. Hillsdale, NJ: Erlbaum.

Jones, D. M. (1984). Performance effects. In D. Jones & A. Chapman (Eds.), Noise and society

(pp. 155-184). New York: Wiley.

Jones, D. M., & Davies, D. R. (1984). Individual and group differences in the response to noise.

In D. Jones & A. Chapman (Eds.), Noise and society (pp. 125-153). New York: Wiley.

Jones, D. M., Miles, C., & Page, J. (1990). Disruption of reading by irrelevant speech: Effects of

attention, arousal or memory? Applied Cognitive Psychology, 4, 89-108.

Jones, D. M., & Morris, N. (1992). Irrelevant speech and cognition. Handbook of human

performance, Vol. 1 (pp. 29-53). New York: Academic Press.

Kjellberg, A. (1990). Subjective, behavioral and psychophysiological effects of noise.

Scandinavian Journal of Work & Environmental Health, 16, 29-38.

Kjellberg, A., & Skoldstrom, B. (1991). Noise annoyance during the performance of different

nonauditory tasks. Perceptual and Motor Skills, 73, 39-49.

Kjellberg, A., Landstrom, U., Tesarz, M., Soderberg, L., & Akerlund, E. (1996). The effects of

nonphysical noise characteristics, ongoing task and noise sensitivity on annoyance and

distraction due to noise at work. Journal of Environmental Psychology, 16, 123-136.

Martin, R. C., Wogalter, M. S., & Forlano, J. G. (1988). Reading comprehension in the presence

of unattended speech and music. Journal of Memory and Language, 27, 382-398.

Mital, A., McGlothlin, J. D., & Faard, H. F. (1992). Noise in multiple-workstation open-plan

computer rooms: Measurements and annoyance. Journal of Human Ergonomics, 21, 69-82.

Ng, C. F. (2000). Effects of building construction noise on residents: A quasi-experiment.

Journal of Environmental Psychology, 20, 375-385.

Sailer, U., & Hassenzahl, M. (2000). Assessing noise annoyance: An improvement-oriented

approach. Ergonomics, 43, 1920-1938.

Sarason, I. G. (1984). Stress, anxiety, and cognitive interference: Reactions to tests. Journal of

Personality and Social Psychology, 46, 929-938.

Smith, A. P. (1985). The effects of different types of noise on semantic processing and syntactic

reasoning. Acta Psychologica, 58, 263-273.

Spielberger, C. D. (1980). Test anxiety inventory. Preliminary Professional Manual. Palo Alto,

CA: Consulting Psychologists Press.

Weinstein, N. D. (1974). Effects of noise on intellectual performance. Journal of Applied

Psychology, 59, 548-554.

Weinstein, N. D. (1977). Noise and intellectual performance: A confirmation and extension.

Journal of Applied Psychology, 62, 104-107.

Weinstein, N. D. (1978). Individual differences in reactions to noise: A longitudinal study in a

college dormitory. Journal of Applied Psychology, 63, 458-466.

Notes

1 Sound-control solutions for speech delivery and capture were divided into three main categories, based on cost, technology, and logistical implementation:

• low-end solutions (headphones, headsets, earplugs, earmuffs, etc.)

• midrange solutions (masking noise, whether white, pink, or brown)

• high-end solutions (sound booths, architectural modifications, etc.)

For practical cost, timeline, and logistical considerations, the team decided to focus the research on the low-end and midrange solutions.

2 The study design was informed by a set of small pilot studies at an on-site ETS computer-based testing laboratory. Four sessions, each with three test takers, were conducted in order to

1. compare several models of headsets, and select the best one for the larger scale study

2. try out the logistics for the main study

3. inform the design of a questionnaire

4. resolve any operational issues, on a small scale, before moving to the larger study

Based on a $200 maximum price limit, the team compared headsets and headphones made by a variety of manufacturers: Sony and Boss for headphones, and Plantronics, Sennheiser, and Tandberg for headsets. Criteria used for comparison were comfort, sound quality, volume control, durability, and cost.

The use of headsets with attached microphones was preferred over headphones with separate microphones for the following reasons:

• The use of a headset will ensure a constant distance between the test taker and the microphone at all times.

• The possibility of breakdown is reduced by using one device.

• Dealing with one device is more convenient operationally.

Finally, it was decided that study participants would be encouraged to use the headsets,

but would not be required to do so. It was also decided that participants would be allowed to use

earplugs instead of headsets if they so desired.

Based on the feedback from pilot study participants, the project team selected the Tandberg headset for the main study. The Tandberg headset was perceived to be the most comfortable; it is very robust; it has no volume-control capability (a feature preferred for ensuring speech capture); it has a nonmovable, easy-to-adjust unidirectional microphone (another feature preferred for ensuring speech capture); and it is relatively inexpensive (less than $80).

Several different types of masking noise were also considered for use in the main study,

and these were tried out in the pilot study. These choices included standard “white noise,” “pink

noise,” and a newer sound algorithm called “brown noise.” Analog and pure-digital versions of

these noise algorithms were tried out. Although each of the algorithms was determined to be

adequate for the purposes of this study, we found that an analog recording of pink noise was the

most pleasing to the majority of people questioned. This version of pink noise was created from

the analog outputs of a commercially available noise generator; it had fewer frequencies above

5000 Hz than a purely digital version of pink noise.
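
The three masking-noise types named in this note differ in how their energy is spread across frequencies: white noise is roughly flat, pink noise falls off at about 3 dB per octave, and brown noise at about 6 dB per octave. The following is a minimal Python sketch for illustration only. It is not the procedure used in the study, which relied on the analog outputs of a commercially available noise generator. The function names, the Voss-style approximation of pink noise, and the use of lag-1 autocorrelation as a rough indicator of low-frequency weighting are all assumptions introduced here.

```python
import random


def white_noise(n, rng):
    """Independent Gaussian samples: equal power at all frequencies on average."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def pink_noise(n, rng, rows=8):
    """Voss-style approximation: sum several random sources, where source k
    is redrawn only every 2**k samples, so slower sources add low-frequency power."""
    sources = [rng.gauss(0.0, 1.0) for _ in range(rows)]
    samples = []
    for i in range(n):
        for k in range(rows):
            if i % (2 ** k) == 0:
                sources[k] = rng.gauss(0.0, 1.0)
        samples.append(sum(sources))
    return samples


def brown_noise(n, rng):
    """Running sum of white noise (a random walk): power falls steeply with frequency."""
    level = 0.0
    samples = []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        samples.append(level)
    return samples


def lag1_autocorrelation(x):
    """Correlation between neighbouring samples; higher means more low-frequency weight."""
    mean = sum(x) / len(x)
    num = sum((a - mean) * (b - mean) for a, b in zip(x, x[1:]))
    den = sum((a - mean) ** 2 for a in x)
    return num / den


rng = random.Random(0)  # fixed seed so the run is repeatable
for name, make in [("white", white_noise), ("pink", pink_noise), ("brown", brown_noise)]:
    print(name, round(lag1_autocorrelation(make(4096, rng)), 2))
```

In a run of this sketch, white noise should show a lag-1 autocorrelation near zero, with pink noise clearly higher and brown noise close to one. The sketch does not reproduce the reduced content above 5000 Hz that distinguished the analog pink-noise recording used in the study.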

List of Appendixes

A. Testing Center Layout

B. Study Participants’ Reports of Distraction on Research Study Test and on Previous (Operational) Test

C. Study Participants’ Reports of Test Anxiety on Research Study Test and on Previous (Operational) Test

D. Study Participants’ Reports of Effort on Research Study Test and on Previous (Operational) Test

E. Regression Analyses for Influence of Distraction

F. Regression Analyses for Influence of Testing Condition

G. Regression Analyses for Influence of Test Anxiety

H. Regression Analyses for Influence of Effort

Appendix A

Testing Center Layout

Appendix B

Study Participants’ Reports of Distraction on Research Study Test and on Previous (Operational) Test

| Statements | Research Test: Distraction With Headset | Research Test: Distraction With Headset and Masking Noise | Research Test: No Distraction | Operational Test |
|---|---|---|---|---|
| I found myself thinking about how noisy the testing room was. | 79 | 72 | 23 | 17 |
| The testing room was relatively free from interruptions. | 21 | 39 | 64 | 74 |
| Distractions in the testing room really annoyed me. | 65 | 59 | 16 | 17 |
| Any distractions probably didn’t affect my test performance much. | 24 | 25 | 41 | 48 |
| It was annoying when other people talked during the test. | 74 | 56 | 33 | 35 |
| During the test, my “train of thought” was often disrupted. | 79 | 63 | 32 | 24 |
| I found myself thinking more about the surroundings…. | 49 | 52 | 4 | 16 |
| The testing room was so loud that I couldn’t “hear” myself think. | 51 | 36 | 4 | 4 |
| Due to noise or other interruptions, my test performance suffered. | 55 | 45 | 9 | 10 |
| The level of noise in the testing center was very noticeable. | 96 | 79 | 18 | 14 |
| Distractions in the testing center really slowed me down. | 53 | 53 | 9 | 18 |
| Distractions interfered with my thinking process. | 77 | 67 | 25 | 28 |
| I had to exert a lot of effort to concentrate. | 70 | 70 | 29 | 34 |
| I found myself listening to what others were saying. | 64 | 55 | 14 | 13 |
| I had lots of “starts and stops” or interruptions. | 51 | 48 | 4 | 11 |
| Because of distractions, I probably made careless mistakes. | 70 | 63 | 29 | 31 |
| There was a lot of commotion in the testing room. | 49 | 47 | 9 | 7 |
| Noise at the testing center was not a problem for me. | 11 | 19 | 54 | 48 |
| The level of noise was uncomfortably loud. | 65 | 55 | 16 | 6 |
| I felt “overloaded” because of distractions in the testing room. | 39 | 44 | 5 | 6 |
| I had no problem “tuning out” any noise or distraction. | 19 | 31 | 41 | 40 |

Note: Table entries are percentages of study participants who agreed or strongly agreed with each statement.
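
As the note states, each table entry is the percentage of participants in a column who agreed or strongly agreed with a statement. A minimal Python sketch of that arithmetic follows. The function name, the response labels other than "agree" and "strongly agree," and the sample responses are hypothetical and are not taken from the study questionnaire.

```python
def percent_agree(responses, agree_labels=("agree", "strongly agree")):
    """Percentage of responses in the agree categories, rounded to a whole number."""
    if not responses:
        raise ValueError("at least one response is required")
    hits = sum(1 for r in responses if r.strip().lower() in agree_labels)
    return round(100 * hits / len(responses))


# Hypothetical responses from five participants to one statement.
sample = ["agree", "disagree", "strongly agree", "neutral", "agree"]
print(percent_agree(sample))  # 3 of 5 responses agree, so 60
```

Because the entries are percentages within each group, columns can be compared directly even when the groups differ in size.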

Appendix C

Study Participants’ Reports of Test Anxiety on Research Study Test and on Previous (Operational) Test

| Statements | Research Test: Distraction With Headset | Research Test: Distraction With Headset and Masking Noise | Research Test: No Distraction | Operational Test |
|---|---|---|---|---|
| I felt tense and unsure. | 38 | 49 | 14 | 54 |
| Thinking about how I was doing interfered with my work. | 43 | 43 | 32 | 55 |
| The harder I tried, the worse I did. | 13 | 24 | 7 | 22 |
| Thoughts of doing poorly interfered with my concentration. | 28 | 36 | 34 | 53 |
| I worried a lot before taking the test. | 11 | 9 | 11 | 67 |
| I found myself thinking about how poorly I was doing. | 36 | 37 | 23 | 43 |
| After the test was over I couldn’t stop worrying. | 9 | 13 | 12 | 38 |
| I got so nervous that I forgot things I really knew. | 6 | 25 | 14 | 40 |

Note: Table entries are percentages of study participants who agreed or strongly agreed with each statement.

Appendix D

Study Participants’ Reports of Effort on Research Study Test and on Previous (Operational) Test

| Statements | Research Test: Distraction With Headset | Research Test: Distraction With Headset and Masking Noise | Research Test: No Distraction | Operational Test |
|---|---|---|---|---|
| I really tried to do my very best. | 87 | 75 | 70 | 88 |
| I really didn’t care how well I did. | 20 | 22 | 20 | 3 |
| I was motivated to do well. | 60 | 61 | 55 | 84 |
| I was very concerned about doing well on the test. | 45 | 44 | 59 | 90 |
| I didn’t try nearly as hard as I could have. | 23 | 31 | 38 | 16 |
| Getting a good score was not important to me. | 13 | 22 | 29 | 5 |

Note: Table entries are percentages of study participants who agreed or strongly agreed with each statement.

Appendix E

Regression Analyses for Influence of Distraction

Table E.1 Influence of Distraction on GMAT Performance

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| GMAT Verbal | 1. GMAT V, GMAT Q, GMAT W | .33 | .33 | 3.6* | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Distraction Score | .44 | .11 | 4.2 | 4, 21 |
| GMAT Quantitative | 1. GMAT V, GMAT Q, GMAT W | .86 | .86 | 45.4*** | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Distraction Score | .88 | .02 | 2.6 | 4, 21 |
| GMAT Writing | 1. GMAT V, GMAT Q, GMAT W | .23 | .23 | 2.2 | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Distraction Score | .33 | .09 | 3.2 | 4, 21 |
| GMAT Total | 1. GMAT V, GMAT Q, GMAT W | .68 | .68 | 15.5*** | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Distraction Score | .76 | .08 | 7.3* | 4, 21 |

* p<.05, *** p<.001

Note: Dependent variables are test scores from the nonoperational test that participants took for this study. Explanatory variables are test scores from a previous, operational administration of the test.
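
The "F for increase in R²" column in the Appendix E through H tables can be checked approximately from the tabled R² values. The following is a minimal Python sketch, assuming the standard hierarchical-regression test, F = ((R²_full - R²_reduced) / m) / ((1 - R²_full) / (N - k_full - 1)), where m is the number of explanatory variables added and k_full is the number in the larger model. The function name is hypothetical. The sample size N = 26 is inferred from the step 1 df of 3, 22 and is not stated in the table. Treating testing condition as two added variables is likewise an inference from the change in df from 3, 22 to 5, 20 in Table F.1. Because the tabled R² values are rounded to two decimals, the results only approximate the reported F values.

```python
def f_for_r2_increase(r2_reduced, r2_full, n, k_reduced, k_full):
    """F statistic and its df for the gain in R^2 when predictors are added.

    n is the number of participants; k_reduced and k_full are the numbers of
    explanatory variables in the smaller and larger regression models.
    """
    if k_full <= k_reduced:
        raise ValueError("the full model must contain more predictors")
    df_added = k_full - k_reduced      # numerator df: predictors added
    df_resid = n - k_full - 1          # denominator df: residual df of full model
    f = ((r2_full - r2_reduced) / df_added) / ((1.0 - r2_full) / df_resid)
    return f, (df_added, df_resid)


# GMAT Verbal, Table E.1: distraction score added to three prior test scores.
f, df = f_for_r2_increase(0.33, 0.44, n=26, k_reduced=3, k_full=4)
print(round(f, 1), df)

# GMAT Verbal, Table F.1: testing condition assumed to enter as two variables.
f, df = f_for_r2_increase(0.33, 0.50, n=26, k_reduced=3, k_full=5)
print(round(f, 1), df)
```

With the rounded values, the sketch gives about 4.1 and 3.4, close to the tabled 4.2 and 3.5. The df column in the tables appears to list the total number of explanatory variables in each model together with the residual df, whereas the function returns the number of variables added as the first df.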

Table E.2 Influence of Distraction on GRE General Test Scores

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| GRE Verbal | 1. GRE V, GRE Q, GRE A | .72 | .72 | 52.7*** | 3, 61 |
| | 2. GRE V, GRE Q, GRE A, Distraction Score | .73 | .01 | 3.0 | 4, 60 |
| GRE Quantitative | 1. GRE V, GRE Q, GRE A | .80 | .80 | 80.2*** | 3, 61 |
| | 2. GRE V, GRE Q, GRE A, Distraction Score | .81 | .01 | 2.9 | 4, 60 |
| GRE Analytical | 2. GRE V, GRE Q, GRE A, Distraction Score | .71 | .00 | 0.5 | 4, 60 |
| GRE Total | 2. GRE V, GRE Q, GRE A, Distraction Score | .85 | .00 | 1.3 | 4, 60 |

*** p<.001



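Table E.3 Influence of Distraction on TOEFL Scores

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| TOEFL Listening | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .77 | .77 | 37.8*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Distraction Score | .78 | .01 | 2.3 | 5, 45 |
| TOEFL Reading | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .50 | .50 | 11.3*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Distraction Score | .50 | .00 | 0.4 | 5, 45 |
| TOEFL Writing | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .50 | .50 | 11.6*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Distraction Score | .50 | .00 | 0.1 | 5, 45 |
| TOEFL Essay | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .21 | .21 | 3.0* | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Distraction Score | .21 | .00 | 0.0 | 5, 45 |
| TOEFL Total | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .76 | .76 | 36.2*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Distraction Score | .76 | .01 | 1.1 | 5, 45 |

* p<.05, *** p<.001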



Appendix F

Regression Analyses for Influence of Testing Condition

Table F.1 Influence of Testing Condition on GMAT Performance

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| GMAT Verbal | 1. GMAT V, GMAT Q, GMAT W | .33 | .33 | 3.6* | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Testing Condition | .50 | .17 | 3.5* | 5, 20 |
| GMAT Quantitative | 1. GMAT V, GMAT Q, GMAT W | .86 | .86 | 45.4*** | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Testing Condition | .87 | .01 | 0.5 | 5, 20 |
| GMAT Writing | 1. GMAT V, GMAT Q, GMAT W | .23 | .23 | 2.2 | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Testing Condition | .29 | .06 | 0.8 | 5, 20 |
| GMAT Total | 1. GMAT V, GMAT Q, GMAT W | .68 | .68 | 15.5*** | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Testing Condition | .75 | .08 | 3.1 | 5, 20 |

* p<.05, *** p<.001



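Table F.2 Influence of Testing Condition on GRE General Test Scores

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| GRE Verbal | 1. GRE V, GRE Q, GRE A | .72 | .72 | 52.7*** | 3, 61 |
| | 2. GRE V, GRE Q, GRE A, Testing Condition | .74 | .02 | 2.1 | 5, 59 |
| GRE Quantitative | 1. GRE V, GRE Q, GRE A | .80 | .80 | 80.2*** | 3, 61 |
| | 2. GRE V, GRE Q, GRE A, Testing Condition | .81 | .01 | 1.6 | 5, 59 |
| GRE Analytical | 2. GRE V, GRE Q, GRE A, Testing Condition | .72 | .01 | 1.6 | 5, 59 |
| GRE Total | 2. GRE V, GRE Q, GRE A, Testing Condition | .85 | .00 | 0.5 | 5, 59 |

*** p<.001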



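Table F.3 Influence of Testing Condition on TOEFL Scores

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| TOEFL Listening | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .77 | .77 | 37.8*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Testing Condition | .77 | .00 | 0.3 | 6, 44 |
| TOEFL Reading | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .50 | .50 | 11.3*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Testing Condition | .50 | .01 | 0.3 | 6, 44 |
| TOEFL Writing | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .50 | .50 | 11.6*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Testing Condition | .52 | .02 | 0.7 | 6, 44 |
| TOEFL Essay | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .21 | .21 | 3.0* | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Testing Condition | .21 | .00 | 0.1 | 6, 44 |
| TOEFL Total | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .76 | .76 | 36.2*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Testing Condition | .76 | .00 | 0.2 | 6, 44 |

* p<.05, *** p<.001

Note: Dependent variables are test scores from the nonoperational test that participants took for this study.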


Appendix G

Regression Analyses for Influence of Test Anxiety

Table G.1 Influence of Test Anxiety on GMAT Performance

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| GMAT Verbal | 1. GMAT V, GMAT Q, GMAT W | .34 | .34 | 3.8* | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Test Anxiety Score | .71 | .36 | 26.0*** | 4, 21 |
| GMAT Quantitative | 1. GMAT V, GMAT Q, GMAT W | .87 | .87 | 48.3*** | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Test Anxiety Score | .89 | .02 | 4.6* | 4, 21 |
| GMAT Writing | 1. GMAT V, GMAT Q, GMAT W | .28 | .28 | 2.9 | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Test Anxiety Score | .50 | .21 | 8.9** | 4, 21 |
| GMAT Total | 1. GMAT V, GMAT Q, GMAT W | .69 | .69 | 16.3*** | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Test Anxiety Score | .86 | .17 | 25.6*** | 4, 21 |

* p<.05, ** p<.01, *** p<.001

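Table G.2 Influence of Test Anxiety on GRE General Test Performance

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| GRE Verbal | 1. GRE V, GRE Q, GRE A | .75 | .75 | 60.2*** | 3, 61 |
| | 2. GRE V, GRE Q, GRE A, Test Anxiety Score | .75 | .00 | 0.4 | 4, 60 |
| GRE Quantitative | 1. GRE V, GRE Q, GRE A | .80 | .80 | 83.9*** | 3, 61 |
| | 2. GRE V, GRE Q, GRE A, Test Anxiety Score | .80 | .00 | 0.1 | 4, 60 |
| GRE Analytical | 2. GRE V, GRE Q, GRE A, Test Anxiety Score | .60 | .00 | 0.3 | 4, 60 |
| GRE Total | 2. GRE V, GRE Q, GRE A, Test Anxiety Score | .83 | .00 | 0.0 | 4, 60 |

*** p<.001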

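Table G.3 Influence of Test Anxiety on TOEFL Performance

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| TOEFL Listening | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .73 | .73 | 31.1*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Test Anxiety Score | .73 | .00 | 0.5 | 5, 45 |
| TOEFL Reading | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .59 | .59 | 16.3*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Test Anxiety Score | .59 | .01 | 0.9 | 5, 45 |
| TOEFL Writing | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .55 | .55 | 14.3*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Test Anxiety Score | .56 | .01 | 0.5 | 5, 45 |
| TOEFL Essay | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .25 | .25 | 3.8** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Test Anxiety Score | .28 | .03 | 1.8 | 5, 45 |
| TOEFL Total | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .80 | .80 | 46.6*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Test Anxiety Score | .80 | .00 | 0.3 | 5, 45 |

** p<.01, *** p<.001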

Appendix H

Regression Analyses for Influence of Effort

Table H.1 Influence of Effort on GMAT Performance

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| GMAT Verbal | 1. GMAT V, GMAT Q, GMAT W | .33 | .33 | 3.6* | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Effort Score | .36 | .03 | 1.1 | 4, 21 |
| GMAT Quantitative | 1. GMAT V, GMAT Q, GMAT W | .86 | .86 | 45.4*** | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Effort Score | .86 | .00 | 0.2 | 4, 21 |
| GMAT Writing | 1. GMAT V, GMAT Q, GMAT W | .23 | .23 | 2.2 | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Effort Score | .33 | .10 | 3.0 | 4, 21 |
| GMAT Total | 1. GMAT V, GMAT Q, GMAT W | .68 | .68 | 15.5*** | 3, 22 |
| | 2. GMAT V, GMAT Q, GMAT W, Effort Score | .69 | .01 | 0.8 | 4, 21 |

* p<.05, *** p<.001

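Table H.2 Influence of Effort on GRE General Test Performance

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| GRE Verbal | 1. GRE V, GRE Q, GRE A | .72 | .72 | 52.7*** | 3, 61 |
| | 2. GRE V, GRE Q, GRE A, Effort Score | .73 | .01 | 2.1 | 4, 60 |
| GRE Quantitative | 2. GRE V, GRE Q, GRE A, Effort Score | .80 | .00 | 0.3 | 4, 60 |
| GRE Analytical | 2. GRE V, GRE Q, GRE A, Effort Score | .72 | .01 | 2.4 | 4, 60 |
| GRE Total | 2. GRE V, GRE Q, GRE A, Effort Score | .86 | .01 | 3.6 | 4, 60 |

*** p<.001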

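Table H.3 Influence of Effort on TOEFL Performance

| Dependent variable | Explanatory variables | Cumulative R² | Increase in R² | F for increase in R² | df |
|---|---|---|---|---|---|
| TOEFL Listening | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .77 | .77 | 37.8*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Effort Score | .77 | .00 | 0.5 | 5, 45 |
| TOEFL Reading | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .50 | .50 | 11.3*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Effort Score | .52 | .02 | 2.2 | 5, 45 |
| TOEFL Writing | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .50 | .50 | 11.6*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Effort Score | .54 | .03 | 3.3 | 5, 45 |
| TOEFL Essay | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .21 | .21 | 3.0* | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Effort Score | .25 | .05 | 2.7 | 5, 45 |
| TOEFL Total | 1. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay | .76 | .76 | 36.2*** | 4, 46 |
| | 2. TOEFL L, TOEFL R, TOEFL W, TOEFL Essay, Effort Score | .77 | .01 | 1.9 | 5, 45 |

* p<.05, *** p<.001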

Test of English as a Foreign Language
P.O. Box 6155
Princeton, NJ 08541-6155
USA

To obtain more information about TOEFL programs and services, use one of the following:

Phone: 609-771-7100
Email: [email protected]
Web site: http://www.toefl.org