Chinese Journal of Applied Linguistics (Bimonthly), Vol. 33, No. 2, April 2010

An Application of Classical Test Theory and Many-facet Rasch Measurement in Analyzing the Reliability of an English Test for Non-English Major Graduates

SUN Haiyang
Beijing Foreign Studies University

Abstract

Taking classical test theory (CTT) and the many-facet Rasch measurement (MFRM) model as its theoretical basis, this study investigates the reliability of an English test for non-English major graduates by using SPSS and FACETS. The results of the CTT reliability study show that the candidates' scores on the objective test were not significantly correlated with their scores on the subjective tasks, and the internal consistency of the three subjective tasks was not satisfactory, either. The results of the MFRM analysis indicate that it was the two raters' difference in severity, the varying difficulty levels of the test tasks, and the bias interaction between some students and certain tasks that caused most of the variance in the scores. This demonstrates the necessity of training raters to be not only self-consistent but also consistent with each other. It also requires systematic study of measurement theory as well as item writing and task design techniques on the part of the teachers who serve as item writers or task designers. In addition, it calls on English teachers to pay attention to enhancing students' comprehensive language skills.

Key words: classical test theory; many-facet Rasch measurement; reliability; bias analysis

1. Introduction

The Ministry of Education initiated the Non-English Major Graduate Student English Qualifying Test (GET1) in 1999 and cancelled it in 2005. Graduate schools of many universities in China used scores from this test as a reference in determining whether a student's English proficiency was good enough to merit a master's degree. After the cancellation of the test, graduate schools of most universities began to organize their own English teachers to write test items following the test framework of the Ministry of Education. Because these tests are small-scale and informal, they have rarely been analyzed for validity and reliability. The present study was carried out to fill this gap. Based on classical test theory (CTT) and the many-facet Rasch measurement (MFRM) model, this research uses FACETS and SPSS to analyze the reliability of a test constructed by English teachers at a key university in Hebei province.

1.1 Classical test theory

Classical test theory provides several ways of estimating reliability, mainly by distinguishing true scores from error scores. The true score of a person can be thought of as the average of the scores that the person would obtain if he or she took the same test an infinite number of times. Because it is impossible to obtain an infinite number of test scores, the true score is hypothetical (Kline, 2005). Sources of error might be random sampling error, internal inconsistencies among items or tasks within the test, inconsistencies over time, inconsistencies across different forms of the test, or inconsistencies within and across raters. Under CTT, reliability can be estimated by calculating the correlation between two sets of scores, or by calculating Cronbach's alpha, which is based on the variances of different sets of scores (Bachman, 1990). The higher the value of Cronbach's alpha, the better the consistency of the test. As a rule of thumb, under CTT internal consistency reliability is usually measured by calculating Cronbach's alpha, while inter-rater reliability is assessed by calculating Cohen's kappa when the data are on an interval scale, or the Spearman correlation coefficient when the data are rank ordered.
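
As a point of reference, the standard formula for Cronbach's alpha for a test made up of k items or tasks is

\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right),
\]

where \(\sigma_i^{2}\) is the variance of scores on item or task i and \(\sigma_X^{2}\) is the variance of the total scores. The alpha values reported later in this study were computed in SPSS; the formula is quoted here only as background.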

CTT estimates of reliability are useful for gauging the general quality of the test scores in question. However, these estimates have several limitations. Firstly, each CTT estimate can address only one source of measurement error at a time, and thus cannot provide information about the effects of multiple sources of error or how these differ. Secondly, CTT treats all error as random, or “unidimensional” (see Baker, 1997), and thus CTT reliability estimates do not distinguish systematic measurement error from random measurement error. Lastly, CTT provides a single estimate of the standard error of measurement for all candidates (Weir, 2005). These limitations of CTT are addressed by item response theory (IRT).

1.2 Item response theory and many-facet Rasch measurement

Since the 1960s there has been a growing interest in item response theory, a term which includes a range of probabilistic models that allow us to describe the relationship between a test taker's ability level and the probability of his or her correct response to any individual item (Lord, 1980; Shultz & Whitney, 2005). Early IRT models were developed to examine dichotomous data2. By the 1980s, IRT models were being developed to examine polytomous data. IRT has found tremendous use in computer adaptive testing and in designing and developing performance tests.

Three IRT models in current use are the one-parameter, two-parameter and three-parameter logistic models (the one-parameter model is also referred to as the Rasch model3). All three models have an item difficulty parameter b, which corresponds to the point of inflection on the ability (θ) scale. Both b and θ are scaled using a distribution with a mean of 0 and a standard deviation of 1.0. Items with higher b values are therefore “more difficult”: the respondent must have a higher level of θ to pass or endorse them. The three- and two-parameter models also have a discrimination parameter a, which allows items to discriminate differentially among examinees. Technically, a is defined as the slope of the item characteristic curve4 at the point of inflection (see Baker, 1985: 21). The three-parameter model also has a lower asymptote parameter c, which is sometimes referred to as pseudochance5 (see Harris, 1989). This parameter allows examinees, even ones with low ability, to have a substantial probability of correctly answering even moderately difficult or hard items. Theoretically c ranges from 0 to 1.0, but it is typically lower than 0.3.
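
In their usual form, the three models express the probability of a correct response to item i by a test taker of ability θ as

\[
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}},
\]

which is the three-parameter model; setting \(c_i = 0\) gives the two-parameter model, and additionally fixing \(a_i\) at a common value (conventionally 1) gives the one-parameter, or Rasch, model.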

IRT rests on the premise that a test taker's performance on a given item is determined by two factors: the test taker's level of ability and the characteristics of the item. It assumes that an observed score is indicative of a person's ability (Fulcher & Davidson, 2007). All IRT models assume that when a test taker's ability corresponds to the item's difficulty, the probability of the test taker answering the item correctly is 0.5. On a scale, therefore, some items have lower values (are easier), and the probability of more able students passing such items is very high. As the difficulty value of an item rises, a test taker must be more able to have a high probability of getting the item correct.
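
The 0.5 property follows directly from the logistic form above. The short sketch below (a toy illustration, not part of the original analysis) makes it concrete for the Rasch case and shows how the b and c parameters shift the predicted probability.

```python
import math

def p_correct(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model;
    with a = 1 and c = 0 it reduces to the one-parameter (Rasch) case."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty and there is no guessing,
# the predicted probability is exactly 0.5.
print(p_correct(theta=1.2, b=1.2))                    # 0.5
# An easier item (lower b) gives the same person a higher probability.
print(round(p_correct(theta=1.2, b=-0.5), 2))         # about 0.85
# A guessing parameter c raises the lower asymptote for low-ability takers.
print(round(p_correct(theta=-3.0, b=1.2, c=0.2), 2))  # a little above 0.2
```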

Many-facet Rasch measurement (Linacre, 1989) is an extension of the Rasch model (Rasch, 1980), which is a special case of IRT, namely the one-parameter logistic model. It enables us to include multiple aspects, or facets, of the measurement procedure in the analysis of test results. A facet of measurement is an aspect of the measurement procedure which the test developer believes may affect test scores and hence needs to be investigated as part of test development. Examples of facets are task or item difficulty, rater severity, rating condition, etc. All estimates of the facets are expressed on a common measurement scale, which expresses the relative status of elements within a facet, together with interactions between the various facets, in terms of probabilities; the units used on this scale are known as logits6. This analysis allows us to compensate for differences across the facets.
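
In the form usually quoted from Linacre (1989), the model underlying the three-facet analysis reported later in this paper can be written as

\[
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k,
\]

where \(P_{nijk}\) is the probability of examinee n receiving a rating in category k from rater j on task i, \(B_n\) is the examinee's ability, \(D_i\) the task's difficulty, \(C_j\) the rater's severity, and \(F_k\) the difficulty of the step from category k–1 to k, all expressed in logits.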

MFRM also provides information about how well the performance of each individual person, rater, or task matches the expected values predicted by the model generated in the analysis (Sudweeks et al., 2005). MFRM allows us to identify particular elements within a facet that are problematic, or “misfitting”7: a rater who is unsystematically inconsistent in his or her ratings, a task that is unsystematically difficult, or a person whose responses appear inconsistent. These “fit statistics” are reported as Infit and Outfit Mean Squares in MFRM analysis. Under the many-facet Rasch model, the Infit and Outfit Mean Squares have an expected value of 1.0. Many researchers (e.g. Lunz & Stahl, 1990; Wright & Linacre, 1994) hold that a reasonable range of Infit and Outfit values is between 0.5 and 1.5.
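
The fit statistics can be pictured as summaries of standardized residuals. The sketch below is a simplified illustration of how Infit and Outfit Mean Squares are formed, not the exact computation FACETS performs; the function name and the toy numbers are made up for illustration.

```python
import numpy as np

def fit_mean_squares(observed, expected, variance):
    """Simplified sketch of Rasch fit statistics for one element
    (e.g. one rater) over all observations involving that element.

    Outfit MS is the unweighted mean of squared standardized residuals;
    Infit MS weights each squared residual by its model variance,
    making it less sensitive to occasional outliers.
    """
    observed, expected, variance = map(np.asarray, (observed, expected, variance))
    z_sq = (observed - expected) ** 2 / variance       # squared standardized residuals
    outfit = z_sq.mean()
    infit = ((observed - expected) ** 2).sum() / variance.sum()
    return infit, outfit

# Ratings close to expectation give mean squares well below 1.0 (overfit);
# erratic ratings push them above 1.5.
infit, outfit = fit_mean_squares([6, 7, 5, 8], [6.2, 6.8, 5.1, 7.9], [1.1, 1.0, 0.9, 1.2])
print(round(infit, 2), round(outfit, 2))
```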

The MFRM analysis also reports the reliability of separation index and the separation ratio. These two statistics describe the amount of variability in the measures estimated by the MFRM model for the various elements in a specified facet relative to the precision with which those measures are estimated. The reliability of separation index for each facet is a proportion ranging between 0 and 1.0, while the separation ratio ranges from 1.0 to infinity. However, the interpretation of these two statistics differs across facets. For the person facet, low values of either statistic may be indicative of central tendency error in the ratings, meaning that the raters were unable to distinguish the performance of the test takers (Myford & Wolfe, 2003, 2004). Low values of these two statistics for facets other than persons (e.g. rater, task, etc.) indicate a high degree of consistency in the measures for the various elements of that facet.
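
Although the two statistics are reported separately, they are tied together by a simple relation. If \(SD_{adj}\) is the error-adjusted (“true”) standard deviation of the element measures within a facet and \(RMSE\) is the root mean square of their standard errors, then

\[
G = \frac{SD_{adj}}{RMSE}, \qquad R = \frac{G^{2}}{1 + G^{2}} = \frac{SD_{adj}^{2}}{SD_{adj}^{2} + RMSE^{2}},
\]

where G is the separation ratio and R the reliability of separation index. This relation is used later as a rough check on the values reported in the Results section.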

In addition, the MFRM model allows for the analysis of “bias”, or instances of interaction between the facets. Through bias analysis, we can identify specific combinations of facet elements, such as particular rater-by-person or person-by-task combinations. Bias analysis can indicate whether one rater tended to rate an individual differently from all the others, or whether a person systematically performed differently on one task than he or she did on another. Each interaction between the facets specified in the analysis is given a bias score on the logit scale, and its significance is indicated by a standardized z-score. A z-score with an absolute value of 2.0 or greater indicates a significant interaction.

1.3 Previous studies

Although item response theory and related models have been applied and studied for over 40 years, classical test theory and related models have been researched and applied continuously and successfully for well over 80 years because of their simpler mathematics and straightforwardness. Hambleton & Jones (1993) described the theoretical possibility of combining the two measurement frameworks in test development. Bechger et al. (2003) discussed at length the use of CTT in combination with IRT in the item selection process of a real high-stakes test, and suggested using CTT reliability estimation once an appropriate IRT model had been found in the test construction process. Fan (1998) conducted an empirical study comparing the item and person statistics from the two measurement frameworks and found that the results were quite comparable.

Most reliability indices for second language performance tests are still estimated through Cronbach's alpha or correlation coefficients within the CTT framework (e.g. Saif, 2002; Shameem, 1998; Shohamy et al., 1992; Weir & Wu, 2006), whereas the issues of intra-rater variability and task variability have been studied extensively by employing MFRM analysis (e.g. Kondo-Brown, 2002; Lumley & McNamara, 1995; McNamara, 1996; Weigle, 1998; Weir & Wu, 2006). However, few if any empirical studies in the language testing field compare CTT and MFRM reliability analyses. This study was carried out to compare the results of reliability studies conducted within the two frameworks.

1.4 The present study

Standardized test developers usually take rigorous steps to design, administer, and analyze their tests. However, small-scale nonstandardized tests have rarely been studied for validity and reliability. The items of these tests are usually written by the course teachers relying merely on their intuition. Many teachers know little about test theory, and they scarcely consider whether their test items or tasks are valid enough to measure what they intend to test, or whether the scores from the test paper are reliable enough to evaluate a person's language proficiency.

Based on CTT and MFRM, the present study was carried out to analyze the reliability of a nonstandardized test, a qualifying English test for non-English major graduates, by employing SPSS and FACETS. The second purpose of this study is to demonstrate the complementary roles CTT and MFRM play in test score analysis.

2. Methodology

2.1 Participants

Fifty-six students from one regular class, fourteen male and forty-two female, took the test. They are all non-English major graduate students: some major in information technology, some in economics, and others in engineering. Their ages range from twenty-three to the thirties, with no one above thirty-five. At the time of the test they had finished their first year of graduate study. One year before the test all of them had taken the national entrance examination for graduate students, and their English scores were above 55 out of a total of 100.

2.2 The instrument

The test items used in this study were developed by English teachers at a national key university in Hebei province. The test consists of five parts: listening comprehension, vocabulary, reading comprehension, translation, and writing. Listening comprehension, vocabulary and reading comprehension all take the format of multiple-choice questions and are objective in nature, whereas translation, which is composed of English-to-Chinese (E-to-C) and Chinese-to-English (C-to-E) translation, and writing are evaluated subjectively. The weighted scores of the five parts are respectively 20, 20, 30, 20 (E-to-C and C-to-E translation each accounting for 10), and 10, which add up to a total score of 100.

The five parts were developed separately by five groups, with each group being made up of two teachers. Some of the items (such as those in the vocabulary part) were adapted from the exercises in the textbook, and others were created by the teachers who assumed that the test takers were at the higher-intermediate level of English proficiency.

Listening comprehension consists of ten questions based on ten conversations and ten items based on three passages. The vocabulary part includes ten questions testing synonyms of the underlined words or phrases and ten items testing sentence comprehension. Reading comprehension comprises five passages, each followed by three or more questions, totaling 30 multiple-choice questions. In this part, three reading texts, covering aspects of daily life such as reflections on social problems and news reports, are familiar to the students, while two texts on science reports are unfamiliar. A passage of about 200 English words and a passage of approximately 100 Chinese characters are given as the translation materials. The writing task is controlled, with a given title: students are expected to write an argumentative composition of no fewer than 150 words.

2.3 Data collection

The participants were asked to finish the test paper within two and a half hours. They wrote their answers to the objective test on answer cards, which were marked by computer, and wrote the translations and the compositions on answer sheets, which were later evaluated independently by two raters.

The objective part has a total score of 70. Because the computer produced only the total score of the objective part, the final analysis did not address the internal consistency of the three components of the objective part.

All three tasks in the subjective test were rated on a ten-point holistic rating scale. It should be noted that the researcher intentionally treated E-to-C and C-to-E translation as two different tasks, making three subjective tasks in total; the reason is that the more tasks there were, the more significant and reliable the MFRM analysis would be. A rating of 6 was considered the cut-off score, indicating the minimum required competence. Scoring categories above and below 6 were grouped in twos or threes, with 1 to 3 signifying little or no success, 4 to 5 inadequate, 7 to 8 adequate, and 9 to 10 excellent. The two raters were introduced to the detailed rating rubric and practiced rating several papers of varying quality before the scoring began.

2.4 Data analysis

The CTT inter-rater reliability was estimated by calculating Spearman's rho (ρ) between the two raters' ratings of each task, and the internal consistency reliability was assessed by calculating Cronbach's alpha (α) over the tasks in question. The reason for using the Spearman rank correlation to estimate inter-rater reliability is that the data were non-interval in a strict sense, so it was inappropriate to use Cohen's kappa or Pearson's correlation. SPSS version 15.0 was used for the analysis.
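
For readers without SPSS, the same CTT statistics can be reproduced with open-source tools. The sketch below assumes a hypothetical file rating_data.csv with one row per examinee and columns named <task>_rater1 and <task>_rater2 for the three subjective tasks; the file and column names are illustrative, not those used in the study.

```python
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("rating_data.csv")   # hypothetical layout: one row per examinee

# Inter-rater reliability: Spearman's rho per task, as in Table 1.
for task in ["EtoC", "CtoE", "Writing"]:
    rho, p = spearmanr(df[f"{task}_rater1"], df[f"{task}_rater2"])
    print(task, round(rho, 3), p)

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha across the columns of `items` (one column per task)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Internal consistency across the three subjective tasks,
# using the two raters' averaged scores for each task.
subjective = pd.DataFrame({
    t: df[[f"{t}_rater1", f"{t}_rater2"]].mean(axis=1)
    for t in ["EtoC", "CtoE", "Writing"]
})
print(round(cronbach_alpha(subjective), 3))
```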

The MFRM analysis of the scores on the three subjective tasks was completed using Minifac, the student version of FACETS. In the present study, three facets were analyzed: persons, raters, and tasks, with fifty-six, two, and three elements respectively (the numbers of test takers, raters, and tasks).

Bias analyses, also called interaction analyses, were performed for all two-way interactions between the facets. Thus, the final output of the MFRM analysis reports ability measures and fit statistics for persons; difficulty estimates and fit statistics for tasks; severity estimates and fit statistics for raters; the separation ratio and reliability index for each facet; and bias analyses for rater-by-person, rater-by-task, and person-by-task interactions.

3. Results

3.1 CTT reliability study

The results of the reliability coefficients for raters and tasks are summarized in Table 1.

As shown in Table 1, the inter-rater reliability for all three subjective tasks was moderately high. The Spearman correlation coefficients for the three tasks were 0.853, 0.774 and 0.678 respectively, all significant at p < .001, indicating a high level of consistency between the two raters. However, the overall low alpha values for the inter-task correlations indicate that the internal consistency of the test tasks was not good. The alpha value across the objective part and the three subjective tasks was 0.201, suggesting that the candidates' scores varied considerably between the objective and the subjective parts of the test. The Cronbach's alpha value for the three subjective tasks was 0.366, indicating low consistency among the subjective tasks. Table 1 also reveals that the alpha value for the two translation tasks was much higher than the other alphas (α = 0.712), suggesting that the students' scores on these two translation tasks were adequately consistent.

Table 1. Reliability coefficients for raters and tasks

Inter-rater reliability (No. of raters = 2)
  C-to-E translation                        ρ = 0.853, p < .001
  E-to-C translation                        ρ = 0.774, p < .001
  Writing                                   ρ = 0.678, p < .001

Inter-task reliability
  Three subjective tasks (No. of tasks = 3)       α = 0.366
  Two translation tasks (No. of tasks = 2)        α = 0.712
  Objective and subjective tasks (No. of tasks = 4)   α = 0.201

Under CTT, intra-rater reliability cannot be obtained from a single rating. To determine intra-rater consistency, researchers have to ask raters to rate the same test twice with an interval in between, which might itself introduce unwanted error into the ratings. MFRM analysis, however, can estimate intra-rater consistency without double ratings, by examining the Infit or Outfit Mean Squares of each rater: a rater identified as misfitting according to the Infit Mean Square is one who is not self-consistent.

As mentioned earlier, variance in the scores might be caused by inconsistent rating, inconsistent test items, or a test taker's inconsistent performance across different test tasks. A good test should minimize the influence of inconsistent test items and raters. The results of the CTT reliability analysis give us only a general picture of the internal consistency of the test and the raters; as for what caused the inconsistency, we expect the FACETS analysis to provide the answer.

3.2 MFRM analysis

3.2.1 Persons

Table 2 provides a summary of selected statistics on the ability scale for the 56 test candidates. The mean ability of examinees was 1.31 logits, with a standard deviation of 0.82; the range was from –0.13 to 3.79 logits. The person separation reliability index (the proportion of the observed variance in the ability measurements which is not due to measurement error) was 0.77, which suggests that central tendency error was not a big problem in the ratings of the examinees (Myford & Wolfe, 2004) and that the analysis was moderately reliable in separating examinees into different levels of ability. The separation index of 1.82 indicates that the dispersion of the language ability estimates was 1.82 times greater than the precision of those measures. The chi-square of 208.5 was significant at p < .01; therefore, the null hypothesis that all students were equally able must be rejected.

Table 2. Summary of statistics on the examinee facet (N = 56)

Mean ability                         1.31
Standard deviation                   0.82
Root mean square standard error      0.40
Separation index                     1.82
Separation reliability index         0.77
Fixed (all same) chi-square          208.5 (df = 55, p < .01)
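
As a check on the output, the separation ratio and the reliability of separation in Table 2, and in the later task and rater tables, are consistent with the relation \(R = G^{2}/(1+G^{2})\) noted in 1.2:

\[
\frac{1.82^{2}}{1+1.82^{2}} \approx 0.77, \qquad
\frac{6.01^{2}}{1+6.01^{2}} \approx 0.97, \qquad
\frac{3.16^{2}}{1+3.16^{2}} \approx 0.91,
\]

matching the reported values for the person, task and rater facets respectively.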

In order to identify students who exhibited unusual ability variance across the three tasks, fit statistics were examined. As was noted in 1.2, although there are no hard-and-fast rules for determining what degree of fit is acceptable, many researchers believe that the lower and upper limits for mean square fit values to be useful for practical purposes are 0.5 and 1.5 respectively. Fit statistics of 1.5 or greater indicate too much unpredictability in an examinee's scores, while fit statistics of 0.5 or less indicate overfit, or not enough variation in the scores. Linacre suggested that “values less than 1.5 are productive of measurement. Between 1.5 and 2.0 are not productive but not deleterious. Above 2.0 are distorting” (cited in Myford & Wolfe, 2003). Based on this rule, the Infit Mean Squares of 5 of the 56 candidates (8.9%) in this study were found to be above 2.0, and thus misfitting the model. In other words, these five students' performances deviated significantly from the expectations of the model. In addition, 20 of the 56 examinees (35.7%) were found to have an Infit Mean Square value lower than 0.5, meaning that there was too little variation in their scores. The lack of variation in the scores of such a large percentage of students might be attributed to the small number of tasks in question. The number of misfitting examinees is a problem, given that Pollitt & Hutchinson (1987) point out that we would normally expect around 2% of examinees to misfit. This would suggest revising the test structure by deleting or modifying misfitting tasks, or training the raters, if misfitting tasks, misfitting raters or bias interactions between the facets are found.

3.2.2 Tasks

The results for the task facet analysis are presented in Table 3. The task fit statistics indicate whether the tasks were graded in a consistent manner by the raters. The Infit Mean Squares of the three tasks were 0.85, 1.16, and 1.12 respectively, all within the acceptable range, suggesting that the tasks were graded consistently; that is, across raters, the more difficult tasks consistently received lower ratings than the easier ones.

As mentioned earlier, low values of the reliability statistics for the task facet indicate a high degree of consistency, that is, roughly equal difficulty levels among the tasks, whereas high values suggest clearly separated difficulty levels. The logit measures of –0.57, –0.08 and 0.65 for the task facet in the current study, together with a very high separation ratio (6.01) and separation reliability coefficient (0.97), demonstrate that the task difficulties were clearly and reliably separated along the continuum. That is to say, the three tasks were not equally challenging to the students, with writing the most challenging and E-to-C translation the least challenging. This separation was statistically significant (chi-square = 117.9, df = 2, p < .01).

Table 3. Results of the task facet analysis

Task                 Measure (logit)   Model error   Infit Mean Square   Difficulty level
E-to-C translation   –0.57             0.09          0.85                Easiest
C-to-E translation   –0.08             0.08          1.16                Intermediate
Writing              0.65              0.07          1.12                Most difficult

Note: root mean square error = 0.08; adjusted SD = 0.49; separation = 6.01; reliability = 0.97; fixed (all same) chi-square = 117.9, df = 2, p < .01.

As expected, writing was found to be the most difficult task, with a logit measure of 0.65. This might be ascribed to the fact that writing, as a productive skill, is much more demanding than the other skills tested. Compared to translation, which measures test takers' ability to find corresponding expressions between two languages, writing is clearly more challenging. Within the two translation tasks, E-to-C translation (logit difficulty = –0.57) turned out to be easier than C-to-E translation (logit difficulty = –0.08). It is easier for students to organize their Chinese translations cohesively and coherently, even when they have not fully understood the English source passage. With limited proficiency in English, however, students might have trouble finding appropriate English counterparts for Chinese vocabulary and structuring the sentences when completing the C-to-E task.

3.2.3 Raters

The results of the rater behavior analysis are displayed in Table 4. For raters, a small value of the reliability index is desirable, since ideally different raters would be equally severe or lenient (McNamara, 1996; Myford & Wolfe, 2004). According to McNamara (1996), the label “reliability index” for this statistic is “a rather misleading term as it is not an indication of the extent of the agreement between raters but the extent to which they really differ in their levels of severity” (p. 140). In other words, the reliability of rater separation index indicates to what extent the raters are reliably different rather than to what degree they are reliably similar. In the present case, the reliability was 0.91 for the two raters, indicating that the analysis reliably separated the raters into different levels of severity. The rater separation ratio of 3.16 indicates that the differences between the severity estimates for the two raters were unlikely to be due to sampling error, because the dispersion of the estimates was 3.16 times greater than the error with which they were estimated. The chi-square of 22.0 (df = 1) was significant at p < .01; therefore, the null hypothesis that the raters were equally severe must be rejected. The severity difference was 0.44 logits. These indicators of the magnitude of the severity difference between the two raters show that a significant difference did exist: rater 2 was harsher than rater 1 in rating the students' translations and compositions. This result contrasts with the relatively high inter-rater consistency found in the CTT analysis.

Table 4. Rater characteristics

Rater number   Severity (logit)   Model error   Infit Mean Square
1              –0.22              0.7           1.19
2              0.22               0.6           0.96

Note: Infit Mean Square mean = 1.08, SD = 0.12; separation = 3.16; reliability = 0.91; fixed (all same) chi-square = 22.0, df = 1, p < .01.

There is another interpretation of Infit Mean Square, that is, to interpret the Infit Mean Square value against the mean and the standard deviation of the set of Infit Mean Square values for the facet concerned. A value greater than the mean plus twice the standard deviation would be considered as misfitting for these data (McNamara, 1996: 173). In this specific case, the Infit Mean Square mean was 1.08, with a standard deviation of 0.12, so a value greater than 1.32 would be misfitting. Since the Infit Mean Square values for the two raters were 1.19 and 0.96, both of which were smaller than 1.32, neither of the raters was misfitting. In other words, both raters were self-consistent in their own scoring.

3.2.4 Bias analysis

MFRM uses the term “bias” differently from its more familiar meaning in educational and cultural contexts. An MFRM bias analysis allows us to identify patterns in the interaction effects between the facets in the data matrix; these patterns suggest a consistent deviation from what we would expect. For the three facets involved in the model, three two-way combinations can be generated, namely person by task, rater by person, and rater by task.

3.2.4.1 Rater by person bias

A bias analysis was carried out for rater-person interaction. This identifies consistent subpatterns of ratings to help us to find out whether particular raters are behaving in similar ways for all test takers, or whether some examinees are receiving overgenerous or harsh treatment from given raters.

Z-scores over an absolute value of 2.0 are held to demonstrate significant bias. In this data set, there were 112 rater-person interactions (2 raters × 56 examinees). Of these, no interaction showed significant bias. When there are interactions between particular raters and particular examinees, double ratings might be necessary.

3.2.4.2 Rater by task bias

A bias analysis was also carried out on the interaction of raters with tasks. The meaning of bias here is as follows: in each instance of significant bias, the rater involved is responding consistently to the task in a way which is both different from the other raters and different from his or her own behavior on the other tasks. In this study there were 6 possible instances of interaction (2 raters × 3 tasks), none of which showed significant bias.

3.2.4.3 Person by task bias

A bias analysis was also carried out on the interaction of students with tasks. The meaning of bias here is that in each instance of significant bias, the examinee involved is responding consistently to the task in a way which is both different from other examinees and different from his or her own behavior on the other tasks. In the current study there were 168 possible interactions (56 persons × 3 tasks), of which 13, or 7.7% of the total, showed significant bias. Table 5 presents some examples of significantly biased interactions, reporting student ability, task difficulty, the discrepancy, and the z-score. A z-score greater than +2 indicates that the student systematically performed better on that task than expected: as shown in Table 5, with a z-score of 3.27, student No. 22 performed better than predicted in writing. Conversely, a value smaller than –2 indicates that the student did worse than expected on the task: student No. 37, for example, obtained a z-score of –2.04, implying that he or she did worse than expected on the C-to-E translation task.

Table 5. Examples of significantly biased interactions between students and tasks

Student ID   Task      Student ability (logit)   Item difficulty (logit)   Predicted score   Observed score   Discrepancy (logit)   Error   Z-score
22           Writing   0.08                      0.65                      9.0               17.0             2.03                  0.62    3.27
15           C-to-E    1.05                      –0.08                     16.1              11.0             –1.32                 0.45    –2.93
17           E-to-C    2.00                      –0.57                     19.0              16.0             –1.46                 0.57    –2.57
37           C-to-E    0.65                      –0.08                     14.7              11.0             –0.92                 0.45    –2.04

Note: The observed score is the sum of the scores given by the two raters on the specific task.
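
The z-scores in Table 5 are simply the bias estimates (the discrepancies, in logits) divided by their standard errors. The short sketch below recovers them from the tabled values, up to rounding of the printed inputs, and applies the |z| ≥ 2.0 criterion used throughout this section.

```python
# Rows taken from Table 5: (student, task, discrepancy in logits, standard error)
rows = [
    ("22", "Writing", 2.03, 0.62),
    ("15", "C-to-E", -1.32, 0.45),
    ("17", "E-to-C", -1.46, 0.57),
    ("37", "C-to-E", -0.92, 0.45),
]
for student, task, discrepancy, error in rows:
    z = discrepancy / error
    verdict = "significant bias" if abs(z) >= 2.0 else "not significant"
    print(f"student {student} on {task}: z = {z:.2f} ({verdict})")
```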

Among the three tasks, the one most frequently involved in person-by-task bias was C-to-E translation (7 out of 13 cases). In each of these 7 instances the student obtained a z-score lower than –2, which could mean that the C-to-E translation task was particularly difficult for the students involved. The remaining bias instances were evenly distributed between the other two tasks (3 vs. 3). These findings suggest that the C-to-E translation prompts should be improved.

Another interesting finding is that 4 of the 5 students who were found to be misfitting were involved in the person-task bias interactions; they were students 15, 22, 32 and 50. This is understandable, because it is likely that the bias interaction between these students and certain tasks led to the significant unpredictable variability in their scores.

MFRM analysis also provides estimates of random inconsistent ratings. According to Bachman (2004), if a particular rater rated performance on different tasks inconsistently, more or less at random, this would be a problem of random inconsistency. On the other hand, if a rater consistently rated performance on one task more severely than on another, this would be a case of biased ratings. The interactions analyzed above were all biased ratings; a comparison of the total numbers of inconsistent and biased ratings in this study is reported in Table 6.

Table 6. Analysis of interactions

Interactions            Person by rater   Person by task   Rater by task
Inconsistent ratings    13/336            3/336            2/336
Biased ratings          0/112             13/168           0/6

In Table 6 the numbers before the slash indicate the number of interactions that were identified as inconsistent or biased, while the numbers after the slash indicate the number of ratings used in estimating the inconsistencies. As displayed in Table 6, there were very few significantly inconsistent ratings in this study. However, systematic patterns of bias were observed in person by task interactions.

4. Discussion

The results of the Cronbach's alpha analysis indicate that the internal consistency of the whole test was not very good, and the consistency level of the three subjective tasks was not satisfactory, either. What, then, caused the inconsistency? The MFRM analysis reveals that even though there were neither misfitting tasks nor misfitting raters, the three subjective tasks were at significantly different difficulty levels, and one rater was consistently harsher than the other. Therefore, we conclude that it was the varying difficulty of the tasks, the differing severity of the two raters, and the bias interaction between some test takers and certain tasks that contributed most of the variance.

As noted earlier, 8.9% of the students showed inconsistently varying ability across the three tasks. This inconsistent variability might be caused by the varying task difficulty levels, by inconsistent raters, by the fact that students did have imbalanced abilities across these tasks, or by interactions among these factors. The subsequent analysis shows that it was the bias interactions between those inconsistent students and certain tasks that accounted for most of the unexpected variance. Therefore, even though there were no misfitting tasks, the finding that a considerable number (7 out of 13) of the significant person-task bias cases involved the C-to-E translation task calls either for revision of this task or for enhancement of students' competence in it on the part of the teachers. One possible explanation for the students' inconsistent competence across different tasks is that school English teaching might have focused exclusively on some aspects of language skill while neglecting others. This calls for English teachers to pay attention to fostering students' comprehensive English ability, especially C-to-E translation and writing ability.

A striking difference between the two approaches was found in the contrasting results for inter-rater consistency. The MFRM analysis shows that rater 2 was significantly more severe than rater 1, which contrasts with the relatively high inter-rater correlation coefficients generated by the CTT reliability study. This might be due to the fact that the two analyses operate at different levels of detail. The Spearman correlation analysis of the raters was done task by task, producing three coefficients varying from 0.678 to 0.853; in doing so, some of the variance may have been ignored. MFRM analysis, on the other hand, works like a microscope, revealing potential blemishes on the measurement surface: the two raters' ratings of each student's performance on each task were compared case by case, and the two raters' severity was finally located along a logit continuum. In this way, MFRM analysis generates more accurate results than CTT correlation studies. Therefore, in our case, even though the two raters were self-consistent in their rating (as indicated by the finding that there was no misfitting rater in the MFRM analysis), they should be trained to be consistent with one another. As for the number of raters, because of time constraints only two raters were invited to do the rating; in future studies more raters should be involved in order to gauge the minimum number of raters required to ensure consistent and reliable rating.

Another matter of concern is that the low alpha values can also be interpreted as meaning that the test does not measure a single construct. Does this test measure the same construct? It depends on how we define the construct of the test. If we define the construct as the general English proficiency of non-English majors, the test did measure the same construct. If, however, we define the construct as English proficiency composed of differing subskills, it is understandable that the three tasks in this study did not measure the same construct. The three tasks analyzed in this study might represent two different subskills of the same language competence, namely writing and translation. This interpretation of the construct can also explain the relatively high alpha value for the two translation tasks: at least they measure the same translation skill, one component of language competence.

A given ability can be measured with a variety of techniques and approaches, which is why researchers and testing experts have always been seeking new methods. This study analyzed tasks measuring different components of the same language ability; further study analyzing different tasks that measure the same subskill of language, such as writing or translation, will be worthwhile, as it would help to identify tasks that are more reliable and valid in measuring these skills.

5. Conclusion

To sum up, the findings of this study demonstrate the complementary roles that traditional CTT reliability study and MFRM play in test reliability analysis. CTT reliability study generates a general picture of the internal consistency of the test and of inter-rater agreement at the task level, whereas MFRM provides detailed information about what caused the inconsistency and where the disagreement lies. Though each method's level of focus may make it appropriate in different measurement contexts, the information from each analysis can be used to complement the other.

This study casts some light on English teaching and test development. Firstly, raters should be trained to be not only self-consistent but also consistent with one another. Secondly, teachers should pay more attention to students' overall English proficiency. Thirdly, the traditional format in which each subskill of English is measured by a single task needs to be improved; more tasks measuring the same subskill should be included in a test to fully measure students' real ability. Finally, fundamental knowledge of measurement and item writing techniques is necessary for any teacher involved in the test development process.

Notes

1. GET is the acronym of “Graduate English Test”. The full name is “Non-English Major Graduate Student English Qualifying Test (硕士研究生英语学位课程考试)”. A variety of practice test books were published while this test was in use.

2. Dichotomous data are data obtained from multiple-choice items, usually with 0 standing for incorrect and 1 standing for correct responses; polytomous data are obtained from tasks or items which have more than two response options, such as Likert-type response scales of 1-5.

3. From the perspective of modeling data, the Rasch model is a special case of the two-parameter logistic model, but it is often referred to as the one-parameter model. The reason for the name “two-parameter logistic model” is that the discrimination parameter is conceived as a second item parameter. This label implies that discrimination parameters pertain only to items, whereas Rasch (1977) emphasized the importance of the frame of reference for measurement as a whole. In the Rasch framework, therefore, discrimination cannot be regarded as something pertaining only to items. This is an additional distinction between the perspectives inherent in the use of the different models and the terminology employed by different authors.

4. The item characteristic curve, also referred to as a trace line, describes the performance of an item in IRT. It gives the probability that a person with a given ability level will answer the item correctly.

5. It is known as the cheating or guessing parameter in ordinary terms.

6. A logit is a unit on a log-odds scale. With IRT, the units of the scale of measurement, which relates the probability that a person of ability X will get an item of difficulty Y correct, are called “logits”. This scale is centered on 0 and can vary in both the positive and negative directions. Positive values on the logit scale indicate greater person ability, greater item difficulty, or greater harshness on the part of raters; negative logit values indicate lower ability, easier items, or more lenient raters. Logit values typically range between +2 and –2. For a detailed discussion, see Lynch (2003).

7. “Misfitting” is a term used in MFRM to indicate problematic or inappropriate elements of a facet, namely those with an Infit Mean Square value greater than 2.0.

References

Allen, M. J. & Yen, W. M. 1979. Introduction to Measurement Theory. Monterey: Brooks/Cole Publishing Company.

Bachman, L. F. 1990. Fundamental Considerations in Language Testing. Oxford: Oxford University Press.

Bachman, L. F. 2004. Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press.

Bachman, L. F., Lynch, B. K. & Mason, M. 1995. Investigating variability in tasks and rater judgments in a performance test of foreign language speaking. Language Testing, 12, 238-257.

Baker, F. B. 1985. The Basics of Item Response Theory. Portsmouth, NH: Heinemann Educational Books.

Baker, R. 1997. Classical test theory and item response theory in test analysis. Language Testing Update Special Report No. 2. Lancaster: University of Lancaster.

Bechger, T. M. et al. 2003. Using classical test theory in combination with item response theory. Applied Psychological Measurement, 27, 319-334.

Brown, J. D. 1996. Testing in Language Programs. Upper Saddle River, NJ: Prentice Hall International.

Crocker, L. & Algina, J. 1986. Introduction to Classical and Modern Test Theory. Florida: Holt Rinehart Winston.

Fan, X. 1998. Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58, 357-381.

Fulcher, G. & Davidson, F. 2007. Language Testing and Assessment: An Advanced Resource Book. London and New York: Routledge.

Hambleton, R. K. & Jones, R. W. 1993. Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12, 253-262.

Harris, D. 1989. Comparison of 1-, 2- and 3-parameter IRT models. Educational Measurement: Issues and Practice, 8, 35-41.

Henning, G. 1987. A Guide to Language Testing: Development, Evaluation, Research. Cambridge, Mass.: Newbury House.

Hughes, A. 1989. Testing for Language Teachers. Cambridge: Cambridge University Press.

Kline, T. 2005. Psychological Testing: A Practical Approach to Design and Evaluation. Thousand Oaks: Sage Publications.

Kondo-Brown, K. 2002. A FACETS analysis of rater bias in measuring Japanese second language writing performance. Language Testing, 19, 3-31.

Linacre, J. M. 1989. Many-facet Rasch Measurement. Chicago: MESA Press.

Lord, F. M. 1980. Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Erlbaum.

Lumley, T. & McNamara, T. F. 1995. Rater characteristics and rater bias: Implications for training. Language Testing, 12, 54-71.

Lunz, M. E. & Stahl, J. A. 1990. Judge consistency and severity across grading periods. Evaluation and the Health Professions, 13, 425-444.

Lynch, B. K. & McNamara, T. F. 1998. Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15, 158-180.

Lynch, B. K. 2003. Language Assessment and Program Evaluation. Edinburgh: Edinburgh University Press.

McNamara, T. F. 1996. Measuring Second Language Performance. London and New York: Addison Wesley Longman.

Myford, C. M. & Wolfe, E. W. 2003. Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4, 386-422.

Myford, C. M. & Wolfe, E. W. 2004. Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5, 189-227.

Pollitt, A. & Hutchinson, C. 1987. Calibrated graded assessments: Rasch partial credit analysis of performance in writing. Language Testing, 4, 72-92.

Rasch, G. 1977. On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. The Danish Yearbook of Philosophy, 14, 58-93.

Rasch, G. 1980. Probabilistic Models for Some Intelligence and Attainment Tests. Denmark: Danish Institute for Educational Research; Chicago: MESA Press.

Saif, S. 2002. A needs-based approach to the evaluation of the spoken language ability of international teaching assistants. Canadian Journal of Applied Linguistics, 5, 145-167.

Shameem, N. 1998. Validating self-reported language proficiency by testing performance in an immigrant community: The Wellington Indo-Fijians. Language Testing, 15, 86-108.

Shohamy, E., Gordon, C. M. & Kraemer, R. 1992. The effect of raters' background and training on the reliability of direct writing tests. The Modern Language Journal, 76, 27-33.

Shultz, S. K. & Whitney, D. J. 2005. Measurement Theory in Action: Case Studies and Exercises. New Delhi and London: Sage Publications.

Sudweeks, R. R., Reeve, S. & Bradshaw, W. S. 2005. A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing. Assessing Writing, 9, 239-261.

Valette, R. 1977. Modern Language Testing. New York: Harcourt Brace.

Weigle, S. C. 1998. Using FACETS to model rater training effects. Language Testing, 15, 263-287.

Weir, C. 2005. Language Testing and Validation: An Evidence-based Approach. Basingstoke: Palgrave Macmillan.

Weir, C. & Wu, J. R. 2006. Establishing test form and individual task comparability: A case study of a semi-direct speaking test. Language Testing, 23, 167-197.

Wright, B. D. & Linacre, J. M. 1994. Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.