ITEM EVALUATION OF THE READING TEST OF THE MALAYSIAN UNIVERSITY ENGLISH TEST (MUET)

RUSILAH BINTI YUSUP

Submitted in partial fulfilment of the requirements for the degree of Master of Assessment and Evaluation

Melbourne Graduate School of Education (MGSE)

The University of Melbourne

September 2012

ABSTRACT

The present study is an item-level evaluation of the reading test of the Malaysian

University English Test (MUET), the high-stakes entrance test for Malaysian pre-degree

students. It comprises an in-depth analysis of student responses at item level to

explain why this compulsory entry test appears to be a formidable challenge for test

takers. The study aims to assess the quality of the test items from the framework of two

widely used psychometric theories – classical test theory (CTT) and the Rasch model.

Additionally, it examines the effects of item features and examinees’ characteristics in

determining the difficulty level of the test items. These two issues are explored

using regression analysis and differential item functioning (DIF) analysis, respectively. The

findings of item analysis demonstrate the complementary nature of CTT and the Rasch

model as useful tools for test design and evaluation. The study also reports that item

difficulty of this reading test is influenced largely by question format features

(particularly plausibility of the distractors) rather than passage-related variables and

question-type variables. The DIF analysis suggests that natural (real) differences are

a possible explanation for the variation between the groups examined. These

findings, though subject to limitations, have practical implications for instruction, test

construction and educational research. The study also provides directions for future exploration

of several issues it identified. Owing to its limited scope, the study

focuses on only one of the four components of MUET, namely the reading test.

DECLARATION

This thesis does not contain material which has been accepted for any other degree in

any university. To the best of my knowledge and belief, this thesis contains no material

previously published or written by any other person, except where due reference is

given in the text.

Signature: .............................................................

ACKNOWLEDGEMENT

First and foremost, praise be to Allah, the Most Gracious, the Most Merciful, whose Grace

and Will alone have made this journey possible.

This research would not have succeeded without the support of several people, to whom

I am greatly indebted.

My sincere and heartfelt gratitude goes to my dissertation supervisors, Associate Professor

Esther Care and Associate Professor Margaret Wu, for their guidance, encouragement

and thoughtfulness throughout this challenging journey. I am honoured to have been

given the opportunity to work with these brilliant and kind individuals.

I am also very grateful to the Ministry of Education Malaysia for granting me study

leave and a scholarship in my pursuit of academic development.

Last but not least, to my beloved family and friends, my deepest gratitude for their

endless love and unwavering emotional support.

Thank you all.

TABLE OF CONTENTS

ABSTRACT

DECLARATION

ACKNOWLEDGEMENT

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER 1 INTRODUCTION

1.1 Introduction

1.2 Problem Statement

1.3 Aim / Purpose of the Study

1.4 Research Questions

1.5 Significance and Limitation of the Study

1.6 Structure of the Thesis

CHAPTER 2 A REVIEW OF LITERATURE

2.1 Introduction

2.2 Assessment of Reading Comprehension

2.3 Psychometric Item Analysis of the MUET Reading Test

2.4 The Effects of Test Features and Examinee’s Characteristics on Item Difficulty of Reading Test

2.5 Review of Previous Studies

2.6 Summary

CHAPTER 3 METHODOLOGY

3.1 Introduction

3.2 Description of the Data

3.3 Description of the Materials

3.4 Description of the Procedures

3.5 Summary

CHAPTER 4 FINDINGS OF THE STUDY

4.1 Introduction

4.2 Results of Item Analysis

4.3 Relationship between Item Characteristics and Item Difficulty

4.4 Results of DIF Analyses

CHAPTER 5 DISCUSSION AND CONCLUSION

5.1 Introduction

5.2 Discussion of Major Findings

5.3 Implications of the Findings

5.4 Directions for Future Research

5.5 Conclusion

REFERENCES

APPENDICES

LIST OF TABLES

Table 1.1 Analysis of MUET-END 2009

Table 2.1 Classification of Discrimination Index (Ebel & Frisbie, 1991)

Table 3.1 Classification of Item Facility Index (Henning, 1987)

Table 4.1 Categories of CTT Item Facility Index

Table 4.2 Categories of CTT Discrimination Index

Table 4.3 Summary of CTT Analyses of Item 34 and Item 38

Table 4.4 Examples of Misfit Items

Table 4.5 Summary of CTT Item Analysis

Table 4.6 Summary of Reliability Analyses

Table 4.7 Characteristics of the MUET Reading Items

Table 4.8 Results of Linear Regression

Table 4.9 Results of Coefficient Analysis

Table 4.10 Interaction between Structure of Responses and Inference Type

Table 4.11 Summary of Overall Performance between Males and Female

Table 4.12 Parameter Estimates of Gender DIF Investigation

Table 4.13 Summary of Overall Performance between Ethnics and States

Table 4.14 Items with the Largest DIF Indices

LIST OF FIGURES

Figure 1.1 Flow chart of MUET Use for Pre-degree Students

Figure 2.1 Location of Person and Item Parameter

Figure 2.2 High Discriminating Item with MNQS 0.83

Figure 2.3 Poor Discriminating Item with MNQS 1.25

Figure 2.4 ICC Plot of Item with DIF

Figure 2.5 ICC Plot of Item without DIF

Figure 3.1 Probability of Success on an Item

Figure 4.1 The Item and Latent Distribution Map

Figure 4.2 ICC Plot of Difficult Item (Item 38)

Figure 4.3 ICC Plot of Easy Item (Item 26)

Figure 4.4 ICC Plot of High Discriminating Item (Item 2)

Figure 4.5 ICC Plot of Low Discriminating Item (Item 8)

Figure 4.6 ICC Plot of Item with Negative Discrimination Index (Item 34)

Figure 4.7 ICC Plot of Item with Negative Discrimination Index (Item 38)

Figure 4.8 ICC Plot of Misfit Item (Item 34)

Figure 4.9 Boxplot of Interaction between Plausibility of Distractors and Item Difficulty

Figure 4.10 Boxplot of Interaction between Inference Level and Item Difficulty

Figure 4.11 Boxplot of Interaction between Structure of Responses and Item Difficulty

Figure 4.12 ICC Plot of Item with Gender DIF (Item 6)

Figure 4.13 ICC Plot of Item without Gender DIF (Item 15)

Figure 4.14 ICC Plot of Item with Ethnicity DIF (Item 30)

CHAPTER 1: INTRODUCTION

1.1 INTRODUCTION

Broadly speaking, the purpose of educational testing and assessment is straightforward:

to measure or gauge what learners know or can do. Testing or assessment is seen as a

process of gathering evidence to infer the performance of an individual

(McNamara, 2000). In the educational context, many educators recognize that tests play

a crucial role in the arsenal of tools used to measure student achievement.

In the Malaysian context, the outcome of assessment through standardized tests¹ is seen

as the linchpin for tracking how well students perform throughout their schooling years.

Stakeholders in education, particularly students, teachers and parents, make inferences

about students’ overall performance from national standardized high-stakes

examinations like Penilaian Menengah Rendah (PMR or Lower Secondary Evaluation),

Sijil Pelajaran Malaysia (SPM or Malaysian Certificate of Education) and Sijil Tinggi

Pelajaran Malaysia (STPM or Malaysian Higher School Certificate Examination).

After PMR, for example, students are streamed into either a Science stream or an Arts

stream based on their results. Those with outstanding results in SPM have the

advantage of entering matriculation centres² which offer a pre-university programme for

Malaysian ‘bumiputera’ students to prepare them to qualify for degree

programmes in the fields of Science and Technology in both local and overseas

universities.

The results of standardized tests serve a variety of purposes in educational settings. At

the individual level, Mertler (2007) asserts that test scores are used to describe one’s

learning abilities and levels of achievement. The information helps students to identify

their areas of strength and weakness (Schwartz, 1984) and this guides them to modify or

adapt to the instruction based on their own needs (Mertler, 2007). In addition, test

results can provide useful information at group level. Often, test results are used to

compare students with other students (Schwartz, 1984). Tests serve as an indicator of

general ability levels of students across classes, grade levels, schools or states.

¹ A test that is developed, administered and scored in a predetermined standard manner. Students take the same set of exam questions, marked with the same marking scheme and graded using the same grading system.

² Centres for foundation studies which offer one or two-year programmes run by the Ministry of Education.

Over the decades, we have witnessed a change in the use of educational testing.

Educational tests no longer serve primarily as indicators of educational achievement.

Tests have become an effective policy device to implement changes or modifications to

educational policies (Baker, 1989) and to monitor the effectiveness of instruction or

academic courses (Bachman & Palmer, 1996). Test results are now used to evaluate

teachers, administrators, and even the quality of an entire curriculum and instructional

programme. As an example, many higher educational institutions make use of scores from

standardized tests as the sole, mandatory, or primary criterion for admissions or

certification. Therefore, education stakeholders should view the results of tests as a

source of information which needs to be put to good use to reach appropriate

decisions about students, instruction and curriculum at large.

1.1.1 English Language in Malaysia

Owing to the British colonial legacy, the English language has been spoken in Malaysia for

decades. From pre-independence days until today, it has been widely spoken and is

therefore considered the second language of the country. It has been used extensively in

commercial and social settings, formal and informal situations – in business

transactions, internet communication, advertising and the entertainment industry. In

government administration, although Malay is the official language, English usage is

frequent and necessary in many international transactions and correspondence. To a

certain extent, English has become part and parcel of the life of Malaysians. As an

example, failure in securing jobs after graduation is often linked to the inability to

communicate effectively in English. It is also a common notion in Malaysia that one’s

success in today’s competitive global world is associated with the mastery of the

English language.

Due to its importance, English has been made a compulsory subject taught and tested as

a second language from the first year of an individual's primary education to the end of

his/her secondary education in Form Five. Unfortunately, prior to 1999, English was not

taught or tested at the Sixth Form or pre-university level. However, upon entry into the

local public tertiary institutions, these pre-1999 students were required to undergo a

course in English language proficiency. This is because at the tertiary level, although

the medium of instruction in the public universities is the national language (Malay),

English is widely used to teach science and mathematics-related subjects or courses.

It was with the dual purpose of filling the gap with respect to the training and learning

of English and that of consolidating and enhancing the language literacy of the Sixth

Form and pre-university students, that the Malaysian University English Test (MUET)

was first introduced in 1999, along with a curriculum/syllabus for delivery at Sixth

Form and equivalent level.

MUET is administered twice a year, i.e. at mid-year (April/May) and year-end

(October/November). The test is developed and run by the Malaysian Examination

Council³. It is a test to measure the English language proficiency of pre-university

students for entry into tertiary education. It is a mandatory test to gain entry into degree

courses offered at all Malaysian public universities. Unlike the International English

Language Testing System (IELTS) and Test of English as a Foreign Language

(TOEFL) which are globally accepted as the certification of English language

proficiency, MUET is recognized only in Malaysia and Singapore (National University

of Singapore, Nanyang Technological University and Singapore Management

University).

MUET assesses the four language skills of listening, speaking, reading and writing. It

gauges and reports a candidate’s level of proficiency based upon an aggregated score

ranging from zero to 300 which is then converted into one of six bands, from

the lowest, Band 1, to the highest, Band 6.

³ A statutory body under the Ministry of Education, which is solely responsible for the development and administration of MUET. This body is not involved in the management of other high stakes examinations like PMR and SPM. These two standardized tests are run by the Malaysian Examination Syndicate.

The MUET syllabus aims to equip students with the appropriate level of proficiency in

English so as to enable them to perform effectively in their academic pursuits at tertiary

level. The syllabus is designed to bridge the gap in language needs between secondary

and tertiary education by enhancing communicative competency, providing the context

for language use that is related to the tertiary experience and developing critical

thinking through the competent use of language skills. In a broader sense, it aims to

prepare Malaysian university graduates to be able to compete effectively at the global

level which requires the mastery of the lingua franca spoken all over the world.

1.2 PROBLEM STATEMENT

After receiving their SPM examination results, qualified students may move on to

study in various higher learning institutions in the country. They can choose to enrol in

Form Six (pre-university level), a matriculation college, a teacher training institute, a

polytechnic or a community college. At this level, English is given considerable

emphasis. For example, English is taught in teacher training colleges and matriculation

centres to help students to enhance their English proficiency as well as to prepare them

for the MUET exam. Teaching English or the MUET syllabus to pre-university

students is therefore seen as a consolidation phase or continuation of what they have

learnt in secondary schools.

Achievement in MUET acts as an indicator of a student’s language proficiency level

and enables him or her to enrol for undergraduate programmes at Malaysian public

universities or other higher learning institutions. For most universities, students must

obtain a higher band in MUET in order to be accepted in the faculties of Engineering,

Dentistry, Medicine and Law. At University Malaya, for example, students aspiring to

pursue the Bachelor of Law or the Bachelor of TESL need to pass with at least Band 4 or

equivalent. In another case, Band 5 for MUET is the minimum requirement for students

enrolling in the Faculty of Law at MARA University of Technology. Thus, to be

granted admission to their choice of programme, students must pass the MUET with a

satisfactory grade to meet the requirement outlined by the universities. Figure 1.1

depicts the use of MUET for pre-degree students in sixth form and equivalent.

Figure 1.1: Flow Chart of MUET Use for Pre-degree Students

Despite its status as a hurdle requirement or mandatory language test for entry into

public universities, the exam is a formidable challenge for many students. The final

analysis of MUET-END 2009 by the Malaysian Examination Council (see Table 1.1),

for example, revealed that 89% of students were placed in Band 3 or below (see the band descriptor in

Appendix A). It was noted that 39% of test-takers were categorised as limited users

(Band 2). Also, the percentage of students obtaining the upper bands (Band 4 - Band 6)

is small. These figures suggest that the level of English language proficiency among

Malaysian students is at a low ebb. These low results restrict many students’ chances of

entry to the programme of their choice.

[Figure 1.1 flow chart: SPM → Sixth Form, Matriculation College, Teacher training college, Polytechnic or Community college → MUET → Undergraduate programmes in public universities]

Table 1.1: Analysis of MUET-END 2009

Type of candidate        Total         Band 6         Band 5         Band 4         Band 3         Band 2         Band 1
                                   No.      %     No.      %     No.      %     No.      %     No.      %     No.      %
Public school            27605       4   0.01     274   0.99    2047   7.42    7017  25.42   11731  42.50    6532  23.66
Private school            1308       0   0.00      26   1.99      72   5.50     192  14.68     428  32.72     590  45.11
Independent candidate    11213       3   0.03     226   2.02    1586  14.14    3799  33.88    4138  36.90    1461  13.03
Matriculation centre     21310       5   0.02     353   1.66    3132  14.70   10843  50.88    6737  31.61     240   1.13
Diploma holder           11302       0   0.00      28   0.25     526   4.65    3084  27.29    5394  47.73    2270  20.08
Bachelor holder           4322       0   0.00      52   1.20     429   9.93    1471  34.04    1878  43.45     492  11.38
State-funded school       2146       0   0.00       0   0.00      12   0.56      78   3.63     554  25.82    1502  69.99
TOTAL                    79206      12   0.02     959   1.21    7804   9.85   26484  33.44   30860  38.96   13087  16.52
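
As a rough arithmetic check on the percentages discussed in this section, the band shares can be recomputed from the candidate counts in the TOTAL row of Table 1.1. The sketch below is illustrative only: the variable and function names are hypothetical and are not part of the Malaysian Examination Council's analysis.

```python
# Candidate counts per band, copied from the TOTAL row of Table 1.1.
band_counts = {6: 12, 5: 959, 4: 7804, 3: 26484, 2: 30860, 1: 13087}
total_candidates = sum(band_counts.values())  # 79206, as reported in the table


def band_share(band):
    """Percentage of all candidates placed in the given band."""
    return 100 * band_counts[band] / total_candidates


def share_at_or_below(band):
    """Percentage of all candidates placed in the given band or a lower one."""
    return sum(band_share(b) for b in band_counts if b <= band)


print(f"Band 2 only:     {band_share(2):.2f}%")
print(f"Band 3 or below: {share_at_or_below(3):.2f}%")
print(f"Band 4 or above: {100 - share_at_or_below(3):.2f}%")
```

On these counts, roughly 39% of candidates fall in Band 2 and roughly 89% fall in Band 3 or below, leaving about 11% in the upper bands.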

Many studies have been conducted to understand the factors which contribute to

students’ poor performance in English. A study conducted by Hamzah and Abdullah

(2009) found that ESL learners are unable to use the language because of a lack of

learning strategies. The results showed that the respondents, who were

ESL learners in institutions of higher learning, could not master the

language without proper training in metacognitive strategies in their ESL learning.

Other possible reasons for this problem are factors such as attitude, perception and

environment (Kaur & Thiyagarajah, 1999; Jalaluddin, Awal & Bakar, 2009). The

researchers revealed that social embarrassment fuelled the hesitation to use the

language. This means that students hesitate to practise the language and are more

comfortable communicating using their mother tongue. Moreover, Jalaluddin et al.

(2009) added that differences in language structures act as a barrier to acquisition of the

second language. The study demonstrated that structural differences between the first

language (i.e. Malay) and the target language (i.e. English) can inhibit mastery of the

language.

1.3 AIM / PURPOSE OF THE STUDY

A review of literature has indicated that the possible reasons for Malaysian students’

lack of English proficiency have been the object of numerous academic inquiries. The

emphasis has primarily been on extraneous variables such as students’ perception and

attitude, social environment and linguistic factors. It appears that these extraneous

variables are hindrances to Malaysian students mastering the language and eventually

this affects their performance in a language test, in this case MUET (Kaur &

Thiyagarajah, 1999; Jalaluddin et al., 2009; Hamzah & Abdullah, 2009).

Unfortunately, studies that concentrate on the influence of the actual test items on the

difficulty level of a particular test have been minimal. To the best of the author’s knowledge, no

published or unpublished study has undertaken a comprehensive exploration

of these psychometric issues in the Malaysian context; analysis of this problem

has been hampered by restricted access to the test data.

Results from previous administrations of MUET indicate that many test-takers

struggle with the reading comprehension test. There is concern that

Malaysian students and graduates lack reading comprehension skills (Sarudin &

Zubairy, 2008). Malaysian university graduates also have been criticised for lacking

general reading skills to perform effectively at the workplace. Of the four components

in MUET, reading comprehension has been given the highest weighting, i.e. 40% of the

total score. This clearly shows that the Malaysian educational policy is concerned with

equipping students with reading skills to engage successfully in tertiary education. This

is due to the fact that in the second language learning context, reading is perceived as a

prominent academic skill for university students. Carrell (1988) acknowledges that:

“In second language teaching/learning situations for academic purposes, especially in higher education in English-medium universities, or other programmes that make extensive use of academic reading materials written in English, reading is paramount. Quite simply, without solid reading proficiency, second language readers cannot perform at levels they must in order to succeed...”

(Carrell, 1988, p.1)

It is through reading that learners are exposed to new information and are able to

interpret, evaluate and synthesize the course content. Yet many students

who enrol in higher learning institutions are unprepared for the reading demands of

academic life. Poor performance on MUET can be seen as one of the indicators of this

problem. Factors such as poor reading strategies, low interest in reading English

materials and poor reading habits are mentioned by researchers as the causes of reading

problems for Malaysian ESL learners (Ramaiah & Nambiar, 1993; Abdul Majid, Jelas

& Azman, 2002; Ibrahim, 2005, 2006). Clearly, reading comprehension is seen as

the key to unlocking success and thus warrants particular attention, especially in the

ESL context.

As mentioned earlier, a national standardized high stakes examination like MUET plays

a vital role in assessing Malaysian students’ academic achievement. MUET is used as a

means of entry to undergraduate courses in public universities. It is essential therefore

that the test is of high quality. Previous studies of poor achievement in the English

language test have focused on extraneous factors. In the

Malaysian context, empirical research that focuses on the psychometric properties of the

test at the item level has yet to receive due attention. The main purpose of the present

study is to address this gap and to examine the psychometric properties of the test (the

quality of the test) as well as to investigate the role played by test item characteristics as

contributing factors to the difficulty level of MUET, particularly the reading

component.

Pumfrey (1976) and Schwartz (1984) noted that the most important

characteristics of a good reading test are validity, reliability and practicality. The first

two characteristics are relevant to the present study. Therefore, for the purpose of this

research, it is necessary to investigate the quality of individual items in order to examine

the reliability and validity of this MUET reading test. This is because test developers

have recognized that reliability and validity of test scores are contingent upon the

quality of the test items (Reynolds, Livingston & Wilson, 2009). Logically, as the

quality of the individual items improves, the overall quality of the test also improves.

This process of item analysis is viewed as the key to the development of a successful

test as it provides insights about the pattern of students’ response to an item and the

relation of the item to the overall performance (Nunnally & Bernstein, 1994). The item

analysis in this research utilized the two analytic procedures that are commonly used in

test development and validation, namely traditional or standard item analysis within the

framework of Classical Test Theory (CTT) and the Rasch model, one of the models of

Item Response Theory (IRT).
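
To make the two frameworks concrete, the following minimal sketch computes two classical item statistics and the Rasch success probability. It is an illustration under stated assumptions, not the software or the exact indices used in this study: the response matrix is hypothetical, the function names are invented for the example, and discrimination is computed as a point-biserial correlation with the rest score, which is only one of several common definitions.

```python
import math
from statistics import mean, pstdev

# Hypothetical scored responses (not MUET data): one row per examinee,
# one column per item, 1 = correct and 0 = incorrect.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]


def item_facility(data, item):
    """CTT item facility: proportion of examinees answering the item correctly."""
    return mean(row[item] for row in data)


def item_discrimination(data, item):
    """Point-biserial correlation between the item score and the rest score
    (total score on all other items). Assumed definition for this sketch."""
    item_scores = [row[item] for row in data]
    rest_scores = [sum(row) - row[item] for row in data]
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest_scores)
    if sd_item == 0 or sd_rest == 0:
        return float("nan")  # undefined when either score does not vary
    m_item, m_rest = mean(item_scores), mean(rest_scores)
    covariance = mean(
        (x - m_item) * (y - m_rest) for x, y in zip(item_scores, rest_scores)
    )
    return covariance / (sd_item * sd_rest)


def rasch_probability(ability, difficulty):
    """Rasch model: probability of success given person ability and item
    difficulty, both expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


for item in range(3):
    print(item, item_facility(responses, item), item_discrimination(responses, item))
print(rasch_probability(ability=0.0, difficulty=0.0))
```

Under the Rasch model an examinee whose ability equals the item difficulty has a 0.5 probability of success, which is why person and item parameters can be placed on the same logit scale.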

The second aim of this research is to explore the influence of item features on the

difficulty of reading comprehension items. The investigation focused on several

characteristics of test items, such as the type of question, length of the passage,

plausibility of distractors and number of alternatives, as potential contributors to item

difficulty. In this study, a regression analysis was conducted to investigate the

relationship between the selected item features/characteristics and the difficulty level of

the reading comprehension items.
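
The logic of such an analysis can be sketched with a single predictor. The example below is a simplification under stated assumptions: the data are invented, the coding of the feature (a count of plausible distractors per item) is hypothetical, and only one feature is fitted by ordinary least squares, whereas the analysis in this study considers several features.

```python
from statistics import mean


def simple_ols(x, y):
    """Ordinary least squares for one predictor.
    Returns (slope, intercept, r_squared)."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_residual / ss_total


# Hypothetical item-level data (not MUET data): number of plausible
# distractors per item and the item's difficulty estimate in logits.
plausible_distractors = [0, 1, 1, 2, 2, 3]
item_difficulty = [-1.0, -0.5, 0.0, 0.5, 0.5, 1.5]

slope, intercept, r_squared = simple_ols(plausible_distractors, item_difficulty)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

A positive slope in this toy data set would mean that items with more plausible distractors tend to be harder; the proportion of variance explained indicates how much of the spread in item difficulty the feature accounts for.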

Another purpose of this research is to examine the effect of students’ background

characteristics on their responses to items of the MUET reading test. Differential Item

Functioning (DIF) analysis was utilised to explore the extent to which background

variables such as gender, geographical location and ethnicity are likely to affect students’

responses. In the case of a high-stakes examination, DIF analysis is important because the

plurality of Malaysian society should be taken into consideration in the construction of any

test item. Standardized tests, which, by definition, give all test-takers the same test

under the same (or reasonably equal) conditions, should ensure fairness regardless of

race, socioeconomic status, or other considerations.
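
The idea behind DIF can be illustrated with a simple screening statistic. The sketch below uses the Mantel-Haenszel common odds ratio, a widely used alternative technique, and not the Rasch-based procedure adopted in this study; the group labels, records and function name are hypothetical. Examinees are matched on total score, and within each score level the odds of answering the studied item correctly are compared between a reference group and a focal group.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records, item):
    """Common odds ratio for one item across total-score strata.

    records: iterable of (group, responses), where group is "ref" or "focal"
    and responses is a list of 0/1 item scores. A value near 1 suggests no
    DIF; a value above 1 suggests the item favours the reference group.
    """
    # Per stratum: [ref correct, ref incorrect, focal correct, focal incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, responses in records:
        cell = strata[sum(responses)]  # match examinees on total score
        correct = responses[item] == 1
        if group == "ref":
            cell[0 if correct else 1] += 1
        else:
            cell[2 if correct else 3] += 1
    numerator = denominator = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator if denominator else float("nan")


# Hypothetical records (not MUET data): all examinees have a total score of 1.
no_dif = [("ref", [1, 0, 0]), ("ref", [0, 1, 0]),
          ("focal", [1, 0, 0]), ("focal", [0, 1, 0])]
with_dif = [("ref", [1, 0, 0]), ("ref", [1, 0, 0]), ("ref", [0, 1, 0]),
            ("focal", [1, 0, 0]), ("focal", [0, 1, 0]), ("focal", [0, 1, 0])]

print(mantel_haenszel_odds_ratio(no_dif, item=0))
print(mantel_haenszel_odds_ratio(with_dif, item=0))
```

Matching on total score is what separates DIF from a real difference in ability: two groups may differ in overall performance without any single item functioning differently for examinees of the same ability.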

1.4 RESEARCH QUESTIONS

Based on the above discussion, this research is designed to address the following

questions:

1. How are the items distributed in terms of their difficulty relative to the ability of the

students?

2. How good are the items of the MUET reading test?

3. To what extent do the selected features of the test items contribute to the item

difficulty in the MUET reading test?

4. Is there any differential item functioning in the MUET reading test in terms of

gender, geographical location and ethnicity?

1.5 SIGNIFICANCE AND LIMITATION OF THE STUDY

It is hoped that this study will meet its objectives as mentioned earlier. Furthermore, it is

intended to provide sound information to the Ministry of Education generally, and to the

concerned divisions, particularly the Malaysian Examination Council and the Malaysian

Examination Syndicate. The hope is that the findings of this research will help

these institutions to implement quality control measures on their examination materials.

Information gained from this research can serve as a guide for those individuals who are

actively involved in the design and construction of test items, as it will provide a better

understanding of measurement complexity. More specifically, the findings from item

analysis of this study can be used by the Malaysian Examination Council test

constructors to design new sets of items for a more defensible reading test in MUET.

Item analysis indeed is valuable in improving items which will be used again in later

tests. It can also be used to eliminate ambiguous or misleading items. Popham (2000)

suggested that in large-scale test development, empirical item improvement through

item analysis should be given a major emphasis. It is this kind of empirical analysis that

facilitates the revision of test items.

In addition, the study of test feature effects and DIF can contribute to the

understanding of the effect of item features and examinee background characteristics on

the construct the test is intended to measure. The results on the effects of individual item

features on the difficulty level of this reading test will help test writers to balance the

contents of particular features in the development of test items. The DIF analysis,

furthermore, will guide the test designers to control the possible causes of differences between

the groups being compared. Investigating test characteristic effects and DIF helps test

evaluators to ensure test validity (Osterlind & Everson, 2009; Camilli & Shepard, 1994)

and to make decisions on the interpretation of a test score.

Due to time constraints, this research examines the factors affecting the item difficulty

of the reading test only. Thus, the findings of the study cannot be taken to represent the overall

performance and quality of MUET, which also covers three other language skills:

listening, speaking and writing. In addition, the statistical analysis generated from the

data relies only on the features of the model used, the Rasch model. It does not deal with

the other derivations of CTT (e.g., generalizability theory) or other IRT models (e.g., the

two-parameter and three-parameter models).

Furthermore, the findings of this study cannot be generalized to the whole

population because the participants are limited to candidates of MUET-END 2009

from two states only: Sabah and the Federal Territory of Kuala Lumpur.

1.6 STRUCTURE OF THE THESIS

This thesis is divided into five chapters.

Chapter 1 provides introductory information for the study. The importance of the English

language in the Malaysian setting and the implementation of MUET for pre-university

students are described. The problems of low English proficiency among Malaysian

students are also discussed. This chapter introduces the aims of the study and the

research questions which need to be addressed in this study.

Chapter 2 outlines a review of literature on the topics of interest in this study. First, it

highlights the psychometric properties of CTT and the Rasch model for the item

analysis. It explains the item facility, item discrimination, reliability and fit index,

which are used to check the quality of the overall test. Second, it looks at the types of

test characteristics that researchers have examined for their influence on

the item difficulty of a test. Third, the meaning of DIF and its relation

to bias are discussed. Finally, this chapter reviews the previous research related

to this study.

Chapter 3 describes the methodology utilized for this research. This includes the

description of the data/sample and the materials. This section also outlines the three

phases which are conducted in order to investigate the answers of the research

questions. The three procedures involved in the study are:

Item analysis of CTT and the Rasch model

Coding of individual items and regression analysis

DIF analysis

Chapter 4 presents the findings of the study. The first section describes the results of

the CTT and the Rasch item analysis. Next, the findings of the investigation on the six

predictors of item difficulty are discussed. The last part of this chapter reports the extent

to which the DIF indicators (gender, ethnicity and state) influence students’ response to

an item.

Chapter 5 gives the main conclusions from the findings of the study by providing

answers to the research questions. Implications of the findings for teaching and testing

MUET and further research are also given in this chapter.

CHAPTER 2: A REVIEW OF LITERATURE

2.1 INTRODUCTION

In the context of second language acquisition, reading is by far the most important skill

to be learnt (Carrell, 1988). Certainly, many learners of the English language find

themselves engaged in reading most of the time in order to master the language. In

addition, the ability to read is a central asset in today’s modern, technologically-oriented

world. Numerous research findings have shown strong links between reading

proficiency and success in educational contexts at all ages; from the primary school to

university level (Adamson, 1993; Collier, 1989). In higher educational institutions that

make extensive use of academic materials written in English, reading is arguably the

basic foundation on which academic skills of the individual are built. In academia, most

subjects taught are based on a simple process – read, synthesize, analyze and process

information. Simply put, students’ performance at tertiary level is contingent upon their

reading proficiency.

Recognizing the importance of reading as a part of academic literacy, it is no surprise

that there have been many attempts to measure reading skills. Students’ reading ability

is frequently assessed using standardized tests. Today, there are dozens of commercial

reading tests, and for the purpose of English as a foreign/second language assessment,

the most frequently used tests are TOEFL and IELTS, which can be used as a means to

determine the attainment in or attitude towards reading (Pumfrey, 1976). These tests are

assumed to gauge reading ability which requires the test-takers to read various types of

passages and to respond to questions about the passage. The nature of the multiple-

choice format, which characterizes many standardized tests, provides an objective way

to determine the correct and incorrect responses. Many educators and researchers favour

this type of standardized test mainly due to its practicality (Brown, 2004).

2.2 ASSESSMENT OF READING COMPREHENSION

Devine (1989) defines reading comprehension as:

“Reading comprehension is the process of using syntactic, semantic, and rhetorical information found in printed texts to reconstruct in the reader’s mind, using the knowledge of the world he or she possesses (plus appropriate cognitive and reasoning abilities when necessary), a hypothesis or personal explanation that may account for the message that existed in the writer’s mind as the printed text was prepared.”

(Devine, 1989, p.120)

The above definition of reading comprehension implies that reading is a dynamic and

complex process (Pumfrey, 1976; Alderson, 2000; Devine, 1989; Carrell, 1988;

Schwartz, 1984). This means that readers are involved in an active process to construct

meaning from print or writing. The interactive nature of reading, nonetheless, poses

challenges to the test design of reading skills. Ample studies have demonstrated that

reading assessment has particular complexities (Weaver & Kintsch, 1991; Klapper,

1992) due to the complex and active interactions between reader, text and task.

The first major challenge is that reading comprehension involves dynamic and

multi-component processes (Fletcher, 2006; Snow, 2003). Readers use a variety of reading

strategies to decipher the meaning of a written text. For example, readers may use

semantic, syntactic and contextual clues to make sense of the meaning of unknown words.

They may also use various cognitive skills such as inferring, reasoning, predicting,

comparing and contrasting to draw conclusions from their interpretation of the text.

Readers also need to integrate the words they have read with their prior knowledge,

experience, attitude and language (in the case of second/foreign language context, this

refers to the interference of first language into the reading process). These complicated

activities pose an essential question: how do test constructors decide which aspect of

reading to measure? It appears that the complexity of the cognitive processes used to derive

meaning challenges test designers to measure accurately the many skills required for

a particular reading test.

Alderson (2000) and McKay (2006) supported the above notion and confirmed that

reading is both process and product. The process of reading is a reader-text interaction

which involves many different things that are going on when a reader reads. The

product of reading is comprehension or construction of meaning; that is, the

understanding of what has been read. Both need to be assessed. Alderson asserted that

any variable that has impact on either reading process or its product needs to be taken

into account in test design and test validation. He also noted that assessing the process

of reading can be a challenging task for educational practitioners.

Additionally, Pumfrey (1976) pointed out that “reading is characteristically

developmental” (p.11). This suggests that the skills required by young readers

inevitably differ from adult learners especially those at the tertiary stages of education.

Thus, the relative importance of particular reading skills at a given stage should be a

prime concern in designing items for reading assessment.

Another challenge of reading assessment is that, like listening, it is often associated with

the measurement of other skills (McKay, 2006). For instance, judgement of a student’s

reading ability is observable through speaking or writing. Therefore, care needs to be

taken so that assessment of reading will not be ‘contaminated’ by other skills. In regard

to the integration of reading with other language skills, its unobservable nature calls for

the assessment to be carried out by inference (Brown, 2004). As a result, this leads to a

challenge in the justification or interpretation of the test. The irony here is that the

interpretation of testing comprehension of receptive skills has become a controversial

issue due to the reality that different readers infer from or interpret a written text in

different ways (Allison, 1999).

The preceding challenges, nonetheless, lead to another crucial issue in reading

assessment, that is, construct validity. As described in the Standards for Educational

and Psychological Testing (AERA, APA, & NCME, 1999), validity refers to “the

degree to which evidence and theory support the interpretations of test scores entailed

by proposed uses of tests” (p. 9) and a construct is defined as “the concept or

characteristic that a test is designed to measure” (p. 173). In its simplest terms, construct

validity refers to multiple sources of evidence supporting or refuting the accurate

interpretation of test score (Messick, 1995).

Leading scholars of language testing have identified two major threats to score validity:

construct underrepresentation and construct-irrelevant variance. Messick (1996)

asserted that the validity of the test is affected by an inadequate or incomplete sampling

of the construct (construct underrepresentation) and the measurements of ‘things’ that

are simply not relevant to a construct (construct-irrelevant variance).

It has been repeatedly noted that any source of construct-irrelevant variance may lead

to incorrect inferences about the test takers, and therefore, diminish validity (McNamara,

2000; Alderson, 2000).

It is clear that the construction of a reading assessment requires a series of decisions. It is a

demanding task for test writers to decide what skills to measure, how to measure them

and how to interpret the test score. Despite its challenges, it has been acknowledged that

assessment of reading plays a crucial role in educational practice and research. It is

claimed that reading test results can be used as an indicator for evaluation of various

approaches to the teaching of reading (Pumfrey, 1976; Schwartz, 1984) and for

improvement of reading comprehension ability (Snow, 2003). This is because of the

positive washback⁴ of reading assessment that provides strategies for researchers and

teachers to identify and diagnose reading comprehension problems in students. The

importance of reading assessment thus makes research on comprehension

assessment paramount. In his introductory note to Alderson’s (2000) book, Bachman

recognized that “reading, through which we can access worlds of ideas and feelings, as

well as the knowledge of the ages and visions of the future, is at once the most

extensively researched and the most enigmatic of the so-called language skills” (p. x).

⁴ Generally, washback refers to the effect of testing on the process of teaching and learning. Bachman and Palmer (1996) consider washback to be a subset of a test’s impact on a larger context: the educational system and society.

2.3 PSYCHOMETRIC ITEM ANALYSIS OF THE MUET READING TEST

Both sets of stakeholders, teachers and students, perceive MUET as a high stakes test.

Due to its significance as a mandatory requirement for admission into public

universities, it is, therefore, essential to assess the reliability and validity of this test. In

other words, as part of evaluation practice, it is fundamental to review test items after

they have been constructed or administered.

The process of evaluating the effectiveness of individual items in a test is called item

analysis. It is normally conducted for the purpose of item selections in the construction

and revision phases of the test. In addition, it is also performed to investigate how well

the items are working with a target group of students. Nunnally and Bernstein (1994)

highlighted that item analysis is extremely useful as it furnishes important information

on how examinees respond to each item and how each item relates to the overall

performance of the test. In this study, all 45 items of the MUET reading test are

scrutinized for statistical analysis using the framework of CTT and the Rasch model.

It should be emphasized here that the purpose of this paper is not to compare the two

approaches; but to demonstrate how they complement each other as a tool for

educational assessment. The discussion of psychometric characteristics of CTT and the

Rasch model in this section will necessarily be an overview, without extensive recourse

to the mathematical equations of the concerned properties, and the contentious

arguments about which particular approach is superior.

2.3.1 Classical Test Theory (CTT)

CTT is derived from a relatively simple assumption. CTT statistics are based on the

total scores on a test. It assumes that total scores, typically defined as the number of

correct responses, serve as the sole indicator of a person’s level of ability or knowledge

(de Ayala, 2009). Obviously, in CTT, the examinee’s attained score on the whole test is

the unit of focus. Hambleton and Jones (1993) acknowledge that the major advantage of

CTT is its “relatively weak assumptions”, which makes it easy to apply in many testing

situations. It is considered to be “weak” because the above assumption is likely to be

met by the data.

Within this theoretical framework, it is postulated that the score obtained by an

individual is made up of two facets; a true score and a random error (de Ayala, 2009;

Hambleton & Jones, 1993). The theory concludes that the observed score is a function

of the true score plus the random error. The relationship between the three components

is written as equation (2.1):

X = T + E        (2.1)

where

X is the observed score

T is the true score

E is the error score

CTT theorizes that each person has a true score. It is defined as the mean

score that he or she would obtain on parallel tests administered over an infinite number of

testing sessions (Hambleton & Jones, 1993; de Ayala, 2009; Lord, 1980).
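The idea of a true score as a long-run average can be illustrated with a small simulation. The sketch below is only an illustration under assumed conditions: the true score of 30, the normally distributed error with a standard deviation of 3, and the function name simulate_observed_scores are hypothetical choices and are not taken from the MUET data.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_sessions, seed=0):
    """Return observed scores X = T + E over repeated parallel administrations.

    The error E is assumed (for illustration only) to be normal with mean 0.
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_sessions)]


# Hypothetical examinee: true score T = 30, error standard deviation 3.
scores = simulate_observed_scores(true_score=30.0, error_sd=3.0, n_sessions=10_000)

# With many sessions the random errors average out, so the mean of the
# observed scores approaches the true score.
mean_observed = statistics.fmean(scores)
print(round(mean_observed, 2))
```

Any single observed score may be several points away from 30, but the average over many sessions settles close to it, which is what the definition of the true score expresses.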

In the next section, the major features of CTT in evaluating the quality of items in a test

are outlined.

Item Facility Index

In traditional analysis, item facility index is used to describe the difficulty of an item. It

is normally determined from the proportion of the total group selecting the correct

answer to that question. Psychometricians (e.g., Barnard, 1999; Baker, 1989; Reynolds et

al., 2009) define difficulty index, also known as p-value, as the percentage of the group

who answered the item correctly.

The range of the facility index is from 0% to 100%, or more typically written as a

proportion from 0.00 to 1.00. A value of p = 100% indicates that all the students selected the

correct answer and so that item is very “easy”. A value of 0 indicates that none of the

students selected the correct answer and so that item is very “difficult”. Simply put, the

higher the p-value, the easier the item.
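The calculation of the facility index can be sketched in a few lines. The response matrix below is made up for illustration (five examinees, four items, 1 = correct and 0 = incorrect), and the function name item_facility is a hypothetical label.

```python
def item_facility(responses):
    """Return the proportion of examinees answering each item correctly.

    responses: list of rows, one row per examinee, each row a list of 0/1 scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_examinees for i in range(n_items)]


# Hypothetical scored responses: rows are examinees, columns are items.
responses = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]

p_values = item_facility(responses)  # [1.0, 0.6, 0.4, 0.0]

# Item positions ordered from easiest (highest p) to hardest (lowest p).
easiest_first = sorted(range(len(p_values)), key=lambda i: -p_values[i])
```

Here the first item (p = 1.0) is answered correctly by everyone and the last item (p = 0.0) by no one, matching the two extremes described above.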

Test constructors use this facility index to rank the items in a subtest (Baker, 1989).

Notably, it is common practice in a test construction that a subtest starts with an easy

item. From a motivational perspective, this is deliberately done to get students relaxed

and confident so their exam anxiety can be lowered.

Ebel and Frisbie (1991) concluded that the value of difficulty index reflects the content

of the item and the ability of the group responding to the item.

Discrimination Index

Item discrimination is another property used in CTT as a guiding principle to assess the

quality of test items. It refers to the ability of an item to differentiate among students on

the basis of how well they know the material being tested. It evaluates the extent to

which item responses discriminate between high achievers and low achievers.

The item discrimination index can take on negative values and can range between -1.00

and 1.00. A discrimination index value of 1 is considered a perfect positive

discriminator. A value of 0 means no relationship between score on this item and

overall score and so the question does not discriminate between the two sub-groups of

students. A high positive correlation is obtained for items that high-scoring students on

the test tend to get right and low-scoring students on the test tend to get wrong.

Such items are interpreted to be high in discrimination. Negatively discriminating items,

on the other hand, show the opposite relationship (Ebel & Frisbie, 1991). Obviously,

highly discriminating items are able to divide students into two subgroups: an upper group

and a lower group, or high achievers and low achievers.
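One way to obtain a discrimination index of the correlational kind described above is to correlate the score on an item with the total test score. The sketch below assumes this item-total correlation as the index; the data are invented, the function name item_total_correlation is hypothetical, and the total score here still includes the item itself (an uncorrected version).

```python
import math


def item_total_correlation(responses, item):
    """Pearson correlation between one item's 0/1 scores and the total scores."""
    item_scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    n = len(responses)
    mean_item = sum(item_scores) / n
    mean_total = sum(totals) / n
    cov = sum((i - mean_item) * (t - mean_total) for i, t in zip(item_scores, totals))
    var_item = sum((i - mean_item) ** 2 for i in item_scores)
    var_total = sum((t - mean_total) ** 2 for t in totals)
    if var_item == 0 or var_total == 0:
        return float("nan")  # undefined when everyone has the same score
    return cov / math.sqrt(var_item * var_total)


# Hypothetical data: item 0 is answered correctly only by the higher scorers,
# item 2 only by the lower scorers.
responses = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]

discrimination = [item_total_correlation(responses, i) for i in range(3)]
```

In this made-up example the first item discriminates positively and the third item discriminates negatively, which is the pattern that would flag the third item for review.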

Table 2.1: Classification of Discrimination Index (Ebel & Frisbie, 1991)

Discrimination Index    Item Description
0.40 and above          Very good item
0.30 to 0.39            Reasonably good
0.20 to 0.29            Marginal items, subject to improvement
Below 0.19              Poor items, rejected or revised for improvement

The classification of the discrimination index by Ebel and Frisbie (1991) in Table 2.1

depicts a general rule of thumb in interpreting the discrimination index. Obviously, the

higher the index, the better the item differentiates between high achievers and low

achievers.
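The rule of thumb in Table 2.1 can be written as a small lookup. This is a sketch only: the function name classify_discrimination is hypothetical, and because the table leaves values between 0.19 and 0.20 unassigned, the code assumes that anything below 0.20 counts as a poor item.

```python
def classify_discrimination(index):
    """Map a discrimination index to the Table 2.1 description.

    Assumption: values below 0.20 (including 0.19 to 0.20) are treated as poor.
    """
    if index >= 0.40:
        return "Very good item"
    if index >= 0.30:
        return "Reasonably good"
    if index >= 0.20:
        return "Marginal item, subject to improvement"
    return "Poor item, rejected or revised for improvement"


# Hypothetical indices for four items.
labels = [classify_discrimination(d) for d in (0.45, 0.33, 0.21, 0.05)]
```

Negative indices also fall into the last category, since an item that low scorers answer correctly more often than high scorers needs revision at the least.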

Reliability

Test reliability is the most significant feature of CTT and it is commonly stated as a

prerequisite that a test must attain a certain level of reliability to be considered of

sufficient quality for practical use (Adams, 2005; Nunnally & Bernstein, 1994;

McNamara, 1996). Therefore, in CTT, it is often employed to evaluate the overall

performance of the whole test. Theoretically, it can be defined as the proportion of

observed-score variance due to true-score variance (Ebel & Frisbie, 1991; Wright, 1999;

Crocker & Algina, 1986). In simple terms, the value of reliability indicates how

much of the variability in observed scores can be explained by the fact that examinees

differ from one to another in the trait being measured.
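The definition of reliability as a variance ratio can be checked with simulated scores, where the true scores are known by construction. Everything in this sketch is assumed for illustration: the normal distributions, the standard deviations of 8 (true scores) and 4 (errors), and the function name simulate_reliability. In real test data the true scores are not observed, so reliability has to be estimated by other means.

```python
import random
import statistics


def simulate_reliability(n_examinees, true_sd, error_sd, seed=0):
    """Return var(T) / var(X) for simulated scores with X = T + E."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_examinees)]
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


# Expected value under these assumptions: 8**2 / (8**2 + 4**2) = 0.8
reliability = simulate_reliability(n_examinees=20_000, true_sd=8.0, error_sd=4.0)
print(round(reliability, 3))
```

Increasing error_sd while holding true_sd fixed lowers the ratio, which mirrors the statement that larger measurement errors reduce reliability.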

Simply put, reliability is supposed to say something about the general quality of the test

scores in question. In theory, reliability coefficients hover in value from 0 to 1.

Nonetheless, in practice, it is difficult to obtain perfect reliability due to many random

errors affecting the consistency of the scores. The general idea is to try to minimize

these inevitable errors of measurement so that higher reliability can be achieved. Thus

the higher the reliability, the better the quality of the test is. High reliability means that

the questions of a test tend to ‘pull together’, so that students who answer a given

question correctly are likely to answer the other questions correctly. If another parallel test is

administered at a different time, the scores would not change much. Low

reliability, in contrast, shows that the questions seem unrelated to each other in terms

of who answers them correctly.

2.3.2 The Rasch Model

Unlike CTT, IRT is built on the key idea that there is an underlying latent

trait that is being measured (Hambleton, Swaminathan & Rogers, 1991). That is why

this approach is also known as latent trait theory to emphasize this idea. Under this

notion, the unobservable nature of the trait is manifested by the item responses and the

test score (Wu, 2010; Wu & Adams, 2008). This means that the performance of an

individual in a test is seen as the predictor of his or her ability level.

Following the above basic principle, the Rasch model, the one-parameter IRT model

developed by a Danish mathematician Georg Rasch (1960), focuses on the pattern of

item responses. It applies a mathematical function that specifies the probability of a

discrete outcome, such as a correct response to an item, in terms of person and item

parameters. In contrast to the traditional framework, the Rasch model is probabilistic in

nature (Bond & Fox, 2007; Hambleton et al., 1991; Henning, 1987). The model

assumes that the probability of success on an item is completely

determined by two values: an item difficulty δ and a person ability θ. This probability

function, known as the item characteristic curve (Hambleton et al., 1991; de Ayala, 2009),

can be expressed by equation (2.2):

\[ p = P(X = 1) = \frac{\exp(\theta - \delta)}{1 + \exp(\theta - \delta)} \tag{2.2} \]

where

p = P(X = 1) is the probability of a correct response

θ is the person-parameter

δ is the item-parameter, generally known as item difficulty

Equation (2.2) is given for the case of dichotomous⁵ data, where (X = 1) indicates success (the

correct response) on the item, and (X= 0) means failure (the incorrect response).
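The probability function of the Rasch model for dichotomous data can be sketched as a short function. The argument names theta (person ability) and delta (item difficulty) follow common usage in the Rasch literature and are assumptions here, as is the function name rasch_probability; both parameters are in logits.

```python
import math


def rasch_probability(theta, delta):
    """Probability of a correct response (X = 1) under the Rasch model.

    Algebraically equal to exp(theta - delta) / (1 + exp(theta - delta)).
    """
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


# A person whose ability equals the item difficulty has an even chance.
p_equal = rasch_probability(theta=0.5, delta=0.5)   # 0.5

# A more able person has a better than even chance on the same item,
# and a less able person a worse than even chance.
p_above = rasch_probability(theta=1.5, delta=0.5)
p_below = rasch_probability(theta=-0.5, delta=0.5)
```

Only the difference between ability and difficulty enters the function, so adding the same constant to both leaves the probability unchanged.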

The main elements of the Rasch model that need to be examined when analysing the

quality of a test are described below.

Item Difficulty and Person Estimate Ability

Initially, in the Rasch Model, person ability and item difficulty are conjointly estimated

and placed on a single numerical scale called logit (log odds unit). Person-parameter

and item-parameter are aligned on this scale where the probability of success is

routinely defined at 0.5 (Hambleton et al., 1991; Bond & Fox, 2007; Lord, 1980). That

is, a person’s ability measure is set at the point where he or she has a 50 percent chance

to either succeed or to fail. The logit scale is expressed according to an interval scale

where mean and standard deviation are arbitrary (Nunnally & Bernstein, 1994; Bond &

Fox, 2007). Therefore, the estimations of item-parameter and person-parameter are

about relative estimation, not an absolute measure.
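The link between the logit scale and the 50 percent point can be made explicit with a short derivation. It assumes the standard dichotomous Rasch form, with θ denoting person ability and δ denoting item difficulty (the symbols are assumptions where the text does not show them).

```latex
\[
p = \frac{\exp(\theta - \delta)}{1 + \exp(\theta - \delta)}
\quad\Longrightarrow\quad
\ln\!\left(\frac{p}{1 - p}\right) = \theta - \delta .
\]
\[
\theta = \delta
\;\Longrightarrow\;
p = \frac{\exp(0)}{1 + \exp(0)} = 0.5 ,
\qquad
\theta - \delta = 1
\;\Longrightarrow\;
p = \frac{e}{1 + e} \approx 0.73 .
\]
```

The log odds of success therefore equal the distance in logits between the person and the item, which is why only relative positions on the scale carry meaning.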

[Figure 2.1: Location of Person and Item Parameter (Wu & Adams, 2008). The figure shows a Task Difficulties continuum running from basic knowledge to advanced knowledge, with tasks numbered 1 to 6 and the location of a student marked on the same scale.]

⁵ A dichotomous item is an item with only two response categories: correct (1) and incorrect (0).

Figure 2.1 depicts the item-person map which constitutes an important difference

between Rasch measurement and CTT. As can be seen, the locations of the items and

the locations of students are calibrated on the same continuum. The upper end of the

continuum indicates greater ability level than the lower end. This suggests that items

located at the upper end require students to have advanced knowledge to answer the

items correctly. On the other hand, items at the lower end are assumed to deal with

questions of basic knowledge. It implies that these items are easy items because the

probability of students responding correctly to them is higher than those at the upper

end.

Item Discrimination

Under the Rasch model, item discrimination is called ‘equal discrimination’ or ‘equal

slope parameter’. The model assumes that all items have the same discriminating power

in measuring the latent variable of the object (Hambleton et al., 1991). This technical

property describes how well an item can differentiate between examinees having

abilities below the item location and those having abilities above the item location. The

assumptions of same discrimination index across items can be examined through the

ICC plot (Wu, 2010; Woods & Baker, 1985) and the mean square (MNSQ) statistics

(Wu & Adams, 2008).

The concept of discrimination is illustrated in Figure 2.2 and Figure 2.3. As seen in

Figure 2.2, the steeper the curve the better the item can discriminate. It is also noted that

the MNSQ statistic is less than 1. On the other hand, Figure 2.3 demonstrates a poorly

discriminating item with a flatter curve and an MNSQ statistic that is greater than 1.

Figure 2.2: High Discriminating Item with MNSQ 0.83

Figure 2.3: Poor Discriminating Item with MNSQ 1.25

Fit Statistics

In any application of the IRT model, it is important to assess to what extent the IRT

model assumptions are valid for the given data and how well the testing data fit the IRT

model selected for use in a particular situation (Wright, 1999; Wu & Adams, 2008;

Hambleton et al., 1991). Item fit statistics are used to show how different an item is

with respect to the rest of the items used. This implies that the evaluation of fit guides

the test constructors to identify the items that do not fit the whole set of data. Misfitting

items may indicate poor item construction. In this sense, they are similar to the poorly

discriminating items in CTT. They are considered to be ‘problematic’ items and need

revision. Research by Perkins and Miller (1984) found that the Rasch model detected

more misfitting items than CTT and this allowed them to plot reading items at their

calibrated positions along the continuum of item-person map.

A general guideline for fit statistics suggests that an item fits relatively well if it meets these

two criteria (Wu & Adams, 2008):

a) The MNSQ statistic should be close to 1 (depending on the range of the chi-square)

b) The t statistic should be within the range -2 to +2
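To give a feel for how a mean-square fit statistic behaves, the sketch below uses one common residual-based formulation for a single dichotomous item: the unweighted version averages the squared standardized residuals, and the weighted version divides the sum of squared residuals by the sum of the model variances. This is an assumption made for illustration; the software used in a given study may compute its fit statistics differently, and the abilities, difficulty and function names are hypothetical.

```python
import math


def rasch_probability(theta, delta):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def item_mean_squares(item_scores, abilities, difficulty):
    """Return (unweighted, weighted) mean-square fit for one dichotomous item."""
    squared_residuals = []
    variances = []
    for score, theta in zip(item_scores, abilities):
        p = rasch_probability(theta, difficulty)
        squared_residuals.append((score - p) ** 2)  # observed minus expected
        variances.append(p * (1.0 - p))             # model variance of the score
    n = len(squared_residuals)
    unweighted = sum(r / v for r, v in zip(squared_residuals, variances)) / n
    weighted = sum(squared_residuals) / sum(variances)
    return unweighted, weighted


# Hypothetical abilities (logits) for four examinees and an item of difficulty 0.
abilities = [-2.0, -1.0, 1.0, 2.0]

# Low-ability examinees fail and high-ability examinees succeed.
expected_pattern = item_mean_squares([0, 0, 1, 1], abilities, difficulty=0.0)

# The reverse: low-ability examinees succeed and high-ability examinees fail.
reversed_pattern = item_mean_squares([1, 1, 0, 0], abilities, difficulty=0.0)
```

The first pattern yields mean squares below 1 (responses more predictable than the model expects), whereas the reversed pattern yields values well above 1, the direction associated with poorly fitting items.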

It is worth pointing out here that the MNSQ statistic is sample dependent (Wu & Adams,

2008). That is, if the sample is large, then the MNSQ statistic tends to be close to 1.

Nonetheless, fit t statistic takes sample size into account (Keeves & Alagumalai, 1999).

This signals that if the items do not fit the model, a large sample size will lead to a very

significant t statistic. In a study conducted by Athanasou and Lamprianou (2004) with a

sample size of 270 students, no items had fit statistics outside the rule-of-thumb

range. In another study, the Rasch fit analysis revealed many misfitting items in

data sampled from the responses of 2485 students (Zubairi & Abu Kassim, 2006).

Because of the above situations, it is difficult to set an absolute range of value for an

acceptable item fit. The two statistics could vary considerably from dataset to dataset.

Wu and Adams (2008) emphasize that the fit index indicates relative difference, not an

absolute measure of fit because there are many factors that can affect the assumptions of

the Rasch model. Therefore, the procedure to apply fixed limits for fit indices should be

treated with caution.

A review of literature indicates that IRT models can supplement classical methods as

tools for measurement and assessment. Earlier research by Woods and Baker (1985),

Zubairi and Abu Kassim (2006) and Henning (1984) provided support for the

complementary nature of CTT and the Rasch model. Other studies demonstrate a

practical application of the Rasch model as a measurement strategy to assess and

evaluate language testing and assessments (see Henning, Hudson & Turner, 1985;

Perkins & Miller, 1984; Athanasou & Lamprianou, 2004).

2.4 THE EFFECTS OF TEST FEATURES AND EXAMINEE’S CHARACTERISTICS ON ITEM DIFFICULTY OF READING TEST

In the field of language testing, it has been identified that the performance of students

on a comprehension test is a result of interaction between reader, text and test. Bachman

and Palmer (1996) highlighted that the characteristics of the tasks and the characteristics

of individuals that affect both language use and language test performance should be of

central interest in designing any language test. Alderson (2000) also agreed that

characteristics of both reader and test will affect the reading assessment.

Based on the correspondence between reader, text and test, one of the main concerns in

reading comprehension research has been to estimate how far test features, whether

related or unrelated to the construct being

measured, and test-taker characteristics contribute to reading performance. Moreover, the

identification of those factors is a prime concern in language testing to achieve a

construct validation result (Bachman, 1990; McNamara, 1996). This is to ensure that the

test is not affected by threats which may influence the performance of the students.

Acknowledging the importance of the dynamic interaction between reader, text and test,

the present study aims to identify the test features and individual characteristics that

contribute to the item difficulty of the MUET reading comprehension. For the first part

of the study, several test features have been identified to investigate the effect of these

characteristics on item difficulty. The second part of the study comprises a statistical

analysis of test items, differential item functioning (DIF), in order to understand the

relationship between students’ characteristics and test items.

2.4.1 Item Test Characteristics

The investigation of the effect of item test characteristics on the difficulty of reading

comprehension items often involves an analysis of several kinds of variables, including passage

features, question type features and question format variables.
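A regression of item difficulty on a coded item feature can be sketched as follows. This is a deliberately reduced illustration with a single predictor and made-up numbers: the feature (number of plausible distractors per item), the difficulty values in logits and the function name simple_linear_regression are all hypothetical, and a real analysis would normally include several predictors at once.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical coding: number of plausible distractors for each of four items.
plausible_distractors = [0, 1, 2, 3]

# Hypothetical item difficulties in logits for the same four items.
item_difficulty_logits = [-1.0, -0.5, 0.0, 0.5]

intercept, slope = simple_linear_regression(plausible_distractors, item_difficulty_logits)
```

With these invented values each additional plausible distractor is associated with an increase of half a logit in item difficulty; the slope plays the role of the estimated effect of the feature.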

As previously mentioned, construct validity is a fundamental issue that should be taken

into serious consideration in any educational assessment (Messick, 1995; Osterlind &

Everson, 2009; McNamara & Roever, 2006; Camilli & Shepard, 1994; de Ayala, 2009).

Therefore, it is important here to differentiate the variables under consideration into two

subgroups: construct-relevant and construct-irrelevant variance.

Alderson (2000) has asserted that constructs of reading are those “variables that have an

impact on either the reading process or its products” (p. 120). He further exemplifies

those factors such as text variables, linguistic features, reader’s background,

subject/topic knowledge and a range of relevant reading skills and strategies as

important components of reading constructs. It is obvious that the first two variables –

passage features and question type features – are included as the constructs of reading.

On the other hand, construct-irrelevant variance refers to the situation where a test

gauges proficiencies irrelevant to the intended construct. Alderson (2000) identified that

overemphasis on metacognition and metalinguistic knowledge, as well as readers’

motivation and emotional state, are likely to underrepresent the construct of reading. He

also has explicitly highlighted that the test method is one of the sources that can

contaminate the constructs of reading assessment. Evidently, the last variable used in

this study – question format – is a source of construct-irrelevant variance.

Passage Feature Variables

Several features of text have been investigated to predict the difficulty of the reading

test. The most typical variable for passage feature is the type of the text. In the

Progressive Achievement Test in Reading (known as PAT-R), a reading comprehension

test used in Australian schools, five text types are used: narrative, factual, expository,

tabular or graphical and procedural (Stephanou, Anderson & Urbach, 2008). In other

research regarding the type of text, Carr (2006) classified passage variables into arts,

humanities, social science, life science and earth/physical science. He also divided

passage variables into rhetorical features, propositional content, cohesion and focus

constructions.

Another commonly used text feature is the length of the passage. This feature is coded

according to the number of paragraphs and lines each passage contains (Scheuneman &

Gerritz, 1990). It is assumed that passages with more sentences or words are potentially

more difficult to comprehend.

Other studies have indicated a growing interest in examining newer passage features

known to affect reading comprehension, including the propositional

density of the text (Ozuru, Rowe, O’Reilly & McNamara, 2008), word frequency (Ozuru

et al., 2008; Davey, 1988; Scheuneman & Gerritz, 1990; Drum, Calfee & Cook, 1981),

concreteness (Davey, 1988) and proportion of clauses (Davey, 1988).

Question Type Variables

Researchers have used a variety of categories for the classification of question type.

Generally, the levels of cognitive operation described in Bloom’s taxonomy have

dominated question-type classification in reading tests and other language tests

(Davey, 1988; Davey & Macready, 1985; McKenna & Stahl, 2009). These questions

range from items requiring simple retrieval of information in the passage to those

involving higher levels of inference and reasoning.

Pearson and Johnson (1978) classified question type into three simpler categories:

textually explicit, textually implicit and script-based (Alderson, 2000; Davey, 1988).

Textually explicit questions are those items having both the question information and

the correct answer in a single sentence. Textually implicit questions, on the other hand,

require the examinee to locate the information across sentences. The third category,

script-based, involves the integration of text information and the reader’s background

knowledge; the correct answer cannot be found in the text itself.

The coding scheme above is somewhat similar to McKenna and Stahl’s (2009) three

levels of question types: 1) Literal questions involve retrieval of information that has

been explicitly mentioned in the text, 2) Inferential questions require readers to make

logical connections among the facts in the text in order to arrive at an answer which

cannot be located in the passage, and 3) Critical questions call for value judgements

about the reading material, for which the answers are not in the text. Using the same

framework of cognitive operation in locating the answers and making inference, PAT-R

has grouped the items of reading comprehension into: retrieving directly stated

information (RI), reflecting on texts (RF), interpreting explicit information (IE) and

interpreting by making inferences (II) (Stephanou et al., 2008). In addition to the

previously mentioned categories, McKay (2006) added another type known as text-

based questions that focus on the grammatical and vocabulary knowledge of the

readers.

Another way of categorizing question type is to distinguish between abstract and

concrete information requested by a question. Mosenthal (1996) addressed the level of

abstractness or concreteness of the information in a question and coded the items into

five levels: 1) the most concrete - calls for identification of persons, animals, or things,

2) highly concrete – asks for information about time, attributes or amounts, 3)

intermediate – requires the identification of manner, goal, purpose, alternative, attempt

or condition, 4) highly abstract – involves the identification of cause, effect, reason or

result, and 5) the most abstract – calls for information on equivalence, difference or

theme.

Other classifications of question type have been suggested by other researchers to

predict the difficulty of reading comprehension items. Researchers have built on

the question-classification frameworks of prior studies to propose their own

coding schemes (e.g. Ozuru et al., 2007, 2008; Scheuneman & Gerritz, 1990; Davey,

1988).

Question Format Variables

Perhaps the most popular classification of question format in reading tests is the

multiple-choice item⁶ vs. free-response item⁷. Several works have demonstrated that

question formats can serve as a source of difficulty of reading comprehension items

(e.g. Davey, LaSasso & Macready, 1983; Ebel, 1982).

Because standardized reading tests often utilize the multiple-choice format,

classifications of question format focus on features of multiple-choice items, which

consist of a stem and alternatives (made up of several wrong answers, known as

distractors, and at least one correct answer). For example, the stem, the stimulus segment or

statement of a multiple-choice item, is frequently grouped into wh- direct question⁸ and

incomplete statement⁹ formats (Popham, 2000).

Other question format variables that have become of interest for exploration include

stem length, stem content words, structure of alternatives/options, length of correct

answer and distractors, etc. For example, Scheuneman and Gerritz (1990) recommended

three categories of option structures based on the previous work of Carlton and Harris

(1989). The categories were: a) complete sentences or complex phrases containing

clauses that could stand alone as sentences, b) simple phrases, and c) short lists of 1-4

words. In another study, question format is addressed in terms of the falsifiability of the

distractors. A falsifiable distractor means that the information which establishes that the

option is incorrect is explicit in the text, whereas a distractor is not falsifiable if the

passage does not provide explicit textual evidence (Ozuru et al., 2008).

6 The format which requires students to respond to a question by selecting the correct answer

from three, four or five options

7 Also known as constructed-response item. This question requires students to write or construct their answer, rather than simply selecting it

8 Complete statement of question which normally begins with wh-question (i.e. what, who, when, where, which, why, whose and how) and ends with question mark

9 The question is formatted as incomplete statement where an omission occurs at the end of the stem/question

2.4.2 Individual Characteristics and Differential Item Functioning (DIF)

According to Bachman and Palmer’s (1996) philosophy of language testing, fairness

is one of the central considerations in test design. Fairness stipulates equal educational

opportunities for all students regardless of their ethnic background, economic and social

status and gender. The Code of Fair Testing Practices in Education developed by the Joint

Committee on Testing Practices (2004) urges test developers to design tests that are as

fair as possible without demeaning examinees of different races, ethnic backgrounds,

genders or demographic locations (rural and urban).

Fairness is a complex and broad area, involving test design, development, test

administration and scoring procedures (Kunnan, 2000; Popham, 2000; McNamara &

Roever, 2006). In the layperson’s view, bias is typically associated with unfairness and

favouritism. Psychometrically, Angoff (1993) defined an item as biased if test takers of

equal ability from different groups respond differently to the item. Shepard et al. (1981)

defined bias as “a kind of invalidity that harms one group more than another” (p. 318).

In the language testing context, examination items are considered biased if they contain

sources of difficulty that are not relevant to the construct being measured (Zumbo,

1999). This suggests that bias is present when construct-irrelevant characteristics of the

test takers influence the score of a test. An item might also be considered biased if it

contains language or content that is differentially difficult for different subgroups of

test-takers. In addition, an item might demonstrate item structure and format bias if

there are ambiguities or inadequacies in the item stem, test instructions, or distractors

(Hambleton & Rogers, 1995).

There are two methods to investigate potential bias in measurement/assessment

(Zumbo, 1999): (a) judgmental and (b) statistical. Zumbo recommended that in a

high-stakes test, statistical techniques are a feasible and defensible way to flag potentially biased

items, which leads us to differential item functioning (DIF).

The problem of inconsistent behaviour of common items across administrations can be

viewed as an instance of DIF, where two groups taking two different forms with some

items in common are the focal and reference groups. In principle, two groups of students

with the same level of English language proficiency should have equal probabilities of

responding to a reading test item correctly. If their probabilities are different, the item is

said to exhibit DIF.
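
The idea above can be illustrated with a short sketch. It assumes the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between person ability and item difficulty. The function names and the logit values are hypothetical and chosen purely for illustration; this is not the ConQuest procedure used in the present study.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def dif_gap(ability: float, difficulty_reference: float,
            difficulty_focal: float) -> float:
    """Difference in success probability between a reference-group member and
    a focal-group member of the SAME ability. A gap of zero means the item
    behaves identically for both groups at this ability level."""
    return (rasch_probability(ability, difficulty_reference)
            - rasch_probability(ability, difficulty_focal))


# Hypothetical values in logits, for illustration only.
theta = 0.0
gap_without_dif = dif_gap(theta, 0.5, 0.5)  # same difficulty in both groups
gap_with_dif = dif_gap(theta, 0.5, 1.2)     # item is harder for the focal group

print(f"No DIF:   gap = {gap_without_dif:.3f}")
print(f"With DIF: gap = {gap_with_dif:.3f}")
```

In this sketch a positive gap means that the item favours the reference group even though both examinees have the same ability, which is the situation the text describes as DIF.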

Dorans and Holland (1993) defined DIF as a psychometric difference between groups

that are matched on the ability or the achievement measured by an item. That is, an item

exhibits DIF if it provides a consistent advantage or disadvantage to members of a

group, not because of differences in the trait of interest, but because of differences in

other traits or because different versions (e.g., translations) of an item measure different

traits. More simply, when examinees in different groups have different probabilities of

answering an item correctly after controlling for overall ability, the item is said to

exhibit DIF (Shepard et al., 1981).

de Ayala (2009) further defined DIF based on its graphic representation: the difference

between two item response functions (IRFs), commonly known as item characteristic curves (ICCs).

The IRFs represent the item parameter estimates of the focal group and the reference

group. An item is flagged to have DIF when the two IRFs are not superimposed on one

another. Figure 2.4 and Figure 2.5 show an illustration of DIF.

Figure 2.4: ICC Plot of Item with DIF

Figure 2.5: ICC Plot of Item without DIF

Figure 2.4 shows that the observed curves of the two groups (for example, bumiputera

and non-bumiputera) are far apart from each other. In contrast, for an item with no

significant DIF (Figure 2.5), the curves representing boys (L) and

girls (P) are very close. de Ayala’s elaboration of DIF is consistent with the

definition provided by Hambleton (1989): “a test is unbiased if the item characteristic

curves across different groups are identical” (p. 189).

DIF analysis, however, should be used with caution. Some researchers have

at times used the terms ‘item bias’ and DIF interchangeably. Camilli and Shepard

(1994) and Angoff (1993) advocated that the two terms must be treated as two different

entities to avoid the perception that DIF is a source of bias. The term DIF, though, is

generally preferred by researchers because it describes what is actually being

observed rather than inferring the nature of the effect

(Scheuneman & Gerritz, 1990; Osterlind & Everson, 2009).

Therefore, it is important to emphasize that DIF is not prima facie evidence of test

bias (Angoff, 1993; Camilli & Shepard, 1994; Zumbo, 1999; McNamara & Roever,

2006). This implies that the existence of group differences in test scores does not

necessarily mean the test scores are biased. Thus, if items do not exhibit DIF, then it is

likely that no test bias is present. However, the presence of DIF in an item is not

necessarily an indicator of test bias; rather, it should be investigated further to uncover

the possible causes for its differential functioning. A review analysis known as “logical

evidence of bias” should be implemented to determine the relevance or irrelevance of

the source of DIF to the construct being measured (de Ayala, 2009). The earlier DIF

studies (see Elder, 1996; Chen & Henning, 1985; Kunnan, 1990) have found no pattern

of bias. For example, Chen and Henning (1985) pointed out that the differences in

performance between native speakers of Chinese and native speakers of Spanish in

UCLA’s ESL placement test were actually real and simply a reflection of English use in

real-world interaction. The items flagged as advantaging the Spanish test-takers were not

biased, as English shares some cognates with Spanish.

Another major caveat that should be taken into account in DIF is that the explanation of

DIF is highly speculative and it is a matter of guesswork (McNamara & Roever, 2006).

The crux of the matter lies in the question whether the DIF occurs due to the real

difference in the attribute being measured or as a result of extraneous factors. Therefore,

Camilli and Shepard (1994) strongly recommended that a statistical finding of DIF

should be reviewed by a panel of experts to explain why the items seem to be

relatively more difficult for different groups. Regardless of DIF analyses, the removal of

items which seem to disadvantage one particular group is subject to the decision of

the panel of experts.

2.5 REVIEW OF PREVIOUS STUDIES

Numerous literature reviews have indicated that two variables, item test features and

individual characteristics of examinees, are relevant predictors of performance on

reading comprehension tests.

2.5.1 Related Studies on Item Test Features

A review of studies examining the effect of test features on a test performance suggests

that variation in specific characteristics influences the difficulty of comprehension items. In

reading comprehension, several empirical studies identified a number of features that

may affect reading task difficulty. Davey and LaSasso (1984) noted that textually

explicit items were significantly easier than the textually implicit ones.

Using other classifications of test variables, Freedle and Kostin (1993) analyzed TOEFL

reading items and found that seven categories of item characteristics affected the item

difficulty. The seven variables noted as the sources of item difficulty were: lexical

overlap between text and the key, sentence length, passage length, paragraph length,

rhetorical organization, the use of negation and the use of referentials.

More recently, Ozuru et al. (2008) compared the extent to which item and text

characteristics predict the difficulty of comprehension items for the 7th – 9th and 10th –

11th grade levels. The results highlighted that younger readers were primarily influenced

by text features – in particular vocabulary difficulty. This finding was consistent with

those of Just and Carpenter (1992) and Perfetti (1985), which imply that passages containing

unfamiliar, infrequent words tend to contribute to low performance on the test.

Exploration of the effect of test features on reading tests continues. Athanasou and

Lamprianou’s (2004) investigation of a diagnostic reading test of Greek-Australian

high school students showed that longer key words had a large effect on the difficulty of

the questions. Taken together, the findings of several studies (e.g. Ozuru et al., 2008; Just & Carpenter,

1992; Perfetti, 1985; Athanasou & Lamprianou, 2004; Rupp, Garcia & Jamieson, 2001)

indicate that the length of test features had a significant influence on

reading test performance: the longer the words, the more difficult the question was.

In a different study, Davey (1988) assessed 20 predictor test features on reading

performance of successful and unsuccessful readers. The finding revealed that two

features - location of response information and stem length - showed a significant effect

for both groups. A somewhat similar result was reported by Sheehan and Ginther (2001),

who found that the item difficulty of the TOEFL reading test was largely

influenced by the location of the target information.

Another interesting issue related to the test features variable is concerned with the

relative contribution of passage and no-passage factors on performance of reading test.

Researchers embarked on investigating the effect of text availability when students

answered the question. In many of these studies, participants were divided into two

groups: the first group was allowed to look back at the text when answering the

question, while in the other group, the text was removed before the questions were

answered.

The results of text availability studies have been mixed. Freedle and Kostin (1994)

concluded that two thirds of the variance in item difficulty was associated with variables

that were solely related to the passage. However, the findings of other studies (Katz

& Lautenschlager, 1994, 2001; Embretson & Wetzel, 1987; Ozuru et al., 2007)

contradicted this conclusion. These studies pointed out that reading comprehension

tasks may measure factors that have nothing to do with the passage. This finding

raises questions about the validity of multiple-choice reading tests.

2.5.2 Related Studies on DIF

In the relevant literature on the effect of individual characteristics on the test

performance, it is interesting to note that gender difference has predominantly been

investigated as a source of variance in academic achievement. Reports of many

assessments have described that there is a gender gap in attainment in language and

other literacy-based subjects throughout the education system. For example, in the

Programme for International Student Assessment (PISA) in 2000, when reading literacy

was the focus of study in 32 countries, females outperformed males in all participating

countries (Kirsch et al., 2002). Perhaps the outcomes of such reports have prompted the

researchers to utilize DIF analysis in order to understand gender differences.

With regard to the assessment of reading comprehension, evidence about students’

reading preference suggests that girls are more accustomed to reading narratives, and

this may be one of the factors that contribute to better performance on related items

(Twist & Sainsbury, 2009; Scheuneman & Gerritz, 1990).

DIF analysis is not limited to gender. The fact that students of similar ability do not

obtain similar test scores in standardized examinations has also been attributed to other

factors such as differences in racial and ethnic group and academic background. In their

case study, Alderson and Urquhart (1983) pointed out that student academic discipline

affected the comprehension of reading. The participants, four groups of university

students from different faculties, tended to perform better on reading texts which

contained familiar content related to their area of study.

The research conducted by Pae (2004) yielded a similar observation.

The result of DIF analysis reflected that items dealing with science-related topics were

differentially easier for the Science students, whereas items about human relationships

were easier for the Humanities students.

Further explorations of DIF in reading assessment have attempted to link it with

the features of the test itself. Scheuneman and Gerritz’s (1990) study tried to examine

the extent to which item features associated with examinees’ characteristics are likely to

reflect the item difficulty. The finding showed a number of significant relationships

between item features and indicators of DIF. As an example, Science content in reading

passages was found to be relatively difficult for both female and Black examinees.

This information indicates that the difficulty of text type is presumably linked to DIF

indicators, in this case, gender and ethnicity.

2.6 SUMMARY

This chapter has concentrated on the literature review relevant to the current study. It

has looked at the complexity of the reading process and how this complexity poses a

challenging task for the test constructors and item writers.

The chapter has also introduced some psychometric properties of CTT and the Rasch

model, the two widely used tools for test development and improvement. This is to

provide insights for the item-analytical procedure to make judgement on the quality of

the items of the MUET reading paper.

As emphasized by experts in language testing, the identification of factors affecting the

performance of students in language tests is of central interest in test construction. This

study, therefore, is an attempt to investigate the effect of two important components –

the item test features and examinee’s characteristics – on the reading test items. An

overview of several types of test features is presented. It describes three major variables

– passage, question type and question format – which are often used to predict the

difficulty of reading test items. The chapter also outlines the discussion on DIF, a

statistical procedure to explore the influence of examinee’s characteristics such as

gender, ethnicity, socio-economic status and demographic location, on individual test

items.

This chapter has also reviewed some previous studies related to the impact of test

features and individual background on the performance of reading test. The review has

pointed out that the findings of past studies report a significant relationship between

item difficulty of reading test and test features and indicators of DIF. Previous research

indicates that the effect of item features and individual characteristics on the assessment

of comprehension is complex and depends on the nature of the source of the text and

individual item of the test in question.

CHAPTER 3: METHODOLOGY

3.1 INTRODUCTION

This study is built on the inquiry about several important questions related to the

performance of students on the reading test of MUET, the high-stakes English test used

for admission to Malaysian public universities. It is an item-level analysis which

seeks to explore the following questions:

1. How do the items spread in terms of their difficulty value and ability of the

students?

2. How good are the items of the MUET reading test?

3. To what extent do the features of the test contribute to the difficulty of items in

the MUET reading test?

4. Is there any differential item functioning in the MUET reading test in terms of

gender, ethnicity and demographic location?

In order to address the above questions, the study has undertaken three phases of

empirical exploration. Phase 1 involves item analysis of the MUET Reading

Comprehension paper. Item characteristics such as difficulty and discrimination indices

were examined using CTT analyses and the one-parameter IRT model, the Rasch

model. In Phase 2, the test items are analyzed using regression to assess the effect of

various features of passages, question types and question formats on the difficulty of the

reading items. Prior to this, all the 45 individual items were coded in terms of their test

task characteristics. There are six features chosen for this study which are discussed in

detail in the subsequent section. The last phase addresses the potential issue of

invariance of item difficulties for the subgroups of examinees who sat for the test. DIF

analysis is utilized here to examine the extent to which test-takers’ characteristics (i.e.

gender, ethnicity and demographic location) may influence their success in responding

correctly to the test items.

For the statistical procedures mentioned above, this study has utilised ConQuest, a

computer program for fitting item response and latent regression models (Wu, Adams, Wilson

& Haldane, 2007), in performing CTT and the Rasch item analyses and examining DIF.

Regression analysis, in addition, was used to investigate the relationship between the

chosen item test characteristics and the item difficulty of the reading test.

3.2 DESCRIPTION OF THE DATA

The data used for this study were secondary data from the End-MUET Reading Component,

which was administered in October 2009. The data were provided courtesy of the Malaysian

Examinations Council, the body authorized for the administration and design of the test.

The data comprised a sample of 8472 students, consisting of 3389 males and 5089 females

from two states: Sabah and the capital territory, Kuala Lumpur. The majority of these

students were Sixth Form students from public schools, private schools and state-funded

schools. Nonetheless, as shown in Table 1.1 (see Chapter 1), there were other

candidates – matriculation centre students, diploma holders, bachelor holders – who sat

for this standardized university entrance English test.

3.3 DESCRIPTION OF THE MATERIALS

The reading component in the test battery is a 90-minute paper comprising 45

dichotomously scored items. This test primarily focuses on expository texts which are

taken from newspapers, journals, magazines and academic books/materials. The

questions are based on six passages of varying length and difficulty level. There were

six different topics for this particular end-year MUET paper: Legal Issue/Law,

Advertisement/Communication, Business, IQ Enhancement, Travel and Youth

Leadership.

The MUET booklet on Regulations, Test Specifications, Test Formats and Sample

Questions published in 2006 by the Malaysian Examinations Council describes the test

as a competency test to assess a candidate’s ability to comprehend various types of text

of different length and level of complexity (content and language). Generally, the

assessment of this reading test is primarily focused on cognitive operations under the

framework of Bloom’s taxonomy. The skills assessed by the test items include:

(i) Comprehension

Skimming and scanning

Extracting specific information

Identifying main ideas

Identifying supporting details

Deriving the meaning of words, phrases, sentences from the context

Understanding linear and non-linear texts

Understanding relationships within a sentence and between sentences

Recognizing a paraphrase

(ii) Application

Predicting outcomes

Applying a concept to a new situation

(iii) Analysis

Understanding language functions

Interpreting linear and non-linear texts

Distinguishing the relevant from the irrelevant

Distinguishing fact from opinion

Making inferences

(iv) Synthesis

Relating ideas and concepts within a paragraph and between

paragraphs

Following the development of a point or an argument

Summarizing information

(v) Evaluation

Appraising information

Making judgements

Drawing conclusions

Recognizing and interpreting writer’s views, attitudes or intentions

3.4 DESCRIPTION OF THE PROCEDURES

3.4.1 Coding of Individual Items

The 45 questions of the MUET reading comprehension and their answer options were

coded using various classification schemes. Generally, three major facets – passage

variable, question type variable and question format variable – were identified from the

literature to be examined as a potential source of item difficulty of the reading test.

These variables were then subcategorized into various types.

The first coding involved the passage variable, which focused on only one variable: the

length of the passage. This variable was coded based on the number of the lines in the

text. The scheme was taken from the study conducted by Scheuneman and Gerritz

(1990). The texts were coded into three types according to the number of lines each

passage contained: (i) text with more than 25 lines, (ii) text with more than 35 lines and

(iii) text with more than 45 lines.

The second major variable dealt with the type of questions. It was subdivided into

two sub-variables: (a) type of question (Stephanou et al., 2008) and (b) inference type

(Davey, 1988). The type of question addressed the type of passage comprehension

processes that the students needed to engage in to reach an answer. There

were four levels in this scheme, which was developed for PAT-R 2008. However, for

the purpose of this study, the original order has been rearranged to match the

difficulty level of the items in this test. The first level comprises retrieving directly

stated information (RI) for which the answer of the question was explicitly stated in the

passage. Minimal text processing is required for this question as students need to

identify a close match or synonym between wording of the items and the relevant

information in the text. The second level involves interpreting explicit information (IE)

across sentences or paragraphs. Answering this question involves identifying a

paraphrase or rewording of the information in the passage which matches the

wording in the item. In contrast to the first level, students need to link and combine

explicitly stated information across several sentences or sections of a passage. The third

level consists of interpreting by making inferences (II). To reach a

correct answer, this question requires students to recognize the information implied or

suggested by the author using the clues in the passages, but not directly stated in the

text. Finally, the fourth level is reflecting on texts (RF). The information needed to

answer the item is not explicitly stated in the passage. The test takers must draw on their

prior knowledge to make inferences about the situation in the text.

The above coding system was somewhat similar to Ozuru et al.’s (2008) classification

of reading comprehension items in terms of the relation between passage and question.

PAT-R’s coding was preferred in this study over that of Ozuru et al. for two reasons.

First, PAT-R’s categories clearly divide the items into explicitly and

implicitly stated information. Specifically, the first two categories – RI and IE – are

concerned with relevant information stated explicitly in the passage. On the

other hand, the last two, II and RF, focus on information which is not directly stated

in the text. Second, it differentiates the level of comprehension in detail, from sentence

level (RI), across-sentence/paragraph level (IE), integration of textual information with

inference skill (II) and integration of textual information with prior knowledge (RF).

The other sub-variable, inference type, was adopted from Davey (1988). This

variable classifies items into five categories representing the level of inference required

to reach the keyed answer: (i) no inference, (ii) paraphrase inference, (iii) across-sentence

(bridging) inference, (iv) macrostructure (gist) inference and (v) reader-based

(prior knowledge) inference.

In addition to the previously mentioned variables, items were also coded in relation to

the question format, which concerned features of the options and distractors of the

questions. Firstly, options were analyzed according to their structure (Scheuneman &

Gerritz, 1990; Carlton & Harris, 1989). Three categories were used to fit the

majority of items: (i) short lists of 1-4 words, (ii) simple phrases and (iii) complete

sentences or complex phrases containing clauses that could stand alone as sentences.

Options were also grouped according to the number of alternatives given. In this test,

there were only two types used: (i) 3-option and (ii) 4-option. In order to investigate the

effect of the distractors on item difficulty, distractors were also coded following the

coding method of Davey (1988). This analysis addressed the degree to which the

incorrect response options (distractors) were plausible. The codes for this were:

(i) no distractors were plausible and (ii) one or more distractors were plausible.

To summarize, there were six predictors chosen for this study to assess their relative

significance on the difficulty of the 45 MUET reading comprehension items (see details

of the coding scheme in Appendix B). Those test features that had come under scrutiny

were:

1) Passage variable

a) The length of the passage

2) Question type variable

a) Type of question

b) Inference type/level

3) Question format variable

a) Structures of the responses/alternatives

b) Number of options/alternatives

c) Plausibility of the distractors
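
The six predictors listed above can be pictured as one coded record per item. The sketch below is a hypothetical representation only: the class name, field names and numeric codes are assumptions made for illustration (the actual coding scheme is the one given in Appendix B), and treating each code as an ordinal number is likewise an assumption rather than the procedure followed in this study.

```python
from dataclasses import dataclass

# Question types in the order used in this study (rearranged from PAT-R).
QUESTION_TYPES = ("RI", "IE", "II", "RF")


@dataclass(frozen=True)
class ItemCoding:
    """Hypothetical record holding the six coded features of one item."""
    passage_length: int          # 1-3: the three line-count categories
    question_type: str           # one of RI, IE, II, RF
    inference_type: int          # 1-5: no inference ... reader-based
    option_structure: int        # 1-3: short list, simple phrase, sentence
    n_options: int               # 3 or 4 alternatives
    plausible_distractors: bool  # True if one or more distractors are plausible

    def __post_init__(self):
        # Reject codes that fall outside the categories described in the text.
        if self.passage_length not in (1, 2, 3):
            raise ValueError("passage_length must be 1, 2 or 3")
        if self.question_type not in QUESTION_TYPES:
            raise ValueError("question_type must be RI, IE, II or RF")
        if self.inference_type not in (1, 2, 3, 4, 5):
            raise ValueError("inference_type must be between 1 and 5")
        if self.option_structure not in (1, 2, 3):
            raise ValueError("option_structure must be 1, 2 or 3")
        if self.n_options not in (3, 4):
            raise ValueError("n_options must be 3 or 4")

    def to_row(self) -> list:
        """Numeric predictor row, e.g. for a regression design matrix."""
        return [
            self.passage_length,
            QUESTION_TYPES.index(self.question_type) + 1,
            self.inference_type,
            self.option_structure,
            self.n_options,
            int(self.plausible_distractors),
        ]


# A made-up item, not taken from the MUET paper.
example = ItemCoding(2, "IE", 3, 2, 4, True)
row = example.to_row()
print(row)
```

Stacking one such row per item would give a 45-row predictor matrix that could be regressed on the item difficulty estimates.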

3.4.2 Statistical Analyses

3.4.2.1) CTT and Rasch Item Analyses

The first statistical analysis performed for the data was item analysis under the CTT and

one-parameter IRT frameworks using the statistical software ConQuest, which

can produce analyses for both approaches.

In the traditional item analysis which focuses on the total score of a test, the items were

examined based on the three major psychometric characteristics of CTT; item facility

indices, discrimination indices and reliability coefficients. The command file for item

analysis of this dichotomous data is found in Appendix C.

In CTT, the item facility index is calculated simply by dividing the number of examinees who obtain the correct answer by the total number of examinees who answer the item. Equation (3.1) depicts the formula for the item facility (difficulty) index:

$$p = \frac{\text{Number of examinees with correct response}}{\text{Number of examinees}} \qquad (3.1)$$

In this study, the facility indices of all 45 items are assessed using the criteria proposed by Henning (1987), as shown in Table 3.1 below.

Table 3.1: Classification of Item Facility Index (Henning, 1987)

Item Facility Index    Item Description
0.67 and above         Too easy
Below 0.33             Too difficult

Psychometricians have developed many indices to assess the discriminating ability of test items. In CTT, probably the simplest procedure for calculating an index of discrimination is based on the correlation between each item and the total score (Nunnally & Bernstein, 1994). This correlation between a person’s score on an item and his or her total score is also called the ‘point biserial correlation’. For the discrimination index, the items were classified according to the criteria of Ebel and Frisbie (1991), as previously mentioned in Chapter 2 (Section 2.3.1).

In addition to the above properties of CTT, the data set was also subjected to a distractor analysis. This analysis helps test developers determine whether the distractors (the incorrect answers) are functioning as intended.

Another way to check the quality of the test under CTT is through the reliability coefficient, which indicates the consistency of the test scores. The reliability of a test is often evaluated based on the following general guidelines:

1) Reliability estimates of 0.90 or even 0.95 are expected for high-stakes tests which involve making important decisions about individuals (Nunnally & Bernstein, 1994; Reynolds et al., 2009)

2) In many testing situations, such as personality tests and group and individually administered achievement tests, the reliability index should be at least 0.80 (Reynolds et al., 2009)

3) Modest reliability indices of at least 0.70 are acceptable for teacher-made classroom tests (Reynolds et al., 2009)

4) Reliability indices as low as 0.50 can be tolerated for teacher-made tests, provided that the test score will be combined with the scores from other assessments (Ebel & Frisbie, 1991)

It can be seen from the preceding guidelines that a satisfactory level of reliability depends on the intended purpose of the test. That is, minimum values of reliability coefficients need to be established according to the context of score use. Therefore, there are no absolute standards that serve as fixed criteria for judging the reliability of a test.

In this study, several analyses were performed to examine how the reliability statistic increases when items with low discrimination indices are removed. Psychometricians recommend the deletion of non-discriminating items as one of the measures for improving the reliability of a test (Alagumalai & Curtis, 2005; Reynolds et al., 2009).
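The effect of deleting a weak item on reliability can be illustrated with a short computation. The following is a minimal sketch of coefficient alpha for dichotomous data, not the ConQuest routine used in this study; the function names and the toy response matrix are hypothetical and serve only to show the calculation.

```python
from statistics import pvariance


def coefficient_alpha(responses):
    """Coefficient alpha for a list of examinee rows of 0/1 item scores."""
    n_items = len(responses[0])
    # Variance of each item across examinees.
    item_variances = [pvariance([row[i] for row in responses]) for i in range(n_items)]
    # Variance of the examinees' total scores.
    total_variance = pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


def alpha_without(responses, dropped):
    """Coefficient alpha after deleting the item positions listed in `dropped`."""
    kept = [[x for i, x in enumerate(row) if i not in dropped] for row in responses]
    return coefficient_alpha(kept)


# Hypothetical toy data: four examinees, four items.
# The last item runs against the pattern of the other three.
toy = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1]]
alpha_all = coefficient_alpha(toy)
alpha_trimmed = alpha_without(toy, {3})
```

In this toy example alpha rises from about 0.17 to 0.75 once the inconsistent item is dropped, which is the same mechanism, on a much smaller scale, as the item deletion examined in this study.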

For the Rasch analyses, test items were also evaluated using item difficulty and item discrimination. Unlike in CTT, the difficulty of an item is defined as the ability level at which a person has a 50 percent chance of getting the item correct. The probability of a correct response to an item is a function of ability, as expressed in the item characteristic curve (ICC). The higher a person's ability relative to the difficulty of an item, the higher the probability of a correct response on that item. When a person's location on the latent trait is equal to the difficulty of the item, there is by definition a 0.5 probability of a correct response in the Rasch model (Hambleton et al., 1991; Bond & Fox, 2007). In theory, both item difficulty and ability can range from -∞ to +∞, though in practice they typically fall within -3 to 3 (de Ayala, 2009).
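The relationship between ability, difficulty and the probability of success can be sketched in a few lines. This is a minimal illustration of the standard dichotomous Rasch formula; the function name and the example values are assumptions made here and are not taken from the ConQuest analysis.

```python
import math


def rasch_probability(theta, delta):
    """Probability of a correct response for ability `theta` and item
    difficulty `delta`, both expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


# A person located exactly at the item difficulty has a 0.5 chance of success.
p_matched = rasch_probability(1.2, 1.2)

# The probability rises as ability increases relative to the same item.
p_low = rasch_probability(-1.0, 0.0)
p_high = rasch_probability(1.0, 0.0)
```

Because only the difference between ability and difficulty enters the formula, persons and items can be placed on one common logit scale.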

[Figure 3.1: item characteristic curve. Vertical axis: probability of success (0.0 to 1.0, with 0.5 marked); horizontal axis: ability, from very low achievement to very high achievement.]

Figure 3.1: Probability of Success on an Item (Wu & Adams, 2008)

Figure 3.1 is a graphical display of the ICC, which depicts the notion of probability of success on an item. The horizontal axis is the ability scale, measured in logits. The right end represents the high-ability group, whereas the left end represents the low-ability group. The vertical axis denotes the probability of a correct answer. As the ICC shows, the probability of a correct response on an item increases steadily as ability increases. The distribution of items can be traced from the item-person map, which displays the item parameters and person ability parameters on the same scale.

According to the Rasch model, all items are expected to be equally discriminating. Wu (2010) recommended two ways to examine the discrimination power of an item. First, it can be observed from the steepness of the empirical ICC. An empirical curve that is steeper than the modelled curve suggests that lower-achieving students are more likely to choose an incorrect answer than the model predicts, while higher-achieving students are more likely to choose the correct answer than the model predicts. This pattern indicates that the observed item is highly discriminating.

Another method to examine discrimination is through a fit statistic, which is reported as the fit mean square (MNSQ) in ConQuest (Wu & Adams, 2008). Under this approach, a low discriminating item has an MNSQ greater than 1, whereas a high discriminating item has an MNSQ less than 1.

Another key aspect of Rasch analysis is the evaluation of fit, which offers information about items that do not contribute to the whole set of data. Several methods have been employed to assess the goodness of fit of a dataset to the Rasch model. Perhaps one of the most promising goodness-of-fit analyses is the residual-based fit statistic (Hambleton et al., 1991). This type of statistic concerns the degree of correspondence between what is expected and what is observed (Bond & Fox, 2007; McNamara, 1996; Hambleton et al., 1991). In other words, model-data fit can be evaluated by comparing the predicted and the actual responses of examinees to particular items.

In ConQuest, residual fit statistics are reported in the form of two sets of statistics: mean square statistics and t statistics. Theoretically, the mean square statistic (MNSQ) indicates how closely the slope of the observed ICC matches the slope of the predicted/modeled ICC (Wu & Adams, 2008). This statistic has an expectation of one. Hence, for items to be considered to fit the Rasch model, they should have an MNSQ statistic within the range 0.77 to 1.30 (Adams & Khoo, 1993) or a narrower range from 0.83 to 1.20 (Keeves & Alagumalai, 1999).

The second statistic used as an indication of misfit is the t statistic which is defined as

“a normal deviate with a mean of zero and a standard deviation of one” (Wu & Adams,

2008). Many psychometricians conventionally accept that the value of t outside the

range of -2 to +2 shows the violation of the expectation of the Rasch model (Bond &

Fox, 2007; McNamara, 1996; Wu & Adams, 2008).
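The idea behind residual-based mean squares can be sketched as follows. This is the textbook form of the unweighted and weighted mean squares computed from known abilities and a known item difficulty; it is an illustrative assumption and not a reproduction of ConQuest's own estimation procedure. The function names and example values are hypothetical.

```python
import math


def item_mean_squares(responses, abilities, delta):
    """Return (unweighted, weighted) mean-square fit for one item.

    responses: 0/1 scores on the item, one per examinee
    abilities: ability estimates in logits, one per examinee
    delta:     difficulty estimate of the item in logits
    """
    squared_residuals, variances = [], []
    for x, theta in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(-(theta - delta)))  # expected score
        squared_residuals.append((x - p) ** 2)         # observed minus expected
        variances.append(p * (1.0 - p))                # model variance of the score
    # Unweighted: average of the squared standardized residuals.
    unweighted = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Weighted: squared residuals pooled and divided by the pooled variance.
    weighted = sum(squared_residuals) / sum(variances)
    return unweighted, weighted


def within_range(mnsq, low=0.77, high=1.30):
    """Check an MNSQ value against the Adams & Khoo (1993) range quoted above."""
    return low <= mnsq <= high


# When every examinee sits exactly at the item difficulty, both statistics equal 1.
unweighted, weighted = item_mean_squares([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5], 0.5)
```

Values above 1 indicate more noise in the responses than the model expects, and values below 1 indicate less, which is why the statistic is read alongside the slope of the empirical ICC.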

3.4.2.2) Regression Analyses

In response to the third research question “to what extent do the features of the test

contribute to the difficulty of an item in the MUET reading test?”, regression analyses

were performed. The analyses were conducted to assess the simultaneous effects of six

different test features on the item difficulty of the MUET reading test.

Regression analysis is a statistical tool designed to model the relationships between variables: a dependent variable¹⁰ and one or more independent variables¹¹. It is often used to explore the causal effect of one variable upon another, for example the effect of a price increase upon demand. To address such a question, one gathers data on the variables of interest and employs regression to estimate the quantitative effect of the independent variable(s) upon the dependent variable.

In this study, linear regression was employed to test for correlation between the

dependent variable (item difficulty) and six independent variables (various test features

based upon prior research). The variables identified to predict the source of item

difficulty were:

1) The length of the text

2) Type of question

3) Inference type/level

4) Structure of the response/alternative

5) Number of options/alternatives

6) Plausibility of the distractors

A keen interest of this study is to examine the extent to which the above variables yield

similar results as those in the earlier research on the factors affecting the difficulty of

reading test. The findings of the current study are expected to confirm the following

hypotheses:

10 Also referred to as the criterion variable. It is a variable that is affected or changed by the independent variable.

11 A variable which is presumed to affect or determine a dependent variable. Also known as the predictor variable.

1) Lengthy passages are likely to be more difficult to comprehend than shorter ones because the density of words or sentences may affect the information-processing demand of the text

2) Items with textually implicit evidence are significantly more difficult as they require the examinees to infer the correct answer from across the passage

3) Items requiring a higher level of cognitive skills turn out to be more difficult to answer

4) Responses/alternatives that contain fewer words tend to be easier because longer sentences/phrases place greater demands on understanding

5) Questions with more options to choose from tend to be more difficult because students need to assess all the information given in order to reach the best answer

6) Items with more than one plausible distractor are potentially more difficult as the distractors share more information with the correct answer and the information in the passage

3.4.2.3) DIF Analysis

Analysis of DIF is crucial in reading assessment due to the likelihood that readers have different ways of reaching an answer. This analysis is useful for revealing possible sources of item difficulty in a reading test across different groups of examinees.

DIF analysis is a statistical procedure employed to investigate individual items in a test. An item signals a departure from the IRT model if examinees of the same ability from different groups (e.g. gender, ethnicity or socio-economic status) have different probabilities of giving a certain response to that item.

In the Rasch model, an item exhibits DIF if its estimated difficulty differs significantly between the subgroups. The significance of this difference can be assessed by dividing the estimated DIF parameter by its standard error. If the resulting index falls outside the range between -2 and +2, the item is flagged as exhibiting DIF.
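The flagging rule just described can be written as a small helper. This is a minimal sketch; the function name, the default cut-off argument and the example estimates are hypothetical and are not values from this study.

```python
def flag_dif(dif_estimate, standard_error, critical=2.0):
    """Divide the DIF estimate by its standard error and flag the item
    when the ratio falls outside the range -critical to +critical."""
    ratio = dif_estimate / standard_error
    return ratio, abs(ratio) > critical


# Hypothetical estimates (in logits) for two items.
ratio_a, flagged_a = flag_dif(0.10, 0.04)   # ratio of about 2.5, flagged
ratio_b, flagged_b = flag_dif(-0.03, 0.04)  # ratio of about -0.75, not flagged
```

With very large samples the standard errors become small, so even a trivial difference in difficulty can exceed the cut-off; the size of the DIF estimate itself should therefore be inspected as well.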

To check for DIF in this MUET reading paper, the analysis contrasted the test-takers according to their gender (male and female), ethnicity (bumiputera and non-bumiputera¹²) and geographic location (Kuala Lumpur and Sabah).

3.5 SUMMARY

This chapter has outlined a range of methods to address the research questions presented in Chapter 1. The first statistical procedure, under the frameworks of CTT and the Rasch model, has taken into account the major psychometric properties in order to judge the quality of the items of this high-stakes standardized test.

Regression analysis has been employed in response to the issue of relationships between

item difficulty and the test features. This is to discover which variable has a profound

effect on the performance of test-takers in the reading test. The last empirical

exploration has dealt with DIF analysis which aims to assess the influence of students’

characteristics on their response to individual items in the test.

In summary, it is necessary to scrutinize the test items and examinees’ characteristics to

establish whether there is any evidence that the results are affected by factors other than

reading ability.

12 Non-bumiputera consists of the Chinese and Indians who originally migrated to Malaya (Malaysia) to work in gold mines and rubber estates.

CHAPTER 4: FINDINGS OF THE STUDY

4.1 INTRODUCTION

The aim of this research is to evaluate the individual items of the MUET reading comprehension test. To this end, the study formulated four research questions, which were presented in Chapter 1.

In the following sections, the results of the statistical analyses for answering the

research questions are presented. The first section focuses on the outputs of CTT and

the Rasch model in evaluating the quality of the test from an item-level analysis of item

difficulty, discrimination index, fit indices and test reliability. Next, the findings on the effect of test features, i.e., the six predictors, on the difficulty of the reading comprehension items are discussed. The results of the DIF analyses are then described in the last section.

4.2 RESULTS OF ITEM ANALYSIS

In this section, the psychometric analyses of CTT and Rasch model will be discussed.

The findings are evaluated according to the item psychometric properties – item

facility/difficulty, discrimination index, fit statistics as well as overall test reliability.

Item Facility

In CTT, the difficulty of an item is defined as the proportion of the group who answer the item correctly. The results show that the facility of the items ranges from 0.03 to 0.81, where 0 denotes an item that no student answers correctly and 1 denotes an item that all examinees answer correctly. All 45 items can be categorized according to Henning’s (1987) criteria as follows:

Table 4.1: Categories of CTT Item Facility Index

CATEGORY                                  ITEMS                                           TOTAL
EASY ITEMS (Item Facility ≥ 0.67)         8, 17, 22, 25, 26, 31, 33, 37, 44               9
AVERAGE DIFFICULTY ITEMS                  1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 14, 15, 16,    27
                                          18, 19, 21, 23, 27, 29, 30, 32, 34, 35, 39,
                                          40, 41, 43
DIFFICULT ITEMS (Item Facility ≤ 0.33)    9, 10, 20, 24, 28, 36, 38, 42, 45               9

As can be seen, easy and difficult items are evenly distributed in the MUET reading paper (nine items each). It is also apparent that most of the items (27 of 45) fall within the average difficulty level.
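The classification above can be reproduced with a short routine. This is a minimal sketch: the function names are hypothetical, the cut-offs are those shown in Table 4.1, and the two worked cases use the response counts that ConQuest reports for Item 34 and Item 38 (2961 and 299 correct responses out of 8472 cases).

```python
def item_facility(n_correct, n_examinees):
    """Proportion of examinees who answered the item correctly."""
    return n_correct / n_examinees


def classify_facility(p, easy_cut=0.67, difficult_cut=0.33):
    """Place a facility index into one of the three categories of Table 4.1."""
    if p >= easy_cut:
        return "easy"
    if p <= difficult_cut:
        return "difficult"
    return "average"


# Counts of correct responses taken from the ConQuest item analysis output.
facility_34 = item_facility(2961, 8472)  # roughly 0.35
facility_38 = item_facility(299, 8472)   # roughly 0.04
category_34 = classify_facility(facility_34)
category_38 = classify_facility(facility_38)
```

Under these cut-offs Item 34 falls in the average band and Item 38 in the difficult band, in agreement with their placement in Table 4.1.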

These results are consistent with results of the Rasch analysis since there is a one-to-one

correspondence between Rasch item difficulty and CTT item facility. The item-person

map offers a clear view of how the items are spread across the continuum of ability

range of the groups of students who sat for this particular test.

Figure 4.1 shows the item-person map. The left panel displays the latent ability distribution on a scale measured in logits. For this reading test, the students’ abilities range between about -2 and 2 logits. The right panel, on the other hand, shows the item parameters, which are plotted according to their difficulties on the same logit scale. The easiest item is plotted at the bottom of the map, while the most difficult item is plotted at the top of this scale. In Figure 4.1, Item 38, with the greatest estimated difficulty (delta value of 3.33), appears to be the most difficult item, whereas the easiest item is Item 26, which has the lowest estimated difficulty of -1.74.

Figure 4.1: The Item and Latent Distribution Map

=========================================================================
ConQuest: Generalised Item Response Modelling Software
MAP OF LATENT DISTRIBUTIONS AND THRESHOLDS
=========================================================================
        PERSON|ITEM PARAMETERS
              |38
              |
              |
  2           |
              |
              |
              |                 Difficult items
              |
              |24
             X|
              |36
              |
             X|9 42
             X|28
            XX|20
  1         XX|45
            XX|
            XX|
            XX|10
          XXXX|
          XXXX|
          XXXX|32 34
        XXXXXX|1 23
         XXXXX|14 39 43
        XXXXXX|                 Average Difficulty items
        XXXXXX|11 13
       XXXXXXX|7 41
  0   XXXXXXXX|3 12 27
     XXXXXXXXX|
      XXXXXXXX|16
     XXXXXXXXX|2 4 5 40
      XXXXXXXX|30
    XXXXXXXXXX|18
    XXXXXXXXXX|19 21
     XXXXXXXXX|29 35
      XXXXXXXX|6
      XXXXXXXX|
         XXXXX|15
         XXXXX|
 -1      XXXXX|33
           XXX|44
            XX|8
            XX|17
             X|
             X|22
             X|25 31            Easy items
              |37
             X|
              |26
              |
              |
 -2           |
              |
              |

Person-side annotation in the original figure: Average ability

In the Rasch model, the difficulty estimate of an item is the ability level at which a person has a 50 percent chance of getting the item correct (Hambleton et al., 1991; de Ayala, 2009). This notion of the probability of getting the item correct can be visualized through the ICC plots in Figure 4.2 (Item 38) and Figure 4.3 (Item 26).

Figure 4.2: ICC Plot of Difficult Item (Item 38)

Figure 4.3: ICC Plot of Easy Item (Item 26)

As observed in Figure 4.2, the difficulty of Item 38 is estimated at a delta value of 3.33. This means that a person with an ability of 3.33 logits has a 50 percent chance of obtaining the correct answer. In contrast, Item 26 is an easy item because a person with an ability of only -1.75 has a 50 percent chance of obtaining the correct answer to this item. That is, the item difficulty is estimated at -1.75. Thus, difficult items are plotted at the positive end of the scale, while easy items are plotted at the negative end.
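For reference, the 50 percent statement follows from the standard dichotomous Rasch formula, which is not written out in this section. The symbols θ (person ability) and δ (item difficulty) are notation assumed here. The second case below is only an illustration; it uses the average examinee ability of -0.168 reported for this test.

```latex
P(X = 1 \mid \theta, \delta) = \frac{\exp(\theta - \delta)}{1 + \exp(\theta - \delta)}

% Item 38, person located exactly at the item difficulty:
\theta = \delta = 3.33 \;\Rightarrow\; P = \frac{\exp(0)}{1 + \exp(0)} = 0.5

% Item 38, person at the average ability of the sample:
\theta = -0.168,\ \delta = 3.33 \;\Rightarrow\;
P = \frac{\exp(-3.498)}{1 + \exp(-3.498)} \approx 0.03
```

An examinee of average ability therefore has only about a 3 percent chance of answering Item 38 correctly, which illustrates how far this item lies above the bulk of the ability distribution.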

The average ability of the examinees for this particular test was -0.168, with a small standard error of 0.008. As seen in the item-person map, the test was reasonably well targeted to the ability levels of the students: the items are located on both sides of the students’ average ability.

Discrimination Index

Another basic consideration in evaluating the performance of a test item is the degree to

which the item discriminates between high achieving students and low achieving

students. The practical rule is, the higher the discrimination index, the better the items.

For this MUET reading component, the results are shown in Table 4.2:

Table 4.2: Categories of CTT Discrimination Index

CATEGORY                                    ITEMS                                              TOTAL
POOR ITEMS (DI ≤ 0.19)                      5, 8, 9, 10, 11, 15, 18, 20, 23, 27, 28, 34, 35,   15
                                            38, 42
MARGINAL ITEMS (0.20 ≤ DI ≤ 0.29)           7, 12, 14, 16, 19, 21, 24, 26, 29, 31, 44, 45      12
REASONABLY GOOD ITEMS (0.30 ≤ DI ≤ 0.39)    1, 3, 4, 6, 13, 17, 22, 25, 32, 33, 36, 37, 39,    16
                                            40, 41, 43
VERY GOOD ITEMS (DI ≥ 0.40)                 2, 30                                              2

Following the criteria proposed by Ebel and Frisbie (1991), which place items in four categories, Table 4.2 shows that sixty percent of the reading items (27 of the 45 items, classified as poor or marginal) are weak in discriminating between high-ability and low-ability students. Such a large proportion of weak items may call into question whether the test scores can be interpreted as coming from a good test.
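The discrimination index and its classification can be sketched as follows. This is a minimal illustration of the point biserial correlation as an ordinary Pearson correlation between a 0/1 item score and the total score; the function names and the toy data are hypothetical, the category boundaries are those of Table 4.2, and ConQuest's own computation may differ in detail.

```python
import math


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total test score."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, total_scores))
    ss_x = sum((x - mean_x) ** 2 for x in item_scores)
    ss_y = sum((y - mean_y) ** 2 for y in total_scores)
    return covariance / math.sqrt(ss_x * ss_y)


def classify_discrimination(di):
    """Place a discrimination index into the four categories of Table 4.2."""
    if di >= 0.40:
        return "very good"
    if di >= 0.30:
        return "reasonably good"
    if di >= 0.20:
        return "marginal"
    return "poor"


# Hypothetical toy data: the two highest scorers answer the item correctly.
di_toy = point_biserial([0, 0, 1, 1], [1, 2, 3, 4])
category_toy = classify_discrimination(di_toy)
```

A negative index, such as the -0.06 and -0.11 reported for Items 34 and 38, falls into the poor category under this rule and means that the item is answered correctly more often by weaker examinees than by stronger ones.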

A further examination of the discrimination index through the analysis of Rasch model

can be observed from the steepness of empirical/observed ICC. Two items – Item 2

(Figure 4.4) and Item 8 (Figure 4.5) – are used as an example of this approach.

Figure 4.4: ICC Plot of High Discriminating Item (Item 2)

Figure 4.5: ICC Plot of Low Discriminating Item (Item 8)

As shown in Figure 4.4, the observed curve¹³ is steeper than the theoretical curve¹⁴. The steepness of the slope suggests that low-achieving students are more likely to choose an incorrect answer than the model predicts. Conversely, high achievers are more likely to get the correct answer than the model expects. The steep slope of the observed curve indicates that Item 2 has a higher discrimination power than expected by the model.

On the other hand, for the low discriminating item represented in Figure 4.5, the empirical curve is slightly flatter than the theoretical one. This flatness indicates that Item 8 is less discriminating than predicted by the model: the high-ability group is less likely to choose the correct answer than expected.

13 Known as empirical curve which denotes the raw data collected for the items (the dotted line)

14 Derived after the computation of item difficulty estimate (the solid line) and also known as the expected ICC

Another way to check the discriminating power of an item in Rasch model is to use the

fit statistic – reported as mean square (MNSQ) in ConQuest (Wu & Adams, 2008). It

postulates that discriminating power of an item is reflected by the extent to which the

MNSQ is greater or less than 1. In Figure 4.4, for example, it is observed that when the

empirical ICC is steeper than the theoretical curve, the MNSQ is less than one (0.91).

This indicates that Item 2 has high discriminating power. Note, on the other hand, Item

8 (Figure 4.5) is less discriminating due to its flatter observed curve and MNSQ index

of more than 1 (1.06).

It is important to mention that inspection of item discrimination often leads to another

important analysis, that is distractor analysis. This is because less discriminating items

are frequently scrutinized for distractor analysis to examine the possible source of their

low discrimination indices. Popham (2000) and Reynolds et al. (2009) advocated that

such items need closer examination and possible revision.

In this study, distractor analysis plays an important role in exploring the causes of the poor discrimination power of the 15 items (items with close-to-zero or negative discrimination indices) which are classified as poor items (see Table 4.2).

The analysis identifies two items, Item 34 and Item 38, which require further investigation. These items have negative discrimination indices (-0.06 and -0.11 respectively). The CTT analysis of these items is summarized in Table 4.3:

Table 4.3: Summary of CTT Analyses of Item 34 and Item 38

ITEM 34
-------
Cases for this item 8472     Discrimination -0.06
Item Threshold(s): 0.50      Weighted MNSQ 1.16
Item Delta(s): 0.50
-------------------------------------------------------------------------
Label   Score   Count   % of tot   Pt Bis   t (p)           PV1Avg:1   PV1 SD:1
-------------------------------------------------------------------------
A       0.00      455      5.37    -0.10     -9.54(.000)     -0.42      0.57
B*      1.00     2961     34.95    -0.06     -5.30(.000)     -0.16      0.66
C       0.00     4175     49.28     0.15     13.51(.000)     -0.11      0.59
D       0.00      871     10.28    -0.07     -6.50(.000)     -0.30      0.59
X       0.00       10      0.12    -0.02     -1.42(.157)     -0.51      0.93
=========================================================================
ITEM 38
-------
Cases for this item 8472     Discrimination -0.11
Item Threshold(s): 3.33      Weighted MNSQ 1.07
Item Delta(s): 3.33
-------------------------------------------------------------------------
Label   Score   Count   % of tot   Pt Bis   t (p)           PV1Avg:1   PV1 SD:1
-------------------------------------------------------------------------
A       0.00     4759     56.17    -0.08     -7.45(.000)     -0.20      0.61
B       0.00     2792     32.96     0.19     18.14(.000)     -0.02      0.64
C       0.00      608      7.18    -0.12    -10.75(.000)     -0.39      0.52
D*      1.00      299      3.53    -0.11    -10.15(.000)     -0.40      0.55
X       0.00       14      0.17    -0.02     -1.61(.108)     -0.43      0.64
=========================================================================
*Correct answer

For Item 34, with DI = -0.06, high achievers are more likely than low achievers to choose a wrong answer (option C). This is not a desirable situation, and therefore option C needs to be examined to determine why it attracts top examinees. Similarly, Item 38 signals a problem with its negative DI value. High-ability students tend to choose option B rather than the other three options (i.e., options A, C and D). Interestingly, the proportion of students choosing the key answer, option D, is the lowest. This suggests that the item may be miskeyed.

A graphical representation of the above description is given by the ICCs of the items shown in Figure 4.6 and Figure 4.7. These ICC plots help item writers to visualize whether the distractors and the right answer function in the way they are expected to.

Figure 4.6: ICC Plot of Item with Negative Discrimination Index (Item 34)

Figure 4.7: ICC Plot of Item with Negative Discrimination Index (Item 38)

In Figure 4.6, the key is option B (represented by the green line). Students who belong to the high-ability group (theta above zero) fail to choose option B and are instead likely to choose option C (represented by the cyan line). In the case of Item 38, it is clear that options D (the key, shown by the red line) and B (shown by the green line) do not function as expected. Both options behave contrary to their intended purposes. Option D should attract more high-ability students than lower-ability students. Instead, the incorrect answer (option B) appears to function as the key because it is more appealing to the high-ability examinees.

In short, the discovery of many items with low discriminating power suggests that the

items in this particular MUET reading paper need considerable revision.

Fit Indices

From the viewpoint of the Rasch model, how well the test items contribute to the whole dataset can be investigated through the fit statistics. The basic guideline for examining the fit of an item relies on two statistics: the mean square (MNSQ) and t.

The item parameter estimates reveal that there are many misfitting items in this test. The sample size plays an important role here in determining the fit of the data to the model. With a sample this large, the confidence intervals for the mean-square values are very narrow, between 0.98 and 1.02, so even small departures from 1 are flagged as misfit. The sensitivity of the mean square statistic and the t statistic to sample size should therefore be taken into consideration in interpreting the misfit of the items.

In Table 4.4, it is noted that there is significant misfit in Item 5, Item 18, Item 23, Item 34 and Item 35. As an example, Item 34 is unlikely to fit the model because its weighted MNSQ (1.16) lies outside the confidence interval (0.98, 1.02) and its t value (17.6) is greater than 2.

Table 4.4: Examples of Misfit Items

================================================================================
ConQuest: Generalised Item Response Modelling Software     Tue Aug 31 18:39 2010
TABLES OF RESPONSE MODEL PARAMETER ESTIMATES
================================================================================
TERM 1: item
--------------------------------------------------------------------------------
VARIABLES                      UNWEIGHTED FIT                 WEIGHTED FIT
---------------            -----------------------        -----------------------
item     ESTIMATE  ERROR^   MNSQ       CI          T       MNSQ       CI          T
--------------------------------------------------------------------------------
5   5     -0.196   0.017    1.09  ( 0.97, 1.03)   6.0      1.08  ( 0.99, 1.01)  13.3
18  18    -0.370   0.017    1.09  ( 0.97, 1.03)   5.8      1.08  ( 0.99, 1.01)  12.6
23  23     0.432   0.017    1.11  ( 0.97, 1.03)   6.9      1.09  ( 0.98, 1.02)  10.7
34  34     0.504   0.018    1.22  ( 0.97, 1.03)  13.6      1.16  ( 0.98, 1.02)  17.6
35  35    -0.566   0.017    1.09  ( 0.97, 1.03)   5.8      1.07  ( 0.99, 1.01)  10.5
--------------------------------------------------------------------------------

Figure 4.8 below provides a visual illustration of a misfitting item. It shows that the empirical curve (green line) departs markedly from the modelled curve (blue line). In other words, the probability of obtaining the correct answer on this item does not match the expectation of the model, which indicates that the item does not fit the model well.

Figure 4.8: ICC Plot of Misfit Item (Item 34)

Test Reliability

In traditional item analysis, reliability is an essential characteristic of a good test

(McNamara, 1996; Allison, 1999). This is because if a test does not measure

consistently (reliably), one could not infer the score resulting from a particular

administration of a test to be an accurate index of students’ achievement. The reliability

of a test therefore refers to the extent to which the test is likely to produce consistent

scores.

Table 4.5: Summary of CTT Item Analysis

N                                8472
Mean                             21.19
Standard Deviation               6.33
Variance                         40.09
Skewness                         0.40
Kurtosis                         -0.25
Standard error of mean           0.07
Standard error of measurement    2.99
Coefficient Alpha                0.78

For this test, the coefficient alpha shown in Table 4.5 is 0.78. This signifies that the overall test is moderately reliable. A reliability index of this size indicates that most items are good by the standard of a classroom test, with probably a few items needing further inspection.

Another essential index which is linked to the reliability coefficient is the standard error of measurement (SEM). Conceptually, the SEM is related to test reliability because it indicates the amount of error contained in an examinee’s observed score. The SEM is a function of the reliability and the standard deviation (SD) of a test (Reynolds et al., 2009). The SEM gives a general indication of the overall precision of a test: the smaller the error, the more accurate the measurement provided by the test. Here the SEM is about 3. Also, observe that as the reliability index increases, the SEM decreases (see Table 4.6).
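The dependence of the SEM on the SD and the reliability can be made explicit with the usual CTT formula, SEM = SD × sqrt(1 - reliability). The sketch below applies that formula to the rounded values reported for this test; the function name is hypothetical, and the formula is the conventional one rather than a documented description of how ConQuest computes its figure.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Conventional CTT standard error of measurement."""
    return sd * math.sqrt(1.0 - reliability)


# Rounded SD and coefficient alpha reported for the full 45-item test.
sem_full = standard_error_of_measurement(6.33, 0.78)

# Holding the SD fixed, a higher reliability gives a smaller SEM.
sem_if_more_reliable = standard_error_of_measurement(6.33, 0.80)
```

With the rounded inputs the formula gives approximately 2.97, slightly below the reported 2.99; the small gap is presumably due to rounding of the reported alpha and SD.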

Ebel and Frisbie (1991) explain that test reliability is sensitive to item characteristics, particularly the discrimination index. Hence, in order to obtain a higher reliability, Alagumalai and Curtis (2005) suggest that low discriminating items should be removed from the test. Table 4.6 demonstrates how the deletion of non-discriminating items can improve the reliability of a test.

Table 4.6: Summary of Reliability Analyses

                                 43-ITEM DATA    36-ITEM DATA
N                                8472            8472
Mean                             20.80           17.64
Standard Deviation               6.36            5.93
Variance                         40.44           35.20
Skewness                         0.37            0.33
Kurtosis                         -0.34           -0.46
Standard error of mean           0.07            0.06
Standard error of measurement    2.94            2.66
Coefficient Alpha                0.79            0.80

The 43-item data in Table 4.6 is the dataset from which the 2 items with negative discrimination indices have been removed. In the 36-item data, the 9 items with a discrimination index ≤ 0.10 have been deleted from the dataset. Notice that the removal of low discriminating items improves the reliability index of the test (from 0.78 to 0.79 and 0.80). Interestingly, fewer items lead to higher reliability, which is evidence that poor items ought to be removed.

4.3 RELATIONSHIP BETWEEN ITEM CHARACTERISTICS AND ITEM DIFFICULTY

A regression analysis has been performed to show the predictive relationship between the dependent variable (item difficulty) and the six sets of item characteristics. Here, the stepwise method, which automatically builds a sequence of models ordered by their multiple correlation, has been employed.

The characteristics of the items studied are shown in Table 4.7.

Table 4.7: Characteristics of the MUET Reading Items

Type of variable   Predictor                                  Characteristics of item                                                   N
Passage            Length of the passage                      More than 25 lines                                                        7
                                                              More than 35 lines                                                        21
                                                              More than 45 lines                                                        17
Question type      Type of question                           Retrieving directly stated information (RI)                               10
                                                              Interpreting explicit information (IE)                                    12
                                                              Interpreting by making inference (II)                                     19
                                                              Reflecting on texts (RF)                                                  4
                   Inference type                             No inference                                                              3
                                                              Paraphrase inference                                                      7
                                                              Across-sentence (bridging) inference                                      30
                                                              Macrostructure (gist) inference                                           1
                                                              Reader-based (prior knowledge) inference                                  4
Question format    Structure of the responses/alternatives    Short lists of 1-4 words                                                  18
                                                              Simple phrases                                                            13
                                                              Complex phrases containing clauses that could stand alone as sentences    14
                   Number of options                          3-option                                                                  29
                                                              4-option                                                                  16
                   Plausibility of the distractors            No response options are plausible                                         23
                                                              One or more options are plausible                                         22

The analyses are expected to explore specific item characteristics associated with the

difficulty of the reading comprehension test items. The summary of the regression

analyses is shown in Table 4.8.


Table 4.8: Results of Linear Regression

Model   Predictors                                                   R      R square   Adjusted R square
1       (Constant), plausibility of the distractors                  .371   .137       .117
2       (Constant), plausibility of the distractors,                 .514   .264       .229
        structure of the responses
3       (Constant), plausibility of the distractors,                 .587   .345       .297
        structure of the responses, Inference level

As shown in Table 4.8, the values of R indicate the correlation between the predictors

and the dependent variable. It can be seen that plausibility of the distractors alone

accounts for about 14 percent of the variance in item difficulty (R square = .137). Model 3

(R = .587) demonstrates that the correlation between the dependent variable and the three

predictors (plausibility of the distractors, structure of the responses and inference type)

is fairly strong.

The table also reports a measure of model fit. Larger values of R square indicate that a

model fits the data better. For example, 34.5 percent of the variation in the dependent

variable is explained by Model 3. Apparently, the effects of the three predictors are

significant for these reading items.
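The three columns of Table 4.8 can be cross-checked against one another. The sketch below assumes that the 45 items are the cases in the regression (n = 45) and uses the usual adjustment formula, adjusted R square = 1 - (1 - R square)(n - 1)/(n - k - 1), where k is the number of predictors; the variable names are illustrative.

```python
N_ITEMS = 45  # assumption: each of the 45 items is one case in the regression


def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


# (R, R square, number of predictors) as reported in Table 4.8.
reported = {1: (0.371, 0.137, 1), 2: (0.514, 0.264, 2), 3: (0.587, 0.345, 3)}

recomputed = {
    model: round(adjusted_r_squared(r2, N_ITEMS, k), 3)
    for model, (r, r2, k) in reported.items()
}
for model, (r, r2, k) in reported.items():
    # R squared should reproduce R square up to rounding of R.
    print(model, round(r ** 2, 3), r2, recomputed[model])
```

Under this assumption the recomputed adjusted values agree with the reported .117, .229 and .297, and squaring each R reproduces the reported R square to within rounding.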

The regression results reveal a significant relationship between test features and item

difficulty. The findings single out three significant predictors of item difficulty:

plausibility of the distractors (p = .003), structure of the responses (p = .011) and

inference type (p = .030).

It is therefore of great interest to focus on these three variables. Table 4.9 presents the

result of the coefficient analysis, which shows the statistical significance (conventionally

set at p ≤ 0.05) and the direction of the relationship for each of these predictors.


Table 4.9: Results of Coefficient Analysis

                                            Unstandardized          Standardized
                                            Coefficients            Coefficients
Model   Predictor                           B         Std. Error    Beta           t         Sig.
1       (Constant)                          -1.087    .438                         -2.481    .017
        plausibility of the distractors     .730      .279          .371           2.617     .012
2       (Constant)                          -.494     .465                         -1.062    .294
        plausibility of the distractors     .880      .267          .447           3.301     .002
        structure of the responses          -.427     .159          -.364          -2.688    .010
3       (Constant)                          -1.363    .589                         -2.315    .026
        plausibility of the distractors     .818      .256          .415           3.192     .003
        structure of the responses          -.406     .152          -.346          -2.669    .011
        Inference Type                      .317      .141          .286           2.249     .030

a Dependent Variable: item difficulty estimate

It is noted that plausibility of the distractors is highly significant (p = .003). The positive

sign of the coefficient signals a positive relationship: items with one or more plausible

distractors tend to be more difficult. Next, the effect of structure of the responses is also

statistically significant (p = .011). Nonetheless, its coefficient is negative. It seems to

imply that the longer the structure of the response, the easier the item; this result is

somewhat unexpected. The third best predictor, inference type, also appears to be

significantly related to the difficulty of the items in this test (p = .030). This supports the

notion that items requiring a higher level of inference skills are more difficult to answer.
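Each t value in Table 4.9 is the coefficient B divided by its standard error, and the Model 3 coefficients together define a prediction equation. The sketch below illustrates both. Two points are assumptions: the Model 3 constant is taken as -1.363, the sign implied by its reported t of -2.315, and the numeric coding of the predictors (plausibility 1-2, structure 1-3, inference 1-5) is hypothetical, since the coding scheme is not given in this section.

```python
# Model 3 from Table 4.9: name -> (B, standard error, reported t).
MODEL_3 = {
    "constant": (-1.363, 0.589, -2.315),  # sign of B inferred from the reported t
    "plausibility": (0.818, 0.256, 3.192),
    "structure": (-0.406, 0.152, -2.669),
    "inference": (0.317, 0.141, 2.249),
}

# t = B / SE; small differences from the reported t come from rounding.
t_ratios = {name: b / se for name, (b, se, _) in MODEL_3.items()}


def predicted_difficulty(plausibility, structure, inference):
    """Predicted item difficulty under a hypothetical coding of the predictors."""
    return (MODEL_3["constant"][0]
            + MODEL_3["plausibility"][0] * plausibility
            + MODEL_3["structure"][0] * structure
            + MODEL_3["inference"][0] * inference)


# Hypothetical items: plausible distractors, short options, high inference level,
# versus no plausible distractors, complex options, no inference.
hard_item = predicted_difficulty(plausibility=2, structure=1, inference=5)
easy_item = predicted_difficulty(plausibility=1, structure=3, inference=1)
print({k: round(v, 3) for k, v in t_ratios.items()}, hard_item, easy_item)
```

Under this coding the first hypothetical item is predicted to be considerably harder than the second, which mirrors the direction of the three coefficients.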

The direction of these relationships can be seen in the following graphs. Figure 4.9 and

Figure 4.10 visualize the positive relationships of plausibility of the distractors and

inference level with item difficulty. It is seen that items with plausible distractors and

items at higher inference levels tend to be more difficult.


Figure 4.9: Boxplot of Interaction between Plausibility of Distractors and Item Difficulty

Figure 4.10: Boxplot of Interaction between Inference Level and Item Difficulty


Figure 4.11: Boxplot of Interaction between Structure of Distractors and Item Difficulty

Conversely, the relationship between structure of the responses and the dependent

variable is negative (Figure 4.11). Surprisingly, items with short lists of 1-4 words turn

out to be the more difficult items, while items with complex structures are the easier

ones. This is probably due to the interaction of this variable with the other predictors,

plausibility of the distractors or inference level. The test data in Table 4.10 reveal that

40% of the items in this MUET reading test (18 of 45) have short lists of responses. Of

these 18 items, 13 are categorised in levels 3, 4 and 5 of the inference types. On the other

hand, none of the items with longer and more complex phrases requires macrostructure

or prior knowledge inferences.


Table 4.10: Interaction between Structure of Responses and Inference Type

                                                      Structure of the responses/alternatives
Inference type                                N       Short list   Simple phrase   Complex phrase
No inference                                  3       0            3               0
Paraphrase inference                          7       5            0               2
Across-sentence (bridging) inference          30      11           7               12
Macrostructure (gist) inference               1       0            1               0
Reader-based (prior knowledge) inference      4       2            2               0
Total                                         45      18           13              14
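The counts quoted in the text can be recovered directly from the cross-tabulation in Table 4.10. The short sketch below stores the table as a dictionary and recomputes the column totals, the share of short-list items, and the number of short-list items at inference levels 3 to 5. The variable names are illustrative.

```python
# Table 4.10: inference type -> (short list, simple phrase, complex phrase).
# The rows are ordered from inference level 1 to level 5.
CROSSTAB = {
    "No inference": (0, 3, 0),
    "Paraphrase inference": (5, 0, 2),
    "Across-sentence (bridging) inference": (11, 7, 12),
    "Macrostructure (gist) inference": (0, 1, 0),
    "Reader-based (prior knowledge) inference": (2, 2, 0),
}

rows = list(CROSSTAB.values())
column_totals = [sum(row[c] for row in rows) for c in range(3)]
total_items = sum(column_totals)

short_list_share = column_totals[0] / total_items
short_list_levels_3_to_5 = sum(row[0] for row in rows[2:])
complex_levels_4_and_5 = sum(row[2] for row in rows[3:])

print(column_totals, total_items, short_list_share,
      short_list_levels_3_to_5, complex_levels_4_and_5)
```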

The preceding results indicate that the other variables (i.e. length of the passage, type

of question and number of options) do not play an important role in influencing the item

difficulty of the MUET reading test.

In short, items in this data set are more difficult if they have the following

characteristics:

- Contain one or more plausible distractors

- Consist of simple and short option structures

- Require a higher level of inference skills

It is also important to conclude here that two predictors – plausibility of the distractors

and structure of the responses – reveal the presence of construct-irrelevant variance in

this set of MUET reading tests.


4.4 RESULT OF DIF ANALYSES

DIF occurs when people from different groups (commonly gender, nationality or

ethnicity) with the same latent trait have a different probability of giving a certain

response to an item.

In the Rasch model, an item exhibits DIF when people from different groups with the same

underlying ability have a different probability of giving a certain response. A DIF

estimate is regarded as statistically significant when its ratio to its standard error falls

outside the range of -2 to +2.

4.4.1 Findings of Gender DIF

Table 4.11: Summary of Overall Performance between Males and Females

TERM 1: gender
------------------------------------------------------------------------
VARIABLES                    UNWEIGHTED FIT            WEIGHTED FIT
gender   ESTIMATE   ERROR^   MNSQ   CI            T    MNSQ   CI            T
------------------------------------------------------------------------
L        -0.027     0.008    1.04   (0.95,1.05)   1.6  1.04   (0.95,1.05)   1.5
P         0.027*    0.008    1.04   (0.96,1.04)   1.8  1.03   (0.96,1.04)   1.6
------------------------------------------------------------------------
An asterisk next to a parameter estimate indicates that it is constrained
Separation Reliability Not Applicable
Chi-square test of parameter equality = 11.10, df = 1
^ Empirical standard errors have been used

Table 4.11 shows the comparison of overall performance between groups on this MUET

reading comprehension test. A negative sign indicates that the test is easier for that

group; the two groups are male (L) and female (P). The result indicates that male

students have performed slightly better than female students.

The difference is statistically significant: the ratio of the parameter estimate to its

standard error is 3.375. The fact that the parameter estimate is more than twice its

standard error indicates that this difference is statistically significant (Wu et al., 2007).

Table 4.12 reports the result of the DIF investigation on gender differences for this data.

It reveals that there are 19 items which are flagged as having DIF. Notice that the

gender-by-item interaction estimates are more than twice their standard errors, which

indicates a significant difference between these two subgroups. Ten items (Items 8, 12,

14, 16, 26, 33, 34, 35, 37 and 44) favour females while the other nine items (Items 2, 4,

6, 9, 13, 19, 21, 40 and 45) seem to favour male students.

Table 4.12: Parameter Estimates of Gender DIF Investigation

TERM 3: gender*item
--------------------------------------------------------------------------------
VARIABLES                           UNWEIGHTED FIT              WEIGHTED FIT
gender  item  ESTIMATE  ERROR^   MNSQ  CI           T       MNSQ  CI           T
--------------------------------------------------------------------------------
1L      2     -0.076    0.023    0.89  (0.95,1.05)  -4.9    0.90  (0.98,1.02)  -10.8
2P      2      0.076*   0.023    0.90  (0.96,1.04)  -4.9    0.91  (0.98,1.02)  -11.2
1L      4     -0.065    0.023    0.94  (0.95,1.05)  -2.7    0.94  (0.98,1.02)  -5.7
2P      4      0.065*   0.023    0.95  (0.96,1.04)  -2.5    0.96  (0.98,1.02)  -5.5
1L      6     -0.121    0.024    0.94  (0.95,1.05)  -2.6    0.96  (0.97,1.03)  -3.2
2P      6      0.121*   0.024    0.95  (0.96,1.04)  -2.7    0.96  (0.98,1.02)  -4.9
1L      8      0.084    0.025    1.09  (0.95,1.05)   3.6    1.06  (0.97,1.03)   3.6
2P      8     -0.084*   0.025    1.10  (0.96,1.04)   4.8    1.05  (0.97,1.03)   3.8
1L      9     -0.047    0.028    1.10  (0.95,1.05)   3.9    1.04  (0.95,1.05)   1.6
2P      9      0.047*   0.028    1.09  (0.96,1.04)   4.2    1.04  (0.96,1.04)   1.7
1L      12     0.064    0.023    1.01  (0.95,1.05)   0.4    1.00  (0.98,1.02)   0.1
2P      12    -0.064*   0.023    1.01  (0.96,1.04)   0.7    1.01  (0.98,1.02)   1.1
1L      13    -0.076    0.023    0.96  (0.95,1.05)  -1.6    0.96  (0.98,1.02)  -3.9
2P      13     0.076*   0.023    0.96  (0.96,1.04)  -1.9    0.96  (0.98,1.02)  -4.6
1L      14     0.103    0.024    0.99  (0.95,1.05)  -0.3    0.99  (0.97,1.03)  -0.9
2P      14    -0.103*   0.024    1.02  (0.96,1.04)   0.9    1.01  (0.98,1.02)   1.1
1L      16     0.066    0.023    1.02  (0.95,1.05)   0.6    1.01  (0.98,1.02)   1.5
2P      16    -0.066*   0.023    1.02  (0.96,1.04)   1.0    1.02  (0.98,1.02)   2.4
1L      19    -0.053    0.023    1.03  (0.95,1.05)   1.1    1.02  (0.98,1.02)   2.1
2P      19     0.053*   0.023    1.00  (0.96,1.04)  -0.2    1.00  (0.98,1.02)  -0.5
1L      21    -0.056    0.023    1.03  (0.95,1.05)   1.2    1.02  (0.98,1.02)   2.1
2P      21     0.056*   0.023    1.02  (0.96,1.04)   0.9    1.02  (0.98,1.02)   2.0
1L      26     0.100    0.029    0.96  (0.95,1.05)  -1.9    0.98  (0.95,1.05)  -0.7
2P      26    -0.100*   0.029    0.96  (0.96,1.04)  -2.3    0.99  (0.96,1.04)  -0.3
1L      33     0.066    0.024    0.89  (0.95,1.05)  -4.9    0.92  (0.97,1.03)  -5.3
2P      33    -0.066*   0.024    0.91  (0.96,1.04)  -4.6    0.95  (0.98,1.02)  -4.3
1L      34     0.054    0.024    1.21  (0.95,1.05)   8.0    1.15  (0.97,1.03)   9.9
2P      34    -0.054*   0.024    1.24  (0.96,1.04)  11.1    1.17  (0.98,1.02)  14.8
1L      35     0.081    0.023    1.08  (0.95,1.05)   3.3    1.07  (0.98,1.02)   6.2
2P      35    -0.081*   0.023    1.09  (0.96,1.04)   4.7    1.08  (0.98,1.02)   8.2
1L      37     0.081    0.027    0.86  (0.95,1.05)  -6.3    0.93  (0.96,1.04)  -3.5
2P      37    -0.081*   0.027    0.86  (0.96,1.04)  -7.6    0.94  (0.96,1.04)  -3.4
1L      40    -0.066    0.023    0.96  (0.95,1.05)  -1.7    0.96  (0.98,1.02)  -3.6
2P      40     0.066*   0.023    0.96  (0.96,1.04)  -2.1    0.96  (0.98,1.02)  -4.9
1L      44     0.083    0.025    0.99  (0.95,1.05)  -0.5    0.99  (0.97,1.03)  -0.5
2P      44    -0.083*   0.025    1.03  (0.96,1.04)   1.7    1.02  (0.97,1.03)   1.8
1L      45    -0.076*   0.026    0.97  (0.95,1.05)  -1.3    0.97  (0.96,1.04)  -1.7
2P      45     0.076*   0.026    0.96  (0.96,1.04)  -1.8    0.97  (0.97,1.03)  -1.8
--------------------------------------------------------------------------------
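The flagging rule described in this section, an estimate more than twice its standard error, can be sketched as a simple ratio check. The function names and the threshold default are illustrative assumptions. The inputs are the rounded gender term from Table 4.11 and the male (L) rows for three items chosen from Table 4.12 as examples; because the published standard errors are rounded, the ratios are approximate.

```python
def dif_ratio(estimate, std_error):
    """Ratio of a parameter estimate to its standard error."""
    return estimate / std_error


def is_flagged(estimate, std_error, threshold=2.0):
    """Flag DIF when the estimate is more than `threshold` standard errors from 0."""
    return abs(dif_ratio(estimate, std_error)) > threshold


def favoured_group(male_estimate):
    """A negative estimate for males (L) means the item is easier for males."""
    return "male" if male_estimate < 0 else "female"


# Overall gender term from Table 4.11 (male row).
gender_ratio = dif_ratio(-0.027, 0.008)

# Male (L) rows for three example items from Table 4.12: item -> (estimate, error).
male_rows = {2: (-0.076, 0.023), 6: (-0.121, 0.024), 8: (0.084, 0.025)}

summary = {
    item: (round(dif_ratio(est, err), 2), is_flagged(est, err), favoured_group(est))
    for item, (est, err) in male_rows.items()
}
print(round(abs(gender_ratio), 3), summary)
```

The absolute ratio for the overall gender term reproduces the value of 3.375 quoted in the text, and the direction assigned to Items 2, 6 and 8 agrees with the lists of items favouring males and females.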


Another method to trace DIF in an item is through visual inspection of its ICC. Figure

4.12 of Item 6 demonstrates that the empirical curves of the two groups (males

represented by the blue line, females by the green line) are far apart from each

other. This ICC also shows that the probability of answering this item correctly is

higher for male students than for female students.

Figure 4.12: ICC Plot of Item with Gender DIF (Item 6)

Figure 4.13 represents the graphical display of an item with no DIF. Unlike items which

exhibit DIF, this item (Item 15) shows that the empirical curves which represent

females and males are very close. This is an indication that there is no significant

difference between males and females in responding to this item.


Figure 4.13: ICC Plot of Item without Gender DIF (Item 15)
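To make the idea behind these plots concrete, the sketch below computes model-based Rasch probabilities for two groups whose item difficulties differ by a DIF term. It is an illustration under stated assumptions and not a reproduction of Figures 4.12 and 4.13, which show empirical curves. The base item difficulty of 0 logits is hypothetical, the group main effect is ignored, and the DIF terms are the Item 6 estimates from Table 4.12.

```python
from math import exp


def rasch_probability(theta, difficulty):
    """Rasch model: probability of a correct response at ability theta."""
    return 1 / (1 + exp(-(theta - difficulty)))


BASE_DIFFICULTY = 0.0  # hypothetical item difficulty in logits
DIF_MALE = -0.121      # Item 6 estimate for males (L), Table 4.12
DIF_FEMALE = 0.121     # Item 6 estimate for females (P), Table 4.12

ABILITIES = (-2, -1, 0, 1, 2)
curves = [
    (
        theta,
        rasch_probability(theta, BASE_DIFFICULTY + DIF_MALE),
        rasch_probability(theta, BASE_DIFFICULTY + DIF_FEMALE),
    )
    for theta in ABILITIES
]
for theta, p_male, p_female in curves:
    print(theta, round(p_male, 3), round(p_female, 3))
```

At every ability level the male curve lies above the female curve, which is the pattern a DIF item shows in an ICC plot; for an item without DIF both DIF terms would be close to zero and the two curves would coincide.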

A further examination of DIF for all the items in this test suggests a pattern in how

the two groups respond to specific types of passage. Interestingly, it is noted that three

out of six passages are relatively easier for male students. These passages deal with

topics related to legal issues (Passage 1), business (Passage 3) and leadership

(Passage 6). On the other hand, topics related to advertisement/communication

(Passage 2) and travel (Passage 5) appear to advantage female students.

4.4.2 Findings of Ethnicity and State DIF

The DIF analyses for ethnicity and state are discussed together because these two facets

are related to each other. In the context of this study, geographical location is

characterized by its ethnic composition. Many non-Bumiputeras reside in the capital

territory, Kuala Lumpur, while Bumiputeras (comprising more than 30 different ethnic

groups) are found in Sabah.

The estimate of the ethnicity parameter in Table 4.13 indicates that this test is relatively

difficult for the Bumiputeras, who make up most of the candidates from Sabah.

Presumably, this is because Sabahans, who are multilingual, do not consider English

their second language. This finding is consistent with the summary result for the

difference between states, as the test appears to advantage candidates in Kuala Lumpur,

who are familiar with the use of the English language in their daily life.

Table 4.13: Summary of Overall Performance between Ethnics and States

TERM 1: Ethnicity
------------------------------------------------------------------------
VARIABLES                    UNWEIGHTED FIT            WEIGHTED FIT
ethnic   ESTIMATE   ERROR^   MNSQ   CI            T    MNSQ   CI            T
------------------------------------------------------------------------
1 B       0.255     0.004    1.02   (0.97,1.03)   1.2  1.02   (0.97,1.03)   1.1
2 N      -0.255*    0.004    1.09   (0.92,1.08)   2.2  1.08   (0.92,1.08)   1.9
------------------------------------------------------------------------
An asterisk next to a parameter estimate indicates that it is constrained
Separation Reliability Not Applicable
Chi-square test of parameter equality = 4884.57, df = 1

TERM 1: State
------------------------------------------------------------------------
VARIABLES                            UNWEIGHTED FIT            WEIGHTED FIT
state            ESTIMATE   ERROR^   MNSQ   CI            T     MNSQ   CI            T
------------------------------------------------------------------------
1 Kuala Lumpur   -0.329     0.004    1.09   (0.95,1.05)   3.9   1.08   (0.95,1.05)   3.5
2 Sabah           0.329*    0.004    0.99   (0.96,1.04)  -0.7   0.99   (0.96,1.04)  -0.6
------------------------------------------------------------------------
An asterisk next to a parameter estimate indicates that it is constrained
Separation Reliability Not Applicable
Chi-square test of parameter equality = 8093.95, df = 1

A closer look at the results of the analyses reveals a significant discrepancy between

the two ethnic groups and between the two states. Notice that the parameter estimate in

both analyses is larger than twice its standard error. The parameter estimate for state,

for example, is about 82 times its standard error (0.329 against 0.004). This huge ratio

implies that there is a great difference between Bumiputera and non-Bumiputera and

between candidates in Kuala Lumpur and Sabah.

DIF between states and ethnic groups can also be examined by looking at the ICC

plots of the individual items. For example, Item 30 in Figure 4.14 illustrates that there is

a wide gap between the empirical curves of Bumiputera (blue line) and non-Bumiputera

(green line). Here, the two empirical curves lie far apart from each other.

Figure 4.14: ICC Plot of Item with Ethnicity DIF (Item 30)

Full analyses of DIF between ethnicity and states report that there are many items

flagged as having DIF. There are 37 items with DIF for states and 35 items for

Bumiputera and non-Bumiputera. The following table summarizes ten items with the

largest DIF index for both facets.


Table 4.14: Items with the Largest DIF Indices

ETHNICITY                                  STATE
ITEM   DIF INDEX   IN FAVOUR OF            ITEM   DIF INDEX   IN FAVOUR OF
33     23.9        Non Bumiputera          38     36.5        Sabah
34     22.2        Bumiputera              34     29.5        Sabah
23     22.1        Bumiputera              30     20.6        Kuala Lumpur
38     20.4        Bumiputera              2      19.6        Kuala Lumpur
30     18.4        Non Bumiputera          33     19.2        Kuala Lumpur
2      18.2        Non Bumiputera          37     18.3        Kuala Lumpur
17     14.7        Non Bumiputera          23     16.5        Sabah
32     12.5        Non Bumiputera          17     15          Kuala Lumpur


Notice here that those items which show the existence of DIF between the states and

ethnics tend to have large DIF indices.

An interesting examination of the results in Table 4.14 shows that there are 9 items

which display DIF in both facets. Items 2, 17, 30, and 33 appear to advantage the

candidates from Kuala Lumpur and non-Bumiputera. On the other hand, Items 23, 34,

35, 38 and 42 are relatively easier for Bumiputera and candidates from Sabah. A

possible explanation for this observation can be attributed to the composition of

population in both states.


CHAPTER 5: DISCUSSION AND CONCLUSION

5.1 INTRODUCTION

This study sets out to explore several issues which may affect the difficulty level of

MUET, the high-stakes English test used for admission into Malaysian public

universities. The primary focus of the study is on assessing the quality of

the highest weighted component of MUET, that is, the reading test. The 45 test items

have been scrutinised through item analyses based on CTT and the Rasch model. Also,

the study investigates to what extent features of test items and student background

characteristics influence the difficulty of the items. These two issues have been

examined through regression analysis and DIF analysis respectively.

The subsequent sections present the summary of the main findings. The discussion

focuses on addressing the research questions raised in Chapter 1. Furthermore, this

chapter discusses the implications of the findings for practice and future research.

5.2 DISCUSSION OF MAJOR FINDINGS

This section discusses the findings with respect to the four research questions posed in

Chapter 1. The results from item analysis, regression analysis and DIF analysis are used

to address the questions.

How do the items spread in terms of their difficulty values and the ability of the

students?

It is concluded that the test was reasonably well targeted to the ability of the students.

On the basis of the CTT and Rasch analyses, it can be seen that there is an equal

distribution of easy and difficult items in the test. The Rasch analyses provide a clear

view of the item distribution and student ability through the item-person map. From the

map, it is clear that the majority of the items are distributed around the average ability

of students, at -0.168 logits.


The person-item map from the Rasch model signifies its advantage over CTT.

McNamara (1996) views the calibration of item difficulty and person ability on the

same scale as one of the key features of the Rasch model. This is because the map is

very useful for the test constructors to trace how well the items are matched to the

ability of students.

How good are the items of the MUET reading test?

The quality of the items was examined using discrimination index, reliability index and

fit statistics.

CTT analysis of item discrimination shows that 60 percent of the items are classified as

weak items with 9 items having discrimination index ≤ 0.10. Further investigation of

items with a negative discrimination index (Item 34 and Item 38) indicates that the

response options are miskeyed and the distractors are misleading. As advocated by

Popham (2000) and Reynolds et al. (2009), these problematic items need closer

examination. If such items were to be retained for future use, it is evident that they

would require revision in terms of rewording or omitting the options given.

The test as a whole is moderately reliable. Its coefficient alpha of 0.78

suggests that there are several items which need further inspection. The identification of

many low discriminating items appears to contribute to this moderate reliability index.

It would be appropriate for this test to have a reliability estimate as high as 0.90, or at

least 0.85, given its significance to pre-degree students who wish to enrol in undergraduate

programmes in public universities. This is supported by Nunnally and Bernstein (1994),

who recommended a reliability estimate of 0.90 or even 0.95 for tests which involve

decision making about individuals.

The Rasch analysis of fit statistics reveals that there are many misfitting items in the

test. This signals poor item construction. The discovery of miskeyed / misleading

distractors of Item 34 and Item 38 and other low discriminating items can be traced as

the source of misfits. However, interpretation of misfit should be treated with caution.

Misfitting items are not necessarily problematic items. Keeves and Alagumalai (1999)

warned that the sensitivity of fit statistics to sample size should be taken into account in

interpreting item fit. With a sample size of 8472, departure from the model can be

detected easily through the fit statistics. Thus, misfitting items should not be discarded

without good reason.

From the complementary findings based on the two measurement approaches, it appears

that the IRT model supplements rather than contradicts CTT (Lord, 1980; Barnard,

1999; Zubairi & Abu Kassim, 2006).

To what extent do selected features of the test items contribute to the item difficulty in

the MUET reading test?

Characteristics of test items have a marked effect on the item difficulty of this reading

test. Three of the six predictors are significantly related to item difficulty. Plausibility

of the distractors has the most significant relationship, followed by structure of the

options and inference level.

There are significant and positive relationships between two independent variables

(plausibility of distractors and inference level) and item difficulty. The difficulty level

of an item is primarily determined by plausibility of distractors, which on its own

accounts for about 14 percent of the variance (R = .371, R square = .137). A positive

coefficient (.818) implies that an item tends to be more difficult when one or more

distractors are plausible. This finding is in line with Rupp et al. (2001) and Drum et al.

(1981), who found that difficulty increases with the number of plausible distractors.

Items with more plausible distractors are difficult to answer because the distractors

seem to share several features with the correct answer.

The inference level of the questions also influences difficulty level, though the effect is

relatively small. Apparently, items requiring a higher level of cognitive skills turn out to

be difficult. Item 42 (one of the most difficult items), for instance, requires students to

bring in their prior knowledge of language function in order to respond to the question

correctly. This result corroborates the view of Hamzah and Abdullah (2009) and


Sarudin and Zubairy (2008) who claimed that lack of metacognitive skills is one of the

factors contributing to reading problems among Malaysian students.

In addition to the above variables, item difficulty appears to be influenced by the

structure of the options in a negative direction (coefficient = -.406). Surprisingly, this

group of examinees found items with longer and complex structure easier than items

with options consisting of 1-4 words. One possible explanation for this unexpected

trend is that there are strong interactions among the three significant predictors.

Possibly, the characteristics of the other two variables (i.e. plausibility of the distractors

and inference level) have played a greater role in influencing the item difficulty. The

results show that items with options consisting of longer and complex phrases are those

requiring lower inference skills. In contrast, items with simple and short options require

students to use macrostructure inference and schemata to reach an answer.

An important aspect that has emerged from the above situation is that there is evidence

of interaction among the predictors in determining the difficulty level of items. This is

supported by the significant regression results which imply that the three variables,

alone or in combination, have accounted for significant variance in item difficulty in

this MUET reading test.

The exploration in this study also indicates that the other variables (length of the

passage, type of question and number of options) do not seem to make the items more

difficult, contrary to what was hypothesized. In contrast to earlier research (e.g. Just & Carpenter,

1992; Ozuru et al., 2008; Rupp et al., 2001), this study does not find that the longer the

passage, the more difficult the question. In addition, explicitness and implicitness of

information in a text and number of options are not significant factors influencing

examinees’ performance on the test. The contradictory results might be due to an

uneven distribution of coded categories within each of the variables. As an example, for

the length of the passage, almost half of the items are coded within the category of

passage having more than 35 lines.


Taken altogether, these findings suggest that the difficulty level of the MUET reading test

is significantly affected by variables related to question format (plausibility of

distractors and structure of the options) rather than passage-related variables and

question-type variables. In other words, there is construct-irrelevant variance in this

reading test, and this may result in negative washback on the teaching and learning of

the MUET syllabus, particularly the reading component. Test developers of MUET,

therefore, should test the range of the construct that needs to be measured and avoid

test-method effects and other contributors to construct-irrelevant variance.

Is there any differential item functioning in the MUET reading test in terms of


From the gender analysis, some test items are found to be easier for one group than the

other. Ten items which exhibit DIF tend to favour females while the other 9 advantage

males. A possible explanation for these gender variations lies in the response of the

two groups to the subject matter of the passages. The results show that items that originate

from male-friendly topics (i.e. legal issues, business and leadership) are relatively easier

for males than females. In contrast, items from passages which deal with

communication and travel seem to favour females.

Such differences are considered to be based purely on gender preference. This confirms

the view of previous researchers (e.g. O’Neill & McPeek, 1993; Dolittle & Welch,

1989) that males and females display some distinct preference regarding reading. The

responses of examinees in this test indicate that females are interested in

humanities-related reading materials. On the other hand, males prefer topics related to

law, business and governance, perhaps because they play a dominant role in these fields.

In terms of ethnicity and geographical location, the findings reveal a large disparity

between the groups being compared. Many items appear to be difficult for Bumiputeras

and Sabahans, and correspondingly easier for non-Bumiputeras and candidates from

Kuala Lumpur. As with gender, this significant discrepancy is attributed to real

differences between the groups. This means that population composition in both states

plays an important role in affecting students’ responses to an item. Many items are easy

for non-Bumiputeras, who make up most of the candidates from Kuala Lumpur; the

majority of these prefer to use English in their daily life.

The results in this study replicate observations from previous projects (Elder, 1996;

Chen & Henning, 1985) which concluded that actual differences between subgroups of

examinees are seen as a potential source of DIF. Therefore, before omitting a

‘problematic’ item, it should be reviewed thoroughly by the panel of experts to

illuminate the possible causes of its significant difference.

5.3 IMPLICATIONS OF THE FINDINGS

The findings yielded in this study have implications for different groups of people. The

first group is language teachers. Utilization of item analysis can help teachers to identify

misconceptions in the materials that need further explanation. With regard to the effects

of test features on students’ performance in reading test, it provides insights for teachers

to focus more on the features that have strong influence on the difficulty of an item. In

addition, DIF analysis helps teachers to understand the difference between groups so

that they can look for solutions to decrease the gap.

The second group is test developers. It is clear that an examination of individual items is

fundamental in test evaluation as it allows test developers to identify problematic items

and remedy them. Moreover, the complementary nature of CTT and the Rasch model

demonstrates that test constructors can incorporate both approaches as measurement

strategies for test design and evaluation. Furthermore, the evidence that test features and

individuals’ characteristics affect examinees’ performance on an item can guide test

makers to balance content and item features in the test specification, so that they can

reduce these unwanted effects. Findings from both analyses can also be used as

validation evidence to clarify what the test is measuring. The study also shows that DIF is

difficult to interpret. Therefore, removal of items flagged as DIF should be considered

seriously.


The last group is researchers of reading and language assessment. The findings have

added support to the notion that CTT and the Rasch model are complementary

approaches which prove to be useful tools for language test development and

evaluation. In addition, test feature effects on item difficulty provide useful information

about specific variables that can influence students’ performance on a reading test. The

findings of DIF also show that DIF is not necessarily an indication that the test

disadvantages one particular group. This reemphasizes the standpoint that DIF is not

necessarily evidence for test bias (Angoff, 1993; Camilli & Shepard, 1994; McNamara

& Roever, 2006; de Ayala, 2009). This study also presents directions for researchers to

investigate several issues identified below.

5.4 DIRECTIONS FOR FUTURE RESEARCH

The findings of the study have shed light on directions for future research. The

following are several areas that seem ripe for further exploration:

1. Since the results are based on a limited item pool (45 items) and passages (6

passages), there is a need to replicate the study on a larger set of items and

passages so that the findings can be generalized to a wider context. It would also

be interesting to explore the same issues on the other components of MUET:

listening, speaking and writing. This would provide a clearer picture of the factors

that might contribute to unsatisfactory grades on MUET.

2. The unexpected effect of the structure of the options clearly deserves more

attention in future. Further investigation is necessary to explain the negative

effects of this variable on the difficulty level of reading items.

3. Increasing the number of variables might be useful in examining the influence of

test features on students’ success on reading assessment or other types of

language tests. Other important item characteristics that might be added to future

study are correct-answer variables, stem-related variables, vocabulary-related

variables and new passage-related variables.

4. Given that pre-university students are streamed into various classes (i.e.

Science, Art and Commerce), it would be useful to investigate the effect of

students’ background discipline on reading comprehension. Presumably,

examinees’ responses to specific subject matter are related to their prior

knowledge – particularly to the field/course they have taken.

5. Another important area for additional research is the interaction between indicators

of DIF and item characteristics as a source contributing to item difficulty. It

would be of great interest to explore if a causal link between these two facets

could be established in influencing students’ responses to an item.

5.5 CONCLUSION

This study is built primarily on the exploration of factors that affect students’

performance on the reading test. Previous research on reading English as a second language

in the Malaysian context has seldom examined the role of item-level analysis as an

explanation for why the reading test of MUET appears to be challenging for Malaysian

students. The current study addresses this gap.

One salient finding that has emerged from this study is the usefulness of CTT and the

Rasch model as tools for the measurement and evaluation of language tests. It is shown that

both psychometric theories complement each other. The study also concludes that

characteristics of the test items are significant factors in item difficulty. It is found that

plausibility of distractors, structure of the options and inference level have a strong

influence on the difficulty level of the items in this reading test. The DIF procedure,

furthermore, provides insights into the influence of examinees’ background on their

success in responding to individual items. Real differences between gender, ethnic and

state groups are seen as the source of DIF in this test.
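The complementary roles of the two frameworks can be illustrated with a minimal Python sketch. It is not the analysis run in this study (which used ACER ConQuest); the function names and the toy response matrix are assumptions made purely for illustration. The CTT functions summarise observed responses (item facility and an uncorrected point-biserial discrimination), while the Rasch function gives the model-based probability of success for a given ability and item difficulty in logits.

```python
import math
from statistics import mean, pstdev


def item_facility(item_scores):
    """CTT facility: proportion of examinees who answered the item correctly."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """CTT discrimination: Pearson correlation between the 0/1 item score and
    the total score (uncorrected, i.e. the item is included in the total)."""
    mi, mt = mean(item_scores), mean(total_scores)
    cov = mean([(i - mi) * (t - mt) for i, t in zip(item_scores, total_scores)])
    return cov / (pstdev(item_scores) * pstdev(total_scores))


def rasch_probability(theta, delta):
    """Rasch model: probability of a correct response for a person of
    ability theta on an item of difficulty delta (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


# Toy data for illustration only (NOT the MUET sample): 5 examinees x 3 items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
]
totals = [sum(row) for row in responses]

for j in range(3):
    column = [row[j] for row in responses]
    print(f"item {j + 1}: facility={item_facility(column):.2f}, "
          f"point-biserial={point_biserial(column, totals):.2f}")

# Under the Rasch model, a person whose ability equals the item difficulty
# has an even chance of success.
print(rasch_probability(0.0, 0.0))
```

In this framing, DIF corresponds to an item whose difficulty parameter differs between groups of examinees with the same ability, which is why such a difference need not by itself indicate bias.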

Conclusions derived from this study should be interpreted in the light of a few

constraints. Nevertheless, as a small and focused analysis, this research may provide

useful insights for researchers, test developers and educators into important issues that

need to be dealt with in reading assessment.

REFERENCES:

Abdul Majid, F., M.Jelas, Z., & Azman, N. (2002). Selected Malaysian adult learners' academic reading strategies: a case study (Publication. Retrieved 15 April 2010: http:www.face.stir.ac.uk/documents/Paper61

Adams, R. J. (2005). Reliability as a measurement design effect. Studies in Educational Evaluation, 31(2-3), 167-172.

Adams, R. J., & Khoo, S. T. (1993). QUEST: The interactive test analysis system. Melbourne: ACER.

Adamson, H. D. (1993). Academic competence: theory and classroom practice: preparing ESL for content courses. White Plains, NY: Longmans.

Alagumalai, S., & Curtis, D. (2005). Classical test theory. In S. Alagumalai, D. D. Curtis & N. Hungi (Eds.), Applied Rasch measurement: a book of exemplars: papers in honour of John P. Keeves (pp. 1-14). Dordrecht; Norwell, MA: Springer.

Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.

Alderson, J. C., & Urquhart, A. H. (1983). The effect of student background discipline on comprehension: a pilot study. In A. Hughes & D. Porter (Eds.), Current development in language testing (pp. 121-128). London: Academic Press.

Allison, D. (1999). Language testing and evaluation: an introductory course. Singapore: Singapore University Press.

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, D.C.: American Educational Research Association.

Angoff, W. H. (1993). Perspective on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-33). New Jersey: Lawrence Erlbaum Associates, Publishers.

Athanasou, J. A., & Lamprianou, I. (2004). Reading in one's ethnic language: a study of Greek-Australian high school students. Australian Journal of Educational & Developmental Psychology, 4, 86-96.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.

Baker, D. (1989). Language testing: a critical survey and practical guide. London: Edward Arnold.

Baker, E. L. (1989). Mandated tests: educational reform or quality indicator. In B. R. Gifford (Ed.), Test policy and test performance: education, language and culture. Boston: Kluwer Academic Publishers.

Barnard, J. J. (1999). Item analysis in test construction. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment. Amsterdam; New York: Pergamon.

Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: fundamental measurement in the human sciences (2nd ed.). New Jersey: Lawrence Erlbaum Associates, Publishers.

Brown, H. D. (2004). Language assessment: principles and classroom practices. New York: Longman.

Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items (Vol. 4). Thousand Oaks, California: SAGE Publications.

Carlton, S. T., & Harris, A. M. (1989). Characteristics associated with differential item performance on the SAT: gender and majority/minority group comparisons. Unpublished manuscript.

Carr, N. T. (2006). The factor structure of test task characteristics and examinee performance. Language Testing, 23(3), 269-289.

Carrell, P. L. (1988). Introduction: interactive approach to second language reading. In P. L. Carrell, J. Devine & D. Eskey (Eds.), Interactive approach to second language reading. Cambridge: Cambridge University Press.

Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency tests. Language Testing, 2(2), 155-163.

Code of Fair Testing Practices in Education. (2004). Washington, DC: Joint Committee on Testing Practices.

Collier, V. P. (1989). How long? A synthesis of research on academic achievement in a second language. TESOL Quarterly, 23(3), 509-531.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.

Davey, B. (1988). Factors affecting the difficulty of reading comprehension items for successful and unsuccessful readers. Journal of Experimental Education, 56(2), 67-76.

Davey, B., & LaSasso, C. (1984). The interaction of reader and task factors in the assessment of reading comprehension. Journal of Experimental Education, 52(4), 199-206.

Davey, B., LaSasso, C., & Macready, G. (1983). A comparison of reading comprehension task performance for deaf and hearing readers. Journal of Speech and Hearing Research, 26, 622-628.

Davey, B., & Macready, G. (1985). Prerequisite relations among inference tasks for skilled and less-skilled reader. Journal of Educational Psychology, 77, 539-552.

de Ayala, R. J. (2009). The theory and practice of item response theory. New York: The Guilford Press.

Devine, T. G. (1989). Teaching reading in the elementary school: from theory to practice. Massachusetts, US: Allyn and Bacon, Inc.

Dolittle, A., & Welch, C. (1989). Gender differences in performance on a college level achievement test. Iowa City, IA: American College Testing Programme.

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). New Jersey: Lawrence Erlbaum Associates, Publishers.

Drum, P. A., Calfee, R. C., & Cook, L. K. (1981). The effects of surface structure variables on performance in reading comprehension test. Reading Research Quarterly, 16(14), 486-514.

Ebel, R. L. (1982). Proposed solutions to two problems of test construction. Journal of Educational Measurement, 19, 276-278.

Ebel, R. L., & Frisbie, D. A. (1991). Essentials of educational measurement (5th ed.). Englewood Cliffs, N.J: Prentice Hall.

Elder, C. (1996). The effect of language background on foreign language test performance: the case of Chinese, Italian, and modern Greek. Language Learning, 46, 233-282.

Embretson, S., & Wetzel, C. D. (1987). Component latent trait models for paragraph comprehension. Applied Psychological Measurement, 11, 175-193.

Fletcher, M. J. (2006). Measuring reading comprehension. Scientific Studies of Reading, 10, 323-330.

Freedle, R., & Kostin, I. (1993). The prediction of TOEFL reading item difficulty: implications for construct validity. Language Testing, 10(2), 133-167.

Freedle, R., & Kostin, I. (1994). Can multiple-choice reading tests be construct-valid? A reply to Katz, Lautenschlager, Blackburn and Harris. Psychological Science, 5, 107-110.

Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). New York: MacMillan Publishers.

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38-47.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, California: SAGE Publications.

Hamzah, M. S. G., & Abdullah, S. K. (2009). Analysis on metacognitive strategies in reading and writing among Malaysian ESL learners in four education institutions. European Journal of Sciences, 11(4), 676-683.

Henning, G. (1984). Advantages of latent trait measurement in language testing. Language Testing, 1(2), 123-133.

Henning, G. (1987). A guide to language testing: development, evaluation, research. Cambridge, Mass: Newbury House Publishers.

Henning, G., Hudson, T., & Turner, J. (1985). Item response theory and the assumption of unidimensionality for language tests. Language Testing, 2(2), 141-154.

Ibrahim, A. H. (2005). The effect of purposeful questioning technique in reading performance. Paper presented at the 2nd National Seminar on Second/Foreign Language Learners and Learning.

Ibrahim, A. H. (2006). The process and problems of reading. Masalah Pendidikan, 115-129.

Jalaluddin, N. H., Awal, N. M., & Bakar, K. A. (2009). Linguistics and environment in English language learning: towards the development of quality human capital. European Journal of Sciences, 9(4), 627-642.

Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension: individual differences in working memory. Psychological Review, 99, 122-149.

Katz, S., & Lautenschlager, G. J. (1994). Answering reading comprehension questions without passages on the SAT-I, ACT and GRE. Educational Assessment, 2, 295-308.

Katz, S., & Lautenschlager, G. J. (2001). The contribution of passage and no-passage factors to item performance on the SAT reading task. Educational Assessment, 7(2), 165-176.

Kaur, S., & Thiyagarajah, R. (1999). The English reading habits of ELLs students in University Science Malaysia. Paper presented at the 6th International Literacy and Education Research Network Conference on Learning.

Keeves, J. P., & Alagumalai, S. (1999). New approaches to measurement. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 23-42). Amsterdam; New York: Pergamon.

Kirsch, I., Jong, J. d., LaFontaine, D., McQueen, J., Mendelovits, J., & Monseur, C. (2002). Reading for change: performance and engagement across countries: results from PISA 2000. Paris: OECD.

Klapper, J. (1992). Reading in a foreign language; theoretical issues. Language Learning, 1(5), 53-56.

Kunnan, A. J. (1990). DIF in native language and gender groups in an ESL placement test. TESOL Quarterly, 24, 741-746.

Kunnan, A. J. (2000). Fairness and justice for all. In A. J. Kunnan (Ed.), Fairness and validation in language assessment (pp. 1-14). Cambridge: Cambridge University Press.

Lord, F. M. (1980). Application of item response theory to practical testing problems. New Jersey: Lawrence Erlbaum Associates Publishers.

McKay, P. (2006). Assessing young language learners. Cambridge: Cambridge University Press.

McKenna, M. C., & Stahl, K. A. D. (2009). Assessment for reading instruction (2nd ed.). New York: The Guilford Press.

McNamara, T., & Roever, C. (2006). Language testing: the social dimension. Malden, MA: Blackwell Publishing.

McNamara, T. F. (1996). Measuring second language performance. London; New York: Longman.

McNamara, T. F. (2000). Language testing. New York: Oxford University Press.

Mertler, C. A. (2007). Interpreting standardized test scores: strategies for data-driven instructional decision making. Los Angeles: SAGE Publications.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.

Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241-256.

Mosenthal, P. (1996). Understanding the strategies of document literacy and their conditions of use. Journal of Educational Psychology, 88, 314-332.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill, Inc.

O'Neill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are associated with differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255-276). Hillsdale, NJ: Lawrence Erlbaum Associates.

Osterlind, S. J., & Everson, H. T. (2009). Differential Item Functioning (2nd ed.). Thousand Oaks, CA: SAGE Publications.

Ozuru, Y., Best, R., Bell, C., Witherspoon, A., & McNamara, D. S. (2007). Influence of question format and text availability on the assessment of expository text comprehension. Cognition and Instruction, 25(4), 399-438.

Ozuru, Y., Rowe, M., O'Reilly, T., & McNamara, D. S. (2008). Where's the difficulty in standardized reading tests: the passage or the question? Behaviour Research Methods, 40(4), 1001-1015.

Pae, T.-I. (2004). DIF for examinees with different academic background. Language Testing, 21(1), 53-73.

Pearson, P. D., & Johnson, D. D. (1978). Teaching reading comprehension. New York: Holt, Rinehart and Winston.

Perfetti, C. (1985). Reading ability. New York: Oxford University Press.

Perkins, K., & Miller, L. D. (1984). Comparative analyses of English as a second language reading comprehension data: classical test theory and latent trait measurement. Language Testing, 1(1), 20-31.

Popham, W. J. (2000). Modern educational measurement: practical guidelines for educational leaders (3rd ed.). Boston: Allyn and Bacon.

Pumfrey, P. D. (1976). Reading: tests and assessment techniques. London: Hodder and Stoughton.

Ramaiah, M., & Nambiar, M. K. (1993). Do undergraduates understand what they read: an investigation into the comprehension monitoring of ESL students through the use of textual anomalies. Journal of Educational Research, 15, 95-106.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut.

Reynolds, C. R., Livingston, R. B., & Wilson, V. (2009). Measurement and assessment in education. Upper Saddle River, N.J: Pearson Merrill.

Rupp, A. A., Garcia, P., & Jamieson, J. (2001). Combining multiple regression and CART to understand difficulty in second language reading and listening comprehension test. International Journal of Testing, 1(3 & 4), 185-216.

Sarudin, I., & Zubairy, A. M. (2008). Assessment of language proficiency of university students. Paper presented at the 34th International Association for Educational Assessment (IAEA).

Scheuneman, J. D. (1982). A posteriori analyses of biased items. In R. A. Berk (Ed.), Handbook of methods for detecting test bias (pp. 180-197). Baltimore and London: The Johns Hopkins University Press.

Scheuneman, J. D., & Gerritz, K. (1990). Using differential item functioning procedures to explore sources of item difficulty and group performance characteristics. Journal of Educational Measurement, 27(2), 109-131.

Schwartz, S. (1984). Measuring reading competence: a theoretical-prescriptive approach. New York: Plenum Press.

Sheehan, K. M., & Ginther, A. (2001). What do passage-based multiple-choice verbal reasoning items really measure? An analysis of the cognitive skills underlying performance on the current TOEFL reading section. Paper presented at the 2001 Annual Meeting of the National Council of Measurement in Education.

Shepard, L. A., Camilli, G., & Averill, M. (1981). Comparison of procedures for detecting test-item bias with both internal and external ability criteria. Journal of Educational Statistics, 6, 317-375.

Snow, C. E. (2003). Assessment of reading comprehension. In A. P. Sweet & C. E. Snow (Eds.), Rethinking reading comprehension (pp. 192-218). New York: Guilford.

Stephanou, A., Anderson, P., & Urbach, D. (2008). PAT-R Progressive Achievement Tests in Reading: comprehension, vocabulary and spelling (4th ed.). Camberwell, Victoria: Australian Council for Educational Research.

Twist, L., & Sainsbury, M. (2009). Girl friendly? Investigating the gender gap in national reading tests at age 11. Educational Research, 51(2), 283-297.

Weaver, C. A., & Kintsch, W. (1991). Expository text. In R. Barr, M. L. Kamil, P. Mosenthal & P. D. Pearson (Eds.), Handbook of reading research (Vol. 2). New York: Longman.

Woods, A., & Baker, R. (1985). Item response theory. Language Testing, 2(2), 117-140.

Wright, B. D. (1999). Rasch measurement models. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment. Amsterdam; New York: Pergamon.

Wu, M. (2010). Using item response theory as a tool in educational measurement. Unpublished book chapter. University of Melbourne.

Wu, M. L., & Adams, R. J. (2008). Properties of Rasch residual fit statistics. Unpublished paper.

Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest Version 2.0. Camberwell, Victoria: ACER Press.

Zubairi, A. M., & Kassim, N. L. A. (2006). Classical and Rasch analyses of dichotomously scored reading comprehension test items. Malaysian Journal of ELT Research, 2, 1-20.

Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning: logistic regression modelling as unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defence.

APPENDICES

APPENDIX A

BAND DESCRIPTOR OF MUET

| AGGREGATED SCORE | BAND | USER | COMMUNICATIVE ABILITY | COMPREHENSION | TASK PERFORMANCE |
|------------------|------|------|-----------------------|---------------|------------------|
| 260 – 300 | 6 | Highly proficient user | Very fluent; highly appropriate use of language; hardly any grammatical error | Very good understanding of language and context | Very high ability to function in the language |
| 220 – 259 | 5 | Proficient user | Fluent; appropriate use of language; few grammatical errors | Good understanding of language and context | High ability to function in the language |
| 180 – 219 | 4 | Satisfactory user | Generally fluent; generally appropriate use of language; some grammatical errors | Satisfactory understanding of language and context | Satisfactory ability to function in the language |
| 140 – 179 | 3 | Modest user | Fairly fluent; fairly appropriate use of language; many grammatical errors | Fair understanding of language and context | Fair ability to function in the language |
| 100 – 139 | 2 | Limited user | Not fluent; inappropriate use of language; very frequent grammatical errors | Limited understanding of language and context | Limited ability to function in the language |
| Below 100 | 1 | Very limited user | Hardly able to use the language | Very limited understanding of language and context | Very limited ability to function in the language |

APPENDIX B

CODING SCHEME OF TEST ITEMS

| Type of variable | Predictors | Characteristics of item | Code |
|------------------|------------|-------------------------|------|
| Passage | Length of the passage | More than 25 lines | 1 |
| | | More than 35 lines | 2 |
| | | More than 45 lines | 3 |
| Question type | Type of question | Retrieving directly stated information (RI) | 1 |
| | | Interpreting explicit information (IE) | 2 |
| | | Interpreting by making inference (II) | 3 |
| | | Reflecting on texts (RF) | 4 |
| | Inference level | No inference | 1 |
| | | Paraphrase inference | 2 |
| | | Across-sentence (bridging) inference | 3 |
| | | Macrostructure (gist) inference | 4 |
| | | Reader-based (prior knowledge) inference | 5 |
| Question format | Structure of the responses/alternatives | Short lists of 1-4 words | 1 |
| | | Simple phrases | 2 |
| | | Complex phrases containing clauses that could stand alone as sentences | 3 |
| | Number of options | 3-option | 1 |
| | | 4-option | 2 |
| | Plausibility of the distractors | No response options are plausible | 1 |
| | | One or more options are plausible | 2 |
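As an illustration of how the coding scheme above could be applied, the following minimal Python sketch stores the codes as lookup tables and converts one item's descriptive features into numeric predictor codes. The dictionary keys, the short level labels, the function name and the example item are hypothetical conveniences introduced here; only the predictors and their numeric codes come from the table.

```python
# Numeric codes follow the coding scheme in this appendix; the short
# labels used as dictionary keys are hypothetical abbreviations.
CODING_SCHEME = {
    "passage_length": {">25 lines": 1, ">35 lines": 2, ">45 lines": 3},
    "question_type": {"RI": 1, "IE": 2, "II": 3, "RF": 4},
    "inference_level": {
        "none": 1,
        "paraphrase": 2,
        "bridging": 3,
        "gist": 4,
        "prior knowledge": 5,
    },
    "option_structure": {"short list": 1, "simple phrase": 2, "complex phrase": 3},
    "number_of_options": {3: 1, 4: 2},
    "distractor_plausibility": {"none plausible": 1, "one or more plausible": 2},
}


def encode_item(features):
    """Translate one item's descriptive features into numeric predictor codes."""
    codes = {}
    for predictor, levels in CODING_SCHEME.items():
        value = features[predictor]
        if value not in levels:
            raise ValueError(f"Unknown level {value!r} for predictor {predictor!r}")
        codes[predictor] = levels[value]
    return codes


# Hypothetical item, for illustration only (not an actual MUET item).
example_item = {
    "passage_length": ">35 lines",
    "question_type": "II",
    "inference_level": "bridging",
    "option_structure": "complex phrase",
    "number_of_options": 4,
    "distractor_plausibility": "one or more plausible",
}

print(encode_item(example_item))
```

A table of such coded items, one row per item, is the kind of input a regression of item difficulty on item features would require.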

APPENDIX C

CONQUEST COMMAND FILES

1) Command File for CTT and Rasch Item Analysis

datafile readingmuet.txt;
format id 1-4 state 6-17 gender 19 ethnic 23 responses 27-71;
key BCBAABCABCBCACCABBBABCACBCABBCCCABBCDDACBDDAC!1;
group gender;
model item;
estimate;
show !estimate=latent >> readingmuet2.shw;
itanal >> readingmuet2.itn;

2) Command File for DIF Gender

datafile readingmuet.txt;
format id 1-4 state 6-17 gender 19 ethnic 23 responses 27-71;
key BCBAABCABCBCACCABBBABCACBCABBCCCABBCDDACBDDAC!1;
keepcases L,P!gender;
model gender + item + gender*item;
estimate;
show !estimate=latent >> DIFreadingmuet3.shw;
itanal >> DIFreadingmuet3.itn;
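To make the data layout implied by the format and key statements concrete, the following minimal Python sketch reads one fixed-width record and scores its 45 responses against the key. It is not part of the ConQuest analysis; the function names and the sample record are hypothetical, while the column positions, the key string and the gender code P are taken from the command files.

```python
# Answer key copied from the `key` statement (45 items).
KEY = "BCBAABCABCBCACCABBBABCACBCABBCCCABBCDDACBDDAC"

# Column positions copied from the `format` statement (1-based, inclusive).
FIELDS = {
    "id": (1, 4),
    "state": (6, 17),
    "gender": (19, 19),
    "ethnic": (23, 23),
    "responses": (27, 71),
}


def parse_record(line):
    """Split one fixed-width data line into its named fields."""
    record = {name: line[start - 1:end] for name, (start, end) in FIELDS.items()}
    record["state"] = record["state"].strip()  # drop right-hand padding
    return record


def score_responses(responses, key=KEY):
    """Score each response 1 if it matches the key, else 0."""
    return [1 if r == k else 0 for r, k in zip(responses, key)]


# Hypothetical record built to match the layout (not a real examinee):
# the first response is deliberately wrong, the rest follow the key.
sample_line = (
    "0001" + " " + "SELANGOR".ljust(12) + " " + "P" + "   " + "M" + "   "
    + "A" + KEY[1:]
)

record = parse_record(sample_line)
scores = score_responses(record["responses"])
print(record["id"], record["state"], record["gender"], sum(scores))
```

The ethnic code "M" and the state value in the sample line are placeholders; the actual codes used in readingmuet.txt are not listed in this appendix.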
